[OCWR] Week 2 - OpenCitations Weekly Report
Week from Aug 10 to Aug 16
Introduction
The goal of the second week was a complete refactoring of graphlib.py from the CCC repository, a Python module which reflects the OCDM (OpenCitations Data Model) and makes it easy to construct Python objects that can later be stored persistently as RDF graphs through storer.py. Hence, I first had to analyze the module in order to understand its inner workings.
The original graphlib.py module defines four classes: GraphEntity, GraphSet, ProvEntity and ProvSet. The latter two implement the part of the data model that covers the provenance of the stored entities. Since, at least at this stage of the project, we are still unsure how to improve the logic behind these classes (which is the main goal of the entire five-month period), I skipped them and focused only on the other two.
Report
Code refactoring
Strictly following the latest version of the OCDM described in this document, I designed a new class hierarchy that represents more closely the different entities stored by OpenCitations. Previously, GraphEntity was used as a single abstraction of any entity described by the OCDM. Now, GraphEntity is the base class of a set of more specific classes, which inherit only a handful of methods from it (among them, the constructor). Because of this, every other method had to be moved into the appropriate subclass according to the OCDM definitions.
Download link: click here to download a simple UML graph that shows the new class hierarchy.
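A minimal sketch of the idea behind the new hierarchy, not the actual oc_ocdm code. To keep it runnable without rdflib, a plain list of tuples stands in for the RDF graph. The subclass ResponsibleAgent, the method names has_title and has_given_name, the helper _add_triple and the predicate strings are all assumptions made for illustration.

```python
from typing import List, Tuple

# Simplified stand-in for an RDF triple (subject, predicate, object).
Triple = Tuple[str, str, str]


class GraphEntity:
    """Base class: keeps only what every entity shares."""

    def __init__(self, res: str) -> None:
        self.res = res              # URI of this entity
        self.g: List[Triple] = []   # stand-in for an rdflib graph

    def _add_triple(self, predicate: str, obj: str) -> None:
        # Hypothetical helper shared by all subclasses.
        self.g.append((self.res, predicate, obj))


class BibliographicResource(GraphEntity):
    def has_title(self, title: str) -> None:
        # Only bibliographic resources expose this method.
        self._add_triple("dcterms:title", title)


class ResponsibleAgent(GraphEntity):
    def has_given_name(self, name: str) -> None:
        # Only responsible agents expose this method.
        self._add_triple("foaf:givenName", name)


br = BibliographicResource("https://example.org/br/1")
br.has_title("A title")

ra = ResponsibleAgent("https://example.org/ra/1")
ra.has_given_name("Ada")
```

With this layout, ra.has_title(...) raises an AttributeError at runtime, because the method now lives only in the subclass it belongs to.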
Then it was GraphSet's turn: this class acts both as a collection of GraphEntity objects and as a factory to instantiate them easily. It provides a set of factory methods such as add_an(...), add_br(...), add_ci(...), etc. Previously these methods returned a generic GraphEntity object; now they produce an object of the right subclass. This leaves less room for errors: a method of class A is no longer available on an instance of class B, whereas previously every method was defined in the same GraphEntity class. The change required some adjustments to the semantics of the _add(...) and _add_ci(...) methods and the removal of the now useless _generate_entity(...) method (from both GraphSet and ProvSet).
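The factory behaviour described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the real GraphSet: the factory methods here take no arguments, the helper _next_res and the counter-based URI scheme are hypothetical, and the entity classes are empty stand-ins.

```python
from typing import Dict, List


class GraphEntity:
    def __init__(self, res: str) -> None:
        self.res = res


class BibliographicResource(GraphEntity):
    pass


class Citation(GraphEntity):
    pass


class GraphSet:
    """Collection of entities that also acts as their factory."""

    def __init__(self, base_iri: str) -> None:
        self.base_iri = base_iri
        self.entities: List[GraphEntity] = []
        self._counters: Dict[str, int] = {}

    def _next_res(self, short_name: str) -> str:
        # Hypothetical helper: builds a new URI such as <base>br/1.
        self._counters[short_name] = self._counters.get(short_name, 0) + 1
        return f"{self.base_iri}{short_name}/{self._counters[short_name]}"

    def add_br(self) -> BibliographicResource:
        # Returns the specific subclass instead of a generic GraphEntity.
        entity = BibliographicResource(self._next_res("br"))
        self.entities.append(entity)
        return entity

    def add_ci(self) -> Citation:
        entity = Citation(self._next_res("ci"))
        self.entities.append(entity)
        return entity


graph_set = GraphSet("https://example.org/")
br = graph_set.add_br()
ci = graph_set.add_ci()
```

Because each factory method declares its precise return type, a static type checker can tell that br is a BibliographicResource and ci is a Citation without any cast.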
Finally, I created a new personal GitHub repository which contains all of the new code. It's called oc_ocdm and it's organized as a set of files, one for each Python class. It was created and initialized with Poetry, a useful tool which makes it really easy to build Python packages and publish them on online repositories such as PyPI.
In order to make this package self-contained, I had to extract some external functions defined in support/reporter.py and support/support.py from the CCC repository: some classes of the package relied on those functions, so I included them in the package as reporter.py and support.py. The only external dependency left is rdflib itself.
Documentation
During the refactoring, I wrote a short comment for each method. For now, these comments only act as reminders to myself of what every method is supposed to do. Unfortunately, they are very concise and not very useful for documentation purposes. In the coming days it will be important to improve them, mainly by turning them into proper Python docstrings: this will also allow automatic tools (such as PyDoc or Doxygen) to build documentation for the library in the future.
Example:
# HAS IDENTIFIER
# <self.res> DATACITE:hasIdentifier <id_res>
def has_id(self, id_res: Identifier) -> None:
    self.g.add((self.res, GraphEntity.has_identifier, URIRef(str(id_res))))
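As an illustration of the planned move from comments to docstrings, the comment above could become something like the following. This is only a sketch: the classes are simplified stand-ins that avoid rdflib (a list of tuples replaces the graph and a plain string replaces the predicate), and the docstring wording and field style are assumptions, not the library's final documentation.

```python
class Identifier:
    """Simplified stand-in for the OCDM identifier entity."""

    def __init__(self, res: str) -> None:
        self.res = res

    def __str__(self) -> str:
        return self.res


class BibliographicResource:
    """Simplified stand-in for an entity that can own identifiers."""

    def __init__(self, res: str) -> None:
        self.res = res
        self.g = []  # list of (subject, predicate, object) tuples

    def has_id(self, id_res: Identifier) -> None:
        """Link this entity to one of its identifiers.

        Adds the triple ``<self.res> datacite:hasIdentifier <id_res>``.

        :param id_res: the identifier entity to associate with this entity
        """
        self.g.append((self.res, "datacite:hasIdentifier", str(id_res)))


br = BibliographicResource("https://example.org/br/1")
br.has_id(Identifier("https://example.org/id/1"))
```

Unlike a comment, the docstring is kept on the function object at runtime, so help(BibliographicResource.has_id) and documentation generators can read it.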
Type hints for type checking
I chose to add Python 3.7+ type hints throughout the code as a way to improve type safety inside the library. For example, the GraphSet.add_br(...) method signature now looks like this:
def add_br(self, resp_agent: str, source_agent: str = None, source: str = None,
           res: URIRef = None) -> BibliographicResource:
This enables automatic tools such as MyPy to statically analyze the type soundness of the code, acting as an additional level of testing that can notify developers of type misuse, which is often the root cause of unexpected problems.
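A small sketch of the kind of mistake a static checker can report. The add_br function below is a hypothetical, simplified stand-in for the real method and returns a plain string. It also spells the optional parameter as Optional[str]: recent mypy releases, as far as I know, no longer accept a bare "str = None" under their default settings.

```python
from typing import Optional


def add_br(resp_agent: str, source_agent: Optional[str] = None) -> str:
    """Hypothetical stand-in for GraphSet.add_br(...)."""
    if source_agent is None:
        return f"br created by {resp_agent}"
    return f"br created by {resp_agent} (source: {source_agent})"


label: str = add_br("agent-1")

# A static checker such as mypy is expected to reject the call below,
# because an int is passed where a str is declared. Python itself would
# run it without complaint, which is why the extra check is useful.
# add_br(42)
```

Running the checker over the file (for example with "mypy" followed by the file name) reports such calls before the code is ever executed.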
Adding those annotations caused some circular dependencies to arise where two classes A and B reference each other through type annotations (simply reflecting a circular dependency between the two OCDM entities). This is nothing to worry about, since the import statements involved are only needed by static type checkers. I worked around the problem by using the following logic where needed:
from typing import TYPE_CHECKING
if TYPE_CHECKING:
    from oc_ocdm.bibliographic_resource import BibliographicResource
(typing.TYPE_CHECKING is always False at runtime, so the guarded import is only seen by static type checkers.)
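One detail worth spelling out: since the guarded import never runs, the imported name does not exist at runtime, so annotations that use it have to be written as strings (or the module has to start with "from __future__ import annotations"). A single-file sketch of the pattern, where fractions.Fraction merely stands in for a class that would otherwise be imported circularly and describe is a hypothetical function:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; skipped when the program runs.
    from fractions import Fraction


def describe(value: "Fraction") -> str:
    # The annotation is a string, so Python does not need the name
    # Fraction to be defined when this function is created.
    return f"value = {value}"
```

At runtime the annotation stays the plain string "Fraction", while a type checker resolves it to the real class through the guarded import.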
Tests
As for unit testing, I set up a tests folder inside the oc_ocdm repository which will contain one test module for each class of oc_ocdm. I haven't written any tests yet: this definitely needs to be done in the coming week.
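A sketch of what one such test module might look like, using the standard unittest module. No test has been written yet, so everything here is an assumption: the BibliographicResource class is a heavily simplified stand-in defined inline (a real test would import it from oc_ocdm), and has_title is a hypothetical method name.

```python
import unittest


class BibliographicResource:
    """Hypothetical, heavily simplified stand-in for the oc_ocdm class."""

    def __init__(self, res: str) -> None:
        self.res = res
        self.g = []  # list of (subject, predicate, object) tuples

    def has_title(self, title: str) -> None:
        self.g.append((self.res, "dcterms:title", title))


class TestBibliographicResource(unittest.TestCase):
    def setUp(self) -> None:
        # A fresh entity for every test keeps the tests independent.
        self.br = BibliographicResource("https://example.org/br/1")

    def test_has_title_adds_one_triple(self) -> None:
        self.br.has_title("A title")
        expected = [("https://example.org/br/1", "dcterms:title", "A title")]
        self.assertEqual(self.br.g, expected)
```

Modules laid out this way inside the tests folder can be run with "python -m unittest discover", provided the files follow the default test*.py naming pattern.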