[OCWR] Week 2 - OpenCitations Weekly Report

Week from Aug 10 to Aug 16

Introduction

The goal of the second week was to do a complete code refactoring of graphlib.py from the CCC repository, a Python module which reflects the OCDM (OpenCitations Data Model) and makes it possible to easily construct Python objects that can later be stored persistently as RDF graphs through the use of storer.py. Hence, I firstly had to analyze the module in order to understand its inner workings.

The original graphlib.py module defines four classes: GraphEntity, GraphSet, ProvEntity and ProvSet. The latter two classes implement the data model section related to the provenance of the stored entities: since – at least at this stage of the project – we are still unsure about how to improve the logic behind these classes (as it’s the main goal of the entire 5-months period), I skipped them and I only focussed on the other ones.

Report

Code refactoring

Strictly following the latest version of the OCDM described in this document, I designed a new class hierarchy that represents more closely all the different entities stored by OpenCitations. Previously, GraphEntity was used as a general abstraction of a generic entity described by the OCDM. Now, GraphEntity is the base class of some more specific classes which inherit from it only a handful of methods (among them, the constructor): because of this, every other method had to be moved into the right subclass according to the OCDM definitions.

Download link: click here to download a simple UML graph that shows the new class hierarchy.

Then, it was GraphSet’s turn: this class acts both as a collection of GraphEntity objects and as a factory to easily instantiate them. In fact, it provides a set of factory methods such as add_an(...), add_br(...), add_ci(...), etc. : while previously these methods returned a generic GraphEntity object, now they produce an object of the right subclass. This brings a lot of advantages to the library since it leaves less room for errors, as it’s now impossible to call a method of class A on a class B instance, while previously every method was defined in the same GraphEntity class. This required some changes to the semantics of the _add(...) and _add_ci(...) methods and the removal of the – now useless – _generate_entity(...) (both from GraphSet and ProvSet).

In the end, I created a new personal GitHub repository which contains all of the new code. It’s called oc_ocdm and it’s organized as a list of files, one for each Python class. It was created and initialized through the use of Poetry, a useful tool which makes really easy to produce and publish Python packages on online repositories such as PyPi.

In order to make this package self-contained, I had to extrapolate some external functions defined inside support/reporter.py and support/support.py from the CCC repository: some classes of the package relied on those functions, so I included them in the package inside a reporter.py and a support.py files. The only other external dependency left is the one from rdflib itself.

Documentation

During the code refactoring process, I took care of writing down some comments for each method. As for now, they only act as reminders to myself, so that I can know what every method is supposed to do. Unfortunately, they are really concise and not very useful for documentation purposes. In the next days, it will be very important to enhance them, mainly transforming them into proper Python docstrings: this will also enable, in the future, automatic tools (such as PyDoc or Doxygen) to build a documentation for the library.

Example:

# HAS IDENTIFIER
# <self.res> DATACITE:hasIdentifier <id_res>
def has_id(self, id_res: Identifier) -> None:
    self.g.add((self.res, GraphEntity.has_identifier, URIRef(str(id_res))))

Type hints for type checking

I chose to add Python 3.7+ type hints through the code as a way to enforce type safety inside the library. For example, the GraphSet.add_br(...) method signature became like this:

def add_br(self, resp_agent: str, source_agent: str = None, source: str = None,
            res: URIRef = None) -> BibliographicResource:

This enables automatic tools such as MyPy to statically analize type soundness of the code, acting as an additional testing level which is able to notify the developers in cases of types misuse, which often is the root cause of unexpected problems.

Adding those annotations caused some circular dependencies to arise where two classes A and B referenced each other through type annotations (simply reflecting a circular dependency between the two OCDM entities). Nothing to be worried about, actually, since the involved import statements are expected to be used only by static type checkers. I fixed this inconvenience by using the following logic where needed:

from typing import TYPE_CHECKING
if TYPE_CHECKING:
    from oc_ocdm.bibliographic_resource import BibliographicResource

(typings.TYPE_CHECKING is always False at runtime)

Tests

As for unit testing, I arranged a tests folder inside the oc_ocdm repository which will contain a test module for each oc_ocdm’s class. I still haven’t written any test at this moment: it’s surely something that needs to be done in the following week.