[OCWR] Week 8 - OpenCitations Weekly Report

Week from Sep 21 to Sep 27

Introduction

The eighth week started with the decision to change how citations are stored inside the corpus. From now on, every Citation will be identified by a sequential number (in exactly the same way as the other entities), while its OCI will be stored as an additional Identifier. This decision follows from the results presented in last week's report, which indicate that it is the best choice performance-wise. As a consequence, several modifications were made to the codebase during the week.

Report

Various modifications to the codebase

I removed the has_next_de method from the ReferencePointer class. This method was added last week in an effort to reach complete compatibility with the external code that previously relied upon graphlib.py. It’s now necessary to update jats2oc.py so that it follows the OCDM prescriptions more strictly.

I updated README.md with a suggestion that helps avoid a warning during documentation generation. I fixed a test method called test_is_context_of_rp_or_pl, in which the is_context_of_rp_or_pl method from the DiscourseElement class was invoked with an argument of the wrong datatype. I also updated the OCDM.pdf file (which is downloadable from here).

I organized the modules contained inside the package into a hierarchical tree structure, which better reflects the class hierarchies and is much more intuitive. Thanks to the use of __init__.py files, new ways to import the oc_ocdm classes are now available. For example, the following import statements now work:

from oc_ocdm import GraphSet
from oc_ocdm.support import get_short_name
from oc_ocdm.entities import Identifier, BibliographicEntity
from oc_ocdm.entities.bibliographic import AgentRole
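As a minimal sketch of the mechanism behind these shorter import paths, the example below builds a throwaway package on disk whose __init__.py files re-export a class from a nested module. The package name demo_pkg and its layout are hypothetical and are not the actual oc_ocdm structure; only the re-export technique is the point.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a tiny (hypothetical) package with re-exporting __init__.py files."""
    pkg = root / "demo_pkg"
    entities = pkg / "entities"
    entities.mkdir(parents=True)
    # The leaf module holds the actual class definition.
    (entities / "identifier.py").write_text("class Identifier:\n    pass\n")
    # Each __init__.py lifts the name one level up the package tree.
    (entities / "__init__.py").write_text(
        "from demo_pkg.entities.identifier import Identifier\n"
    )
    (pkg / "__init__.py").write_text("from demo_pkg.entities import Identifier\n")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        # Short path made available by the __init__.py re-exports.
        short = importlib.import_module("demo_pkg").Identifier
        # Full path to the module where the class is really defined.
        full = importlib.import_module("demo_pkg.entities.identifier").Identifier
        # Both paths resolve to the very same class object.
        same_class = short is full
    finally:
        sys.path.remove(tmp)
```

Because the re-exported name is just another reference to the same class object, code that uses the long import path and code that uses the short one remain fully interchangeable.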

I removed GraphSet's dependency on the Reporter class, which was actually unused. Since the reporter.py module was not used by any other module of the package, it was removed from the codebase.

Consequences of the new strategy for Citation identifiers

I opened a new GitHub issue (#8) in the opencitations/metadata repository, with the goal of adding the new option for the Citation identifier to the OCDM. Then, I had to add the missing method create_oci to the Identifier class. I updated the add_ci method from GraphSet: it now internally calls _add instead of _add_ci (which was completely removed from the class). The _add method itself had to be fixed in the part that instantiates provenance entities: it’s now aligned with the new strategy for provenance text files.
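The new identification strategy can be illustrated with a small, self-contained sketch. It is not the oc_ocdm implementation: the Identifier, Citation and CitationFactory classes below are simplified stand-ins, and the base IRI and the two numeric ids are placeholders. It only shows the idea that a citation gets its number from a sequential counter, while the OCI is attached as one more identifier.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List


@dataclass
class Identifier:
    """Simplified stand-in: a scheme plus a literal value."""
    scheme: str
    value: str


@dataclass
class Citation:
    """Simplified stand-in: a sequential number plus attached identifiers."""
    number: int
    identifiers: List[Identifier] = field(default_factory=list)


class CitationFactory:
    """Hypothetical helper that hands out sequential citation numbers."""

    def __init__(self, base_iri: str) -> None:
        self._base_iri = base_iri
        self._counter = count(1)  # same numbering scheme as any other entity

    def add_ci(self, citing_id: str, cited_id: str) -> Citation:
        # The citation itself is named by the next sequential number...
        citation = Citation(number=next(self._counter))
        # ...while the OCI (citing id and cited id joined by a dash)
        # is stored as an additional identifier.
        citation.identifiers.append(Identifier("oci", f"{citing_id}-{cited_id}"))
        return citation

    def iri(self, citation: Citation) -> str:
        return f"{self._base_iri}ci/{citation.number}"


factory = CitationFactory("https://example.org/")  # placeholder base IRI
first = factory.add_ci("0301", "0302")   # placeholder ids
second = factory.add_ci("0301", "0303")
```

With this design, everything that already works on sequentially numbered entities (including per-entity provenance counters) applies to citations without a special case.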

I also removed ProvSet's dependency on the ResourceFinder class, which was still missing from the codebase. To achieve this, I implemented the new method _retrieve_last_snapshot (together with its test method test_retrieve_last_snapshot), which replaces the retrieve_last_snapshot and add_prov_triples_in_filesystem methods from ResourceFinder. Previously, finding the last provenance snapshot of a particular entity required two steps. First, a local RDF storage file containing the data related to the entity was created (through add_prov_triples_in_filesystem). Then, a SPARQL query was performed on the local file (by invoking retrieve_last_snapshot), which returned the URI of the last active snapshot of the entity. Thanks to the new strategy for the Citation identifier, it’s now possible to read the current snapshot counter value directly from the prov_file_XX.txt text file, since every entity (including citations) now has its provenance counters stored in that way. This results in a significant performance improvement over the previous implementation, because no local RDF store has to be built and queried. At the end of the Week 6 report, I said that:

the generate_provenance method now only takes consistently around 1 to 1.5 seconds to execute with small test data

With the changes described above, the same test now takes only about 0.7 to 0.8 seconds to execute!
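The counter-based lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the file layout (one integer per line, with line N holding the snapshot count of entity number N), the function names, the file name prov_counters_demo.txt and the /prov/se/ IRI pattern are all hypothetical here and do not necessarily match what oc_ocdm does.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_snapshot_counter(counter_file: Path, entity_number: int) -> int:
    """Return the snapshot count stored on line `entity_number` (assumed layout)."""
    with counter_file.open() as handle:
        for line_number, line in enumerate(handle, start=1):
            if line_number == entity_number:
                # A blank line is treated as "no snapshots yet".
                return int(line.strip() or 0)
    return 0  # the file is shorter than expected: no snapshots recorded


def last_snapshot_iri(entity_iri: str, counter_file: Path,
                      entity_number: int) -> Optional[str]:
    """Build the IRI of the latest snapshot without running any SPARQL query."""
    counter = read_snapshot_counter(counter_file, entity_number)
    if counter == 0:
        return None
    # Assumed IRI pattern for snapshot entities.
    return f"{entity_iri}/prov/se/{counter}"


with tempfile.TemporaryDirectory() as tmp:
    counter_file = Path(tmp) / "prov_counters_demo.txt"  # hypothetical name
    counter_file.write_text("2\n1\n3\n")  # entities 1, 2 and 3
    counter_for_3 = read_snapshot_counter(counter_file, 3)
    missing = read_snapshot_counter(counter_file, 5)
    iri = last_snapshot_iri("https://example.org/br/3", counter_file, 3)
    none_iri = last_snapshot_iri("https://example.org/br/5", counter_file, 5)
```

A plain sequential read like this avoids both serializing triples to a local store and parsing a query, which is consistent with the kind of speed-up reported above.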