[OCWR] Week 16 - OpenCitations Weekly Report

Week from Nov 16 to Nov 22

Introduction

During the sixteenth week I intensified my work on the ShEx validation of OCDM-compliant RDF graphs. I also started a new implementation of the generate_provenance method of the ProvSet class, which is one of the fundamental goals of this project. The current implementation is only a draft and will need further adjustments before it is released in the oc_ocdm repository.

Report

ShEx validation

First, I needed a valid OCDM-compliant RDF graph to experiment with. Marilena Daquino, one of the main UNIBO researchers working on the OpenCitations project, sent me a folder containing the output of SPACIN via e-mail.

The output data coming from SPACIN is spread across various JSON-LD files, which must be parsed and joined into a single rdflib.Graph object. I therefore wrote a simple Python script that does all of this automatically: it recursively searches that folder for files and parses them one by one. The result is a graph.ttl RDF file, stored locally, that I can use for my validation trials.
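The merging step described above can be sketched roughly as follows. This is not the actual script. The function names, the "*.json" pattern and the folder name in the usage note are assumptions made for illustration. Older rdflib releases also need the separate rdflib-jsonld plugin for the "json-ld" format, and files that wrap their triples in named graphs may need a ConjunctiveGraph instead of a plain Graph.

```python
from pathlib import Path


def find_jsonld_files(root, pattern="*.json"):
    """Return every file under root (searched recursively) that matches pattern."""
    # The "*.json" default is an assumption about how the files are named.
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


def merge_into_graph(paths):
    """Parse each JSON-LD file and accumulate its triples in one rdflib.Graph."""
    # Imported lazily so that file discovery also works without rdflib installed.
    from rdflib import Graph

    graph = Graph()
    for path in paths:
        # Every call adds the triples of one file to the same graph object.
        graph.parse(str(path), format="json-ld")
    return graph


def build_graph_file(root, destination="graph.ttl"):
    """Merge all JSON-LD files found under root and store them as Turtle."""
    graph = merge_into_graph(find_jsonld_files(root))
    graph.serialize(destination=destination, format="turtle")
    return graph
```

Calling build_graph_file("spacin_output") (a hypothetical folder name) would write graph.ttl to the current directory.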

I then created another script, a work-in-progress ShEx validator that will be adjusted over the next weeks. I made use of the PyShEx package, which unfortunately is still in an alpha stage. This became evident the first time I tried to validate the graph.ttl file. I tried two different variants of the ShExC file that I wrote last week, with no success at all. If I leave the ShExC file as it is, the script stops early with a RecursionError message. If I instead add the CLOSED shape constraint to every bibliographic entity, almost every entity that undergoes validation is marked as invalid (even though, in theory, it shouldn’t be). I’ll continue investigating these issues during the next week.
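A validation run of this kind could look roughly like the sketch below. It is not the validator from this report. It assumes the ShExEvaluator interface shown in the PyShEx README (rdf, schema, focus and start arguments, with evaluate() returning results that expose focus, result and reason). The helper names are hypothetical. Evaluating one focus node at a time is only one way to keep a single RecursionError from aborting the whole run; it does not address the underlying cause.

```python
from types import SimpleNamespace


def validate_nodes(rdf, schema, focus_iris, start_shape):
    """Validate each focus node separately, recording RecursionError as a failure."""
    # Lazy import: PyShEx is a third-party package (pip install PyShEx).
    from pyshex import ShExEvaluator

    evaluator = ShExEvaluator(rdf=rdf, schema=schema, start=start_shape)
    outcomes = []
    for iri in focus_iris:
        try:
            outcomes.extend(evaluator.evaluate(focus=iri))
        except RecursionError as exc:
            # Keep going: store a placeholder outcome for the node that failed.
            outcomes.append(
                SimpleNamespace(focus=iri, result=False,
                                reason=f"RecursionError: {exc}")
            )
    return outcomes


def failures(outcomes):
    """Return (focus, reason) pairs for every node that did not validate."""
    return [(str(o.focus), o.reason) for o in outcomes if not o.result]
```

Here rdf would be the Turtle text of graph.ttl and schema the ShExC text. Printing failures(...) for a handful of entities is a cheap way to see which constraint each rejected node trips on.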

Various fixes

I’ve also made some fixes and cleanups to the oc_ocdm code. I removed some unused methods and class fields from ProvEntity: they were remnants of very old code that reflected an older version of the OCDM specification. In particular, these methods/class fields were removed:

  • class fields iri_prov_agent, iri_association, iri_curator and iri_source_provider;
  • get_types, create_creation_activity, create_update_activity, create_merging_activity and remove_type (together with class fields iri_create, iri_modify, iri_replace and iri_activity);
  • generates (together with class field iri_was_generated_by);
  • invalidates (together with class field iri_was_invalidated_by);
  • involves_agent_with_role (together with class field iri_qualified_association);
  • has_role_type (together with class field iri_had_role);
  • has_role_in (together with class field iri_associated_agent).

Then, I fixed the line endings of every file in the repository: the entire repository now follows the LF convention. The self.resp field of ProvSet was redundant, since its role had been taken over by self.resp_agent, so it was removed.
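For reference, a line-ending conversion like the one mentioned above can be scripted in a few lines. This is only a sketch; the report does not say which tool was actually used. The function names and the list of suffixes are assumptions.

```python
from pathlib import Path


def to_lf(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def normalize_tree(root, suffixes=(".py", ".md", ".txt")):
    """Rewrite matching files under root in place; return the paths that changed."""
    changed = []
    for path in Path(root).rglob("*"):
        # Only files with the listed suffixes are touched, so binary files are skipped.
        if path.is_file() and path.suffix in suffixes:
            original = path.read_bytes()
            converted = to_lf(original)
            if converted != original:
                path.write_bytes(converted)
                changed.append(path)
    return changed
```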

Finally, I fixed the implementation of both the add_se and _add_prov methods of ProvSet. The _add_prov method was simply cleaned of legacy code that allowed a snapshot to be associated with multiple entities. This is no longer possible, hence the code was simplified and the new method signature accepts a single prov_subject parameter instead of the previous list_of_entities.

Besides propagating this signature change to add_se, I modified that method in the same way as I did for all the add_* methods of GraphSet during the previous week, enforcing a singleton-like behaviour for provenance snapshots as well.
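The singleton-like behaviour can be illustrated with a toy model. This is not the oc_ocdm code: the ToySnapshot and ToyProvSet classes, the res parameter, the res_to_entity dictionary and the IRI pattern are all assumptions made for the example. Only the single prov_subject parameter comes from the change described above.

```python
class ToySnapshot:
    """Stand-in for a provenance snapshot bound to exactly one subject."""

    def __init__(self, prov_subject, res):
        self.prov_subject = prov_subject
        self.res = res


class ToyProvSet:
    """Hypothetical, heavily simplified model of the add_se behaviour."""

    def __init__(self):
        self.res_to_entity = {}  # snapshot IRI -> ToySnapshot
        self._counters = {}      # subject IRI -> last snapshot number used

    def add_se(self, prov_subject, res=None):
        # Singleton-like behaviour: a known IRI yields the existing object.
        if res is not None and res in self.res_to_entity:
            return self.res_to_entity[res]
        if res is None:
            # Mint the next unused snapshot IRI for this subject.
            number = self._counters.get(prov_subject, 0)
            while True:
                number += 1
                res = f"{prov_subject}/prov/se/{number}"
                if res not in self.res_to_entity:
                    break
            self._counters[prov_subject] = number
        snapshot = ToySnapshot(prov_subject, res)
        self.res_to_entity[res] = snapshot
        return snapshot
```

With this model, asking twice for the same snapshot IRI returns the very same object, while omitting res creates a new snapshot for the given subject.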