[OCWR] Week 17 - OpenCitations Weekly Report

Week from Nov 23 to Nov 29

Introduction

During the seventeenth week, I mainly worked on the ShEx validator and on the new implementation of generate_provenance.

Report

ShEx validation

I did other tests with the package PyShEx. In particular, I tried with four different variants of the ShExC file (which describes how a graph compliant with the OCDM should be shaped): in those variants I gradually relaxed more and more constraints, hoping to find a sweetspot that could allow me to validate a proper OCDM graph without any problems.

What I found this week was that only the most relaxed variant works without issues: to assert the validity of a certain entity, this variant doesn’t check if an entity referenced by the first one does itself fit into the required shape. Furthermore, it allows for extra triples not described in the OCDM specification. This of course is very limiting and forces us to look for another solution.

Adding the shape constraint also to every entity referenced by the one on focus causes a RecursionError, where the maximum recursion depth is exceeded. On the other hand, both with and without that constraint, adding the CLOSED shape constraint results in a lot of valid entities being evaluated as non valid, with hardly understandable error messages.

Anyway, I modified the validation script in order to bypass the lack of PyShEx support for QueryMap files. It now loops over all the subjects of the given rdflib.Graph, programmatically inferring their type and applying a validation step to each of them separately.

Various fixes

I fixed a type hint in the method signature of get_entity from the class GraphSet: the returned value type is now Optional because there’s the possibility for the method not to return anything.

I also changed the signature of the add_se method from the ProvSet class. The resp_agent, source_agent and source parameters are now obtained automatically from the provided prov_subject and cannot be passed as arguments anymore.

New generate_provenance implementation

A new implementation of the generate_provenance method was needed since the beginning, because the previous one was developed in a context in which only the “add a new entity to the triplestore” operation was contemplated.

The new implementation accepts as argument only a float value representing the timestamp which should be interpreted as the “current time” by the algorithm. Every other parameter was removed as it can automatically obtained by the algorithm itself.

The generate_provenance method now handles every possible situation that may occur by exploiting at its full the new oc_ocdm APIs added in the previous weeks. From now on, this method will be able to generate meaningful provenance snapshots of any OCDM entity which was created out of nothing, modified, merged with other entities or even deleted from the triplestore.

By invoking generate_provenance, a single provenance snapshot will be added to the triplestore for each considered entity. This means that the complexities of the eventual sequence of operations applied locally to the entity will be flattened into an overall most priority operation. For example:

  • if the user creates a new entity and then it deletes it before storing it into the triplestore, no snapshot will be generated;
  • on the other hand, if a user imports an existing entity into a GraphSet and then he/she applies some modifications to it while also merging it with other entities, the produced snapshot will contain information both for the edits and for the merge but it will be a “merge snapshot” since that is the higher priority operation between the two.