[OCWR] Week 16 - OpenCitations Weekly Report
Week from Nov 16 to Nov 22
Introduction
During the sixteenth week I focused more heavily on the ShEx validation of
OCDM-compliant RDF graphs. I also started working on a new implementation of the
generate_provenance method of the ProvSet class, which is one of the fundamental goals of this
project. The implementation I obtained is only a draft and will need further adjustments before
its release in the oc_ocdm repository.
Report
ShEx validation
First things first, I needed a valid OCDM-compliant RDF graph to experiment with. A
folder containing the output of SPACIN was sent to me via e-mail by
Marilena Daquino, one of the main UNIBO researchers working on the
OpenCitations project.
The output data coming from SPACIN is spread across various JSON-LD files, which must be
parsed and joined into a single rdflib.Graph object. I therefore wrote a simple Python script that does
all of this automatically, recursively searching for files in that folder and parsing them one
by one. The result is a graph.ttl RDF file, which is stored locally and which I can use for my
validation trials.
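The merging step described above could look roughly like the following sketch. It is not the actual script: the function names and the file extensions are assumptions made for illustration, and it assumes rdflib 6 or later (older versions need the separate rdflib-jsonld plugin to parse JSON-LD).

```python
from pathlib import Path


def find_jsonld_files(root):
    """Recursively collect every .json / .jsonld file below `root`.

    The extensions are an assumption about how the JSON-LD files are named.
    The result is sorted so that files are always parsed in the same order.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in {".json", ".jsonld"}
    )


def merge_into_graph(root):
    """Parse every JSON-LD file below `root` into one rdflib.Graph."""
    # rdflib is imported here so that the file-discovery helper above
    # can still be used where rdflib is not installed.
    from rdflib import Graph

    merged = Graph()
    for path in find_jsonld_files(root):
        # Each call to parse() adds the triples of that file to the same graph.
        # Caveat: files that wrap their triples in named graphs may need
        # rdflib's ConjunctiveGraph instead; that case is not handled here.
        merged.parse(str(path), format="json-ld")
    return merged
```

Calling merge_into_graph on the folder and then serialize(destination="graph.ttl", format="turtle") on the returned graph would write the merged Turtle file to disk.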
I then created another script, a work-in-progress ShEx validator that will be
adjusted in the coming weeks. I made use of the PyShEx package, which unfortunately is
still in an alpha stage. This became evident when I first tried to validate the
graph.ttl file. I tried two different variants of the ShExC file that I wrote last week, with
no success at all. If I leave the ShExC file as it is, the script stops early with a
RecursionError message, while if I add the CLOSED shape constraint to every bibliographic entity,
almost every entity that undergoes validation is marked as invalid (even though theoretically it
shouldn’t be). I’ll continue investigating these issues during the next week.
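For context, the ShEx specification defines a CLOSED shape as one in which the focus node may not have outgoing triples whose predicates are not mentioned in the shape. The toy check below is unrelated to PyShEx internals and uses made-up data; it only illustrates how a single unlisted predicate is enough to make an entity fail. Whether this explains the failures seen with graph.ttl is still to be verified.

```python
def violates_closed(triples, focus, allowed_predicates):
    """Return the predicates of `focus` that a CLOSED shape would reject.

    `triples` is a list of (subject, predicate, object) tuples and
    `allowed_predicates` stands for the predicates listed in the shape.
    """
    return sorted(
        {p for s, p, o in triples if s == focus and p not in allowed_predicates}
    )


# Hypothetical data: a bibliographic resource with one predicate
# that the (equally hypothetical) shape does not mention.
triples = [
    ("br/1", "rdf:type", "fabio:Expression"),
    ("br/1", "dcterms:title", "A title"),
    ("br/1", "rdfs:label", "bibliographic resource 1"),
]
allowed = {"rdf:type", "dcterms:title"}

unexpected = violates_closed(triples, "br/1", allowed)
print(unexpected)  # ['rdfs:label']
```

A quick way to narrow the problem down could be to list, for a few failing entities, the predicates they actually carry and compare them with the ones declared in the ShExC file.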
Various fixes
I’ve also done some fixes and cleanups of the oc_ocdm code. I removed some unused
methods and class fields from ProvEntity, remnants of very old code that reflected
an older version of the OCDM specification. In particular, these methods/class fields
were removed:
- class fields iri_prov_agent, iri_association, iri_curator and iri_source_provider;
- get_types, create_creation_activity, create_update_activity, create_merging_activity and remove_type (together with class fields iri_create, iri_modify, iri_replace and iri_activity);
- generates (together with class field iri_was_generated_by);
- invalidates (together with class field iri_was_invalidated_by);
- involves_agent_with_role (together with class field iri_qualified_association);
- has_role_type (together with class field iri_had_role);
- has_role_in (together with class field iri_associated_agent).
Then, I fixed the line endings of every file in the repository: the entire repository now follows the
LF convention. The self.resp field of ProvSet was redundant, since its role had been taken over by
self.resp_agent, hence it was removed.
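As an illustration of the line-ending normalisation mentioned above, a conversion to LF can be sketched in a few lines of Python. This is not necessarily how the repository was fixed; the helper names are hypothetical, and the functions should only be applied to text files.

```python
from pathlib import Path


def to_lf(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF."""
    # CRLF must be replaced first, otherwise each CRLF would become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def normalize_file(path) -> bool:
    """Rewrite `path` with LF endings; return True if the file changed."""
    path = Path(path)
    original = path.read_bytes()
    converted = to_lf(original)
    if converted != original:
        path.write_bytes(converted)
        return True
    return False
```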
Finally, I fixed the implementation of both the add_se and _add_prov methods of ProvSet.
The _add_prov
method was simply cleaned up from legacy code that allow for a snapshot being
associated to multiple entities. This is no longer possible, hence the code was simplified and the
new method signature accepts a single prov_subject
parameter instead of the previous
list_of_entities
.
Besides reflecting this signature change in add_se too, this method was modified in the same way as
all the add_* methods of GraphSet were during the previous week, enforcing a singleton-like
behaviour for provenance snapshots as well.
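The following toy sketch shows one way such a singleton-like add_se could behave. It is not the oc_ocdm code: ToySnapshot, ToyProvSet, res_to_entity and the generated URI pattern are all hypothetical, and it assumes that "singleton-like" means that asking twice for the same resource returns the same object.

```python
class ToySnapshot:
    """Minimal stand-in for a provenance snapshot tied to ONE subject."""

    def __init__(self, res, prov_subject):
        self.res = res
        self.prov_subject = prov_subject


class ToyProvSet:
    """Hypothetical provenance set keeping at most one object per resource."""

    def __init__(self):
        self.res_to_entity = {}
        self._counter = 0

    def add_se(self, prov_subject, res=None):
        # Singleton-like behaviour: an already known resource is returned
        # as-is instead of being wrapped in a second Python object.
        if res is not None and res in self.res_to_entity:
            return self.res_to_entity[res]
        if res is None:
            # Made-up URI scheme, only needed to keep the sketch self-contained.
            self._counter += 1
            res = f"{prov_subject}/prov/se/{self._counter}"
        snapshot = ToySnapshot(res, prov_subject)
        self.res_to_entity[res] = snapshot
        return snapshot
```

With this design, two calls to add_se with the same res always yield the same snapshot object, so later changes cannot end up split across duplicates.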