[OCWR] Week 7 - OpenCitations Weekly Report
Week from Sep 14 to Sep 20
Introduction
During the seventh week, I mainly worked on a new implementation for keeping track of the Citation counters in an info_file_ci.txt file (similar to what already happens for every other type of entity). In the meantime, I also updated the oc_ocdm codebase with the new algorithm for provenance text files described in last week’s report.
Report
Various updates to the codebase
I fixed some wrong type hints in prov_entity.py and added the missing tests for the ProvSet class. I committed the new implementation for the provenance text files described in last week’s report, and I added the method has_next_de to the ReferencePointer class for compatibility with the jats2oc.py module.
Thinking about a new implementation for Citation counters
As of now, the OCDM states that a Citation must be identified by its OCI. For an in-text citation, the identifier is extended with an additional sequential number. The OCI is generated from the two sequential numbers that identify, respectively, the citing and the cited BibliographicResource entities. Since the OCI is not a sequential number itself, we don’t have to keep track of it. Instead, for each distinct OCI, we must keep track of the last in-text citation stored in the corpus for that OCI (since each in-text citation is identified by a sequential number).
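To make the bookkeeping concrete, here is a minimal in-memory sketch of a per-OCI counter. The names make_oci and InTextCounters are hypothetical and not part of oc_ocdm, and the "citing-cited" string layout is only assumed from the example lines in this report.

```python
from collections import defaultdict
from typing import DefaultDict


def make_oci(citing_number: int, cited_number: int) -> str:
    # Assumption: the OCI is written as "<citing>-<cited>".
    return f"{citing_number}-{cited_number}"


class InTextCounters:
    """Hypothetical helper: remembers the last in-text number used per OCI."""

    def __init__(self) -> None:
        # An OCI that was never seen starts from 0.
        self._last: DefaultDict[str, int] = defaultdict(int)

    def next_number(self, oci: str) -> int:
        # Increment and return the sequential number for this OCI.
        self._last[oci] += 1
        return self._last[oci]


counters = InTextCounters()
oci = make_oci(123456, 987654)
first = counters.next_number(oci)   # first in-text citation for this OCI
second = counters.next_number(oci)  # second in-text citation for the same OCI
```

An in-memory dictionary like this would of course not survive between runs, which is why the counters need to be persisted somewhere.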
After some reasoning, I came to the conclusion that the best solution that is not overcomplicated would be to maintain an info_file_ci.txt text file containing lines similar to those shown in the following example:
123456-987654 6
1928374-902817462 10
2938-1026 1
Each line contains the OCI and its incremental counter, separated by a single space. Every line of the file should have the same length, hence a padding technique (such as the one described in the Week 5 report) must be used.
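As a rough illustration of fixed-length lines, the sketch below pads each record with trailing spaces. This is only an assumption: the actual padding technique is the one from the Week 5 report, and LINE_WIDTH, format_record and parse_record are hypothetical names chosen for this example.

```python
from typing import Tuple

LINE_WIDTH = 40  # hypothetical fixed line length, newline included


def format_record(oci: str, counter: int, width: int = LINE_WIDTH) -> str:
    # Build "<OCI> <counter>" and pad it with spaces up to the fixed width.
    body = f"{oci} {counter}"
    if len(body) > width - 1:
        raise ValueError("record does not fit in the fixed line width")
    return body.ljust(width - 1) + "\n"


def parse_record(line: str) -> Tuple[str, int]:
    # split() with no arguments drops the padding and the newline.
    oci, counter = line.split()
    return oci, int(counter)
```

With lines of equal length, the n-th record starts at byte offset n * LINE_WIDTH, so a counter can be overwritten in place without touching the rest of the file.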
Since the number of stored citations is very large and bound to grow further (at the time of writing, more than 733 million citations have been stored in the OpenCitations Context Corpus!), read and write operations on this file can quickly become a source of performance issues.
Two opposite solutions came to mind:
- new lines are added in append mode (at the end of the file). The lines are then not ordered by OCI, so writing a new line is very quick but reading an existing line is very slow (it requires a linear search through the entire file);
- new lines are inserted so that the ordering by OCI is preserved. We can then exploit the order of the lines to “quickly” find a line by its OCI (with a binary search through the file), so reading an existing line is relatively quick but writing a new line is very slow (since adding a line in the middle of the file requires copying and rewriting its content).
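The read side of the second option can be sketched as a binary search over fixed-length records. This is a sketch under stated assumptions, not the oc_ocdm implementation: RECORD_SIZE, oci_key, find_counter and make_sorted_file are hypothetical names, records are assumed to be space-padded ASCII lines, and the ordering is assumed to be numeric on the (citing, cited) pair.

```python
import io
import os
from typing import BinaryIO, Dict, Optional, Tuple

RECORD_SIZE = 40  # hypothetical: bytes per line, newline included


def oci_key(oci: str) -> Tuple[int, int]:
    # Assumed ordering: numeric comparison of (citing, cited).
    citing, cited = oci.split("-")
    return int(citing), int(cited)


def find_counter(f: BinaryIO, oci: str, record_size: int = RECORD_SIZE) -> Optional[int]:
    """Binary search in a file of sorted, fixed-length records."""
    target = oci_key(oci)
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell() // record_size  # search window, in records
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * record_size)  # jump straight to the mid-th record
        found_oci, counter = f.read(record_size).decode("ascii").split()
        key = oci_key(found_oci)
        if key == target:
            return int(counter)
        if key < target:
            lo = mid + 1
        else:
            hi = mid
    return None  # OCI not present in the file


def make_sorted_file(counters: Dict[str, int], record_size: int = RECORD_SIZE) -> io.BytesIO:
    # Helper for the demo only: builds an in-memory, already sorted file.
    lines = []
    for oci in sorted(counters, key=oci_key):
        body = f"{oci} {counters[oci]}"
        lines.append(body.ljust(record_size - 1) + "\n")
    return io.BytesIO("".join(lines).encode("ascii"))


demo = make_sorted_file({"123456-987654": 6, "1928374-902817462": 10, "2938-1026": 1})
counter = find_counter(demo, "123456-987654")
```

A lookup touches only about log2(N) records, which is roughly 30 seeks for 733 million lines. Inserting a new OCI in the right position is what remains expensive, as noted in the list above.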
Splitting the file into smaller chunks could be an option, but this would complicate an already complicated situation. At the end of the week, I have not found any practical solution to this problem other than to stop using the OCI to identify citations, in favour of a simple sequential number as for every other entity stored in the corpus.