[OCWR] Week 5 - OpenCitations Weekly Report
Week from Aug 31 to Sep 06
Introduction
During the fifth week I continued working on the oc_ocdm project.
- I added a missing method to the `DiscourseElement` class to fix a problem found last week during the integration test with BEE and SPACIN
- I continued adding missing tests, type hints and docstrings to the `GraphSet`, `ProvSet` and `ProvEntity` classes
- I set up and built an HTML documentation using Sphinx, adding every missing docstring to the source code
- I started working on the next phase of the project, which is related to the redesign of the provenance layer
Report
Various fixes
Last week, during the integration test with BEE and SPACIN, an error was found regarding a `has_number` method missing from the `DiscourseElement` class, which was not documented in the OCDM. As it turned out, the documentation itself needs to be updated with this missing property. Hence, I added the missing `has_number` method to the `DiscourseElement` class and opened a new GitHub issue (#7) in the opencitations/metadata repository.
Then, I added the missing tests for the `GraphSet` and `ProvEntity` methods, and I added type hints and docstrings to the `ProvSet` and `ProvEntity` classes.
Documentation provided by Sphinx
As far as the documentation is concerned, I chose to try Sphinx, a tool that automates the process of building documentation. It can automatically extract docstrings from the code and export the documentation in a variety of formats, such as HTML, PDF and LaTeX. I prepared the oc_ocdm repo so that any developer can build the documentation locally while working on it and later submit their contributions to the public repository. I also updated the README.md file of the project to explain how to properly work with the documentation.
I also continued adding docstrings to the project, taking the explanation of each method from the OCDM specification itself.
Looking for a technique to randomly access a line inside a text file with O(1) complexity
I started working for the first time on the provenance layer: my initial task is to find an efficient way to handle the integer counters needed to keep track of the last snapshot number of each entity stored in the OpenCitations Context Corpus. This is currently done in a somewhat convoluted way: for each type of entity (“an”, “be”, “br”, “rp”, “ci”, …), the long list of integer counters, one per instance, is split and stored across multiple text files. Each text file contains one line per instance, holding the decimal representation of the current value of the respective counter. To retrieve and possibly modify a line, the file is read sequentially line by line: this is not a problem as long as each file is kept small.
Exploring the possibility of simplifying this mechanism by using a single text file for each type of entity, I had to find a way to efficiently retrieve a random line from a huge file, beating the performance of a sequential search. After studying the problem in depth, I came up with a simple solution that I believe should be evaluated as an alternative to the current strategy.
What’s required is a text file in which every line has the same byte length. I created a Python script which generates such a file from random numbers, writing 700 million lines of 4 ASCII characters each (the fourth is reserved for the `\n` character). If the decimal representation of the counter value is shorter than 3 characters, the remaining slots are filled with a special character (for example a blank space). Then, a line in the middle of the file (from 1 to 700M) is randomly chosen. The script then tries to retrieve the respective counter value, first with a sequential search and then with a different algorithm that exploits the `seek` method of Python’s file objects (moving the I/O pointer directly to the byte at `line_number * line_length`, where the correct line starts). Some quick tests showed that the latter strategy is optimal when the file is big, while for smaller files the OS cache often covers the weaknesses of the sequential search pretty well (making it the fastest strategy).
The “seek and read” strategy enjoys O(1) complexity, requiring a constant access time for each line of the file, which makes it quite suitable for our job. As a potential drawback, particular attention must be paid by human agents who want to interact with the text file directly, since they must be careful not to change the length of any line: doing so would skew every read from that point on.