[OCWR] Week 5 - OpenCitations Weekly Report

Week from Aug 31 to Sep 06

Introduction

During the fifth week I continued working on the oc_ocdm project.

  • I added a missing method to the DiscourseElement class to fix a problem found last week during the integration test with BEE and SPACIN
  • I continued adding missing tests, Type Hints and docstrings to the GraphSet, ProvSet and ProvEntity classes
  • I set up and built HTML documentation using Sphinx, adding every missing docstring to the source code
  • I started working on the next phase of the project, which is related to the redesign of the provenance layer

Report

Various fixes

Last week, during the integration test with BEE and SPACIN, an error was found regarding a has_number method missing from the DiscourseElement class, which was not documented in the OCDM. As it turned out, the documentation itself needs to be updated with this missing property. Hence, I added the missing has_number method to the DiscourseElement class and opened a new GitHub issue (#7) in the opencitations/metadata repository.
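
For context, here is a minimal sketch of what such a setter could look like in an RDFlib-based entity class. The class layout and the fabio:hasSequenceIdentifier predicate are my assumptions for illustration, not necessarily the actual oc_ocdm implementation:

    from rdflib import Graph, Literal, URIRef

    # Assumed predicate: FaBiO's hasSequenceIdentifier, a plausible OCDM
    # mapping for the number of a discourse element (e.g. a section).
    HAS_SEQUENCE_IDENTIFIER = URIRef("http://purl.org/spar/fabio/hasSequenceIdentifier")

    class DiscourseElementSketch:
        def __init__(self, g: Graph, res: URIRef):
            self.g = g      # the rdflib.Graph holding the entity's triples
            self.res = res  # the URIRef identifying this discourse element

        def has_number(self, string: str) -> None:
            # Attach the element's number as a plain literal,
            # replacing any previously stored value.
            self.g.remove((self.res, HAS_SEQUENCE_IDENTIFIER, None))
            self.g.add((self.res, HAS_SEQUENCE_IDENTIFIER, Literal(string)))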

Then, I added the missing tests related to the GraphSet and ProvEntity methods, and added Type Hints and docstrings to the ProvSet and ProvEntity classes.

Documentation provided by Sphinx

As far as the documentation is concerned, I chose to try Sphinx, a tool that automates the building of documentation. It can automatically extract docstrings from the code and export the documentation in a variety of formats, such as HTML, PDF and LaTeX. I prepared the oc_ocdm repo in such a way that any developer can locally build the documentation while working on it and later submit their contributions to the public repository. I also updated the README.md file of the project to explain how to properly work with the documentation.
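
For reference, here is a minimal, illustrative excerpt of what the Sphinx configuration can look like (the actual conf.py in the oc_ocdm repo may differ in details such as paths and theme):

    # docs/conf.py
    import os
    import sys

    # Make the package importable so that autodoc can inspect it.
    sys.path.insert(0, os.path.abspath(".."))

    project = "oc_ocdm"
    extensions = [
        "sphinx.ext.autodoc",   # extract documentation from docstrings
        "sphinx.ext.viewcode",  # link each documented object to its source
    ]
    html_theme = "alabaster"    # Sphinx's default theme

With such a configuration in place, a developer can rebuild the HTML pages locally by running Sphinx's make html target inside the docs folder.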

I also continued adding docstrings to the project, taking the explanation of each method from the OCDM specification itself.

Looking for a technique to randomly access a line inside a text file with O(1) complexity

I started working for the first time on the provenance layer: my initial task is to find an efficient way to handle the integer counters needed to keep track of the last snapshot number of each entity stored in the OpenCitations Context Corpus. This is currently done in a somewhat convoluted way: for each type of entity (“an”, “be”, “br”, “rp”, “ci”, …), the long list of integers related to its instances is split and stored across multiple text files. Each text file has a line for every instance, containing the decimal representation of the current value of the respective counter. To retrieve and possibly modify a line, the file is read sequentially line by line: this is not a problem as long as the size of each file is kept small.
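
To make the comparison below concrete, here is a minimal sketch of the sequential lookup just described (the function name is mine, not the actual OpenCitations code):

    def read_counter_sequential(path: str, line_number: int) -> int:
        # line_number is 1-based: scan the file until that line is reached,
        # then parse the decimal counter value stored there.
        with open(path, "r", encoding="ascii") as f:
            for current, line in enumerate(f, start=1):
                if current == line_number:
                    return int(line)
        raise ValueError(f"line {line_number} not found in {path}")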

Exploring the possibility of simplifying this mechanism by using only one text file for each type of entity, I had to find a way to efficiently retrieve a random line from a huge file, beating the performance of a sequential search. After studying the problem in depth, I came up with a simple solution that I believe should be evaluated as an alternative to the current strategy.

What’s required is a text file with the same byte length for each line. I created a Python script which generates such a file from random numbers, writing 700 million lines of 4 ASCII characters each (the fourth being reserved for the \n character). If the decimal representation of the counter value is shorter than 3 characters, the remaining slots are filled with a special character (for example, a blank space). Then, a line of the file (from 1 to 700M) is randomly chosen. The script tries to retrieve the respective counter value first with a sequential search and then with a different algorithm that exploits the seek method of Python’s file objects, moving the I/O pointer directly to byte line_number * line_length (counting lines from zero), where the desired line starts. Some quick tests showed that the latter strategy is optimal when the file is large; otherwise, the OS cache often masks the weaknesses of the sequential search well enough to make it the fastest strategy.
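
The following sketch reproduces the idea on a smaller scale (helper names and the number of lines are illustrative, not the actual script):

    import random

    LINE_LENGTH = 4  # 3 characters for the value plus the '\n' terminator

    def generate_file(path: str, n_lines: int) -> None:
        # Every line occupies exactly LINE_LENGTH bytes: the value is
        # left-padded with blank spaces up to 3 characters. newline=""
        # prevents the OS from rewriting '\n' as '\r\n', which would
        # break the fixed-width byte layout.
        with open(path, "w", encoding="ascii", newline="") as f:
            for _ in range(n_lines):
                f.write(str(random.randint(0, 999)).rjust(LINE_LENGTH - 1) + "\n")

    def read_counter_seek(path: str, line_number: int) -> int:
        # line_number is 1-based, so the line starts at byte
        # (line_number - 1) * LINE_LENGTH: one seek reaches it in O(1).
        with open(path, "rb") as f:
            f.seek((line_number - 1) * LINE_LENGTH)
            return int(f.read(LINE_LENGTH).decode("ascii"))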

The “seek and read” strategy enjoys O(1) complexity, requiring constant access time for any line of the file, which makes it quite suitable for our job. As a potential drawback, particular care must be taken by human agents who want to edit the text file by hand, since changing the byte length of any line would skew every read from that point on.
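
The same layout also allows an in-place update with a single seek, provided the new value still fits in the fixed width. A hedged sketch of this, again with illustrative names and the same LINE_LENGTH as above:

    LINE_LENGTH = 4  # must match the layout of the generated file

    def write_counter_seek(path: str, line_number: int, value: int) -> None:
        # Overwrite the line in place without altering its byte length;
        # a wider value would corrupt the fixed-width layout.
        encoded = str(value).rjust(LINE_LENGTH - 1) + "\n"
        if len(encoded) != LINE_LENGTH:
            raise ValueError("value too wide for the fixed line length")
        with open(path, "r+b") as f:
            f.seek((line_number - 1) * LINE_LENGTH)
            f.write(encoded.encode("ascii"))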