Literals dictionary - splitting objects into subsections by datatype #139
This pull request introduces a new structure for the dictionary's objects section. The proposed model is introduced and explained in the thesis of Javier D. Fernandez, attached here: memoriaImpresion4Diciembre.pdf
The main idea of this new implementation is to split the objects section into multiple sections, one per datatype (e.g. integers, dates, floats, strings, etc.). The purpose of this approach is to get information about literals natively and in near-constant time: filtering by datatype, string, isLiteral, language, etc. can be done directly on the dictionary, whereas the previous implementation had to iterate over all objects to apply such filters, which can be catastrophically bad for performance.
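To illustrate why the split helps (a minimal sketch, not the PR's actual code; the class and method names here are hypothetical), keeping one sorted section per datatype turns a datatype filter into a single map lookup instead of a full scan:

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class DatatypeSplitDemo {
    // One lexicographically sorted section of literals per datatype IRI.
    private final Map<String, TreeSet<String>> sections = new TreeMap<>();

    public void add(String datatype, String literal) {
        sections.computeIfAbsent(datatype, k -> new TreeSet<>()).add(literal);
    }

    // Old layout: filtering means scanning every object.
    // New layout: filtering by datatype is one map lookup.
    public Iterable<String> literalsOfType(String datatype) {
        TreeSet<String> section = sections.get(datatype);
        return section == null ? Collections.emptySet() : section;
    }
}
```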
The figure below shows the global view of the dictionary with split sections:
To create the sections, HDT first builds a temporary dictionary by parsing the RDF file into hashmap-based sections (s, p, o, and shared) that store the strings together with their assigned IDs. For the subsections the strings need to be stored sorted, so we kept the same implementation but reused the class HashDictionarySection with an extra custom hashmap that keeps the count of literals per datatype (<key: datatype, value: count>). We then iterate over this section to create the subsections, using the stored counts to detect where each subsection ends.
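A minimal sketch of that bookkeeping (hypothetical names; the actual PR extends HashDictionarySection, which is not reproduced here):

```java
import java.util.HashMap;
import java.util.Map;

// While filling the temporary objects section, keep <datatype, count>
// so the sorted strings can later be sliced into per-datatype subsections.
public class LiteralCounts {
    private final Map<String, Long> countsPerDatatype = new HashMap<>();

    public void onLiteralParsed(String datatypeIri) {
        countsPerDatatype.merge(datatypeIri, 1L, Long::sum);
    }

    // When iterating over the sorted section, the stored count tells
    // us how many consecutive entries belong to the current subsection.
    public long sectionSize(String datatypeIri) {
        return countsPerDatatype.getOrDefault(datatypeIri, 0L);
    }
}
```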
The sections themselves are sorted lexicographically, and we follow the logic proposed in the thesis: a mapping table keeps a set of pointers to the sections per datatype, so that a specific section can be retrieved at query time. To fit our needs, we use a sorted map that keeps the sections in lexicographic order and lets us fetch a specific section and search within it easily through the two main operations (idToString and stringToId). This mapping is written to the dictionary before the objects sections, so that it can be read back when loading or mapping the generated HDT file.
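A simplified sketch of such a catalog (hypothetical classes; the real implementation stores compressed DictionarySection objects rather than plain arrays):

```java
import java.util.Arrays;
import java.util.TreeMap;

public class SectionCatalog {
    // The TreeMap keeps the sections ordered lexicographically by
    // datatype IRI, matching the order in which they are serialized.
    private final TreeMap<String, String[]> sections = new TreeMap<>();

    public void addSection(String datatype, String[] sortedLiterals) {
        sections.put(datatype, sortedLiterals);
    }

    // stringToId: route to the right section by datatype, then
    // binary-search inside it (IDs here are local and 1-based).
    public long stringToId(String datatype, String literal) {
        String[] section = sections.get(datatype);
        if (section == null) return -1;
        int pos = Arrays.binarySearch(section, literal);
        return pos < 0 ? -1 : pos + 1;
    }

    // idToString: direct access inside the chosen section.
    public String idToString(String datatype, long localId) {
        return sections.get(datatype)[(int) (localId - 1)];
    }
}
```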
The second part concerns the local-to-global and global-to-local ID conversion; these operations also had to be adapted to the objects-subsections architecture, as sketched below:
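The idea (a minimal sketch with hypothetical names; handling of the shared section is omitted for brevity) is that the global ID of an object is its local ID plus the sizes of all preceding subsections, so prefix sums over the subsection sizes suffice to convert in both directions:

```java
import java.util.LinkedHashMap;

public class IdMapper {
    private final String[] datatypes; // subsections in serialized order
    private final long[] offsets;     // prefix sums of subsection sizes

    public IdMapper(LinkedHashMap<String, Long> sectionSizes) {
        datatypes = sectionSizes.keySet().toArray(new String[0]);
        offsets = new long[datatypes.length + 1];
        int i = 0;
        for (long size : sectionSizes.values()) {
            offsets[i + 1] = offsets[i] + size;
            i++;
        }
    }

    // Local ID within a datatype's subsection -> global object ID.
    public long localToGlobal(String datatype, long localId) {
        for (int i = 0; i < datatypes.length; i++) {
            if (datatypes[i].equals(datatype)) {
                return offsets[i] + localId;
            }
        }
        throw new IllegalArgumentException("unknown datatype: " + datatype);
    }

    // Global object ID -> local ID inside its subsection.
    public long globalToLocal(long globalId) {
        for (int i = 0; i < datatypes.length; i++) {
            if (globalId <= offsets[i + 1]) {
                return globalId - offsets[i];
            }
        }
        throw new IllegalArgumentException("id out of range: " + globalId);
    }
}
```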
We also introduced functionalities to get the datatype of a given ID and to get the ID range of a given datatype:
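Both lookups can be built on the same prefix-sum offsets (again a hypothetical sketch, not the PR's actual API):

```java
public class DatatypeLookup {
    private final String[] datatypes;
    private final long[] offsets; // (offsets[i], offsets[i+1]] is section i

    public DatatypeLookup(String[] datatypes, long[] offsets) {
        this.datatypes = datatypes;
        this.offsets = offsets;
    }

    // Datatype of a global object ID: find the section whose range contains it.
    public String datatypeOfId(long globalId) {
        for (int i = 0; i < datatypes.length; i++) {
            if (globalId > offsets[i] && globalId <= offsets[i + 1]) {
                return datatypes[i];
            }
        }
        return null;
    }

    // Inclusive [first, last] global-ID range of a datatype's subsection.
    public long[] rangeOfDatatype(String datatype) {
        for (int i = 0; i < datatypes.length; i++) {
            if (datatypes[i].equals(datatype)) {
                return new long[] { offsets[i] + 1, offsets[i + 1] };
            }
        }
        return null;
    }
}
```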
A set of tests has been added as well to cover the new dictionary functionality.