Literals dictionary - splitting objects into subsections by datatype #139
This pull request introduces a new structure for the dictionary's objects section. The proposed model is introduced and explained in the thesis of Javier D. Fernandez, attached here: memoriaImpresion4Diciembre.pdf
The main idea of this new implementation is to split the objects section into multiple sections, one per datatype (e.g. integers, dates, floats, strings, etc.). The purpose of this approach is to get information about literals natively and in near-constant time: filtering by datatype, string, isLiteral, language, etc. can be done directly on the dictionary, whereas the previous implementation had to iterate over all objects to apply such filters, which can be catastrophically bad for performance.
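To illustrate why the split helps (a minimal sketch, not the PR's actual code; the class and method names here are hypothetical), keeping one sorted section per datatype turns a datatype filter into a single map lookup instead of a full scan:

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class DatatypeSplitDemo {
    // One lexicographically sorted section of literals per datatype IRI.
    private final Map<String, TreeSet<String>> sections = new TreeMap<>();

    public void add(String datatype, String literal) {
        sections.computeIfAbsent(datatype, k -> new TreeSet<>()).add(literal);
    }

    // Old layout: filtering means scanning every object.
    // New layout: filtering by datatype is one map lookup.
    public Iterable<String> literalsOfType(String datatype) {
        TreeSet<String> section = sections.get(datatype);
        return section == null ? Collections.emptySet() : section;
    }
}
```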
The figure below shows the global view of the dictionary with split sections:
To create the sections, HDT first builds a temporary dictionary by parsing the RDF file into hashmap-based sections (s, p, o, and shared) that store the strings together with their assigned IDs. For the subsections the strings need to be stored sorted, so we kept the same implementation but reused the class HashDictionarySection with an extra custom hashmap that keeps the count of literals per datatype (<key: datatype, value: count>). We then iterate over this section to create the subsections, using the stored counts to detect where each subsection ends.
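A minimal sketch of that bookkeeping (hypothetical names; the actual PR extends HashDictionarySection, which is not reproduced here):

```java
import java.util.HashMap;
import java.util.Map;

// While filling the temporary objects section, keep <datatype, count>
// so the sorted strings can later be sliced into per-datatype subsections.
public class LiteralCounts {
    private final Map<String, Long> countsPerDatatype = new HashMap<>();

    public void onLiteralParsed(String datatypeIri) {
        countsPerDatatype.merge(datatypeIri, 1L, Long::sum);
    }

    // When iterating over the sorted section, the stored count tells
    // us how many consecutive entries belong to the current subsection.
    public long sectionSize(String datatypeIri) {
        return countsPerDatatype.getOrDefault(datatypeIri, 0L);
    }
}
```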
The sections themselves are sorted lexicographically, and we follow the logic proposed in the thesis: a mapping table keeps a set of pointers to the sections per datatype, so that a specific section can be retrieved at query time. To fit our needs, we use a sorted map that keeps the sections in lexicographic order and lets us fetch a specific section and search within it easily through the two main operations (idToString and stringToId). This mapping is written to the dictionary before the objects sections, so that it can be read back when loading or mapping the generated HDT file.
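A simplified sketch of such a catalog (hypothetical classes; the real implementation stores compressed DictionarySection objects rather than plain arrays):

```java
import java.util.Arrays;
import java.util.TreeMap;

public class SectionCatalog {
    // The TreeMap keeps the sections ordered lexicographically by
    // datatype IRI, matching the order in which they are serialized.
    private final TreeMap<String, String[]> sections = new TreeMap<>();

    public void addSection(String datatype, String[] sortedLiterals) {
        sections.put(datatype, sortedLiterals);
    }

    // stringToId: route to the right section by datatype, then
    // binary-search inside it (IDs here are local and 1-based).
    public long stringToId(String datatype, String literal) {
        String[] section = sections.get(datatype);
        if (section == null) return -1;
        int pos = Arrays.binarySearch(section, literal);
        return pos < 0 ? -1 : pos + 1;
    }

    // idToString: direct access inside the chosen section.
    public String idToString(String datatype, long localId) {
        return sections.get(datatype)[(int) (localId - 1)];
    }
}
```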
The second part concerns the local-to-global and global-to-local ID conversion; these operations also had to be adapted to the objects-subsections architecture, as sketched below:
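The idea (a minimal sketch with hypothetical names; handling of the shared section is omitted for brevity) is that the global ID of an object is its local ID plus the sizes of all preceding subsections, so prefix sums over the subsection sizes suffice to convert in both directions:

```java
import java.util.LinkedHashMap;

public class IdMapper {
    private final String[] datatypes; // subsections in serialized order
    private final long[] offsets;     // prefix sums of subsection sizes

    public IdMapper(LinkedHashMap<String, Long> sectionSizes) {
        datatypes = sectionSizes.keySet().toArray(new String[0]);
        offsets = new long[datatypes.length + 1];
        int i = 0;
        for (long size : sectionSizes.values()) {
            offsets[i + 1] = offsets[i] + size;
            i++;
        }
    }

    // Local ID within a datatype's subsection -> global object ID.
    public long localToGlobal(String datatype, long localId) {
        for (int i = 0; i < datatypes.length; i++) {
            if (datatypes[i].equals(datatype)) {
                return offsets[i] + localId;
            }
        }
        throw new IllegalArgumentException("unknown datatype: " + datatype);
    }

    // Global object ID -> local ID inside its subsection.
    public long globalToLocal(long globalId) {
        for (int i = 0; i < datatypes.length; i++) {
            if (globalId <= offsets[i + 1]) {
                return globalId - offsets[i];
            }
        }
        throw new IllegalArgumentException("id out of range: " + globalId);
    }
}
```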
We also introduced functionalities to get the datatype of a given ID and to get the ID range of a given datatype:
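Both lookups can be built on the same prefix-sum offsets (again a hypothetical sketch, not the PR's actual API):

```java
public class DatatypeLookup {
    private final String[] datatypes;
    private final long[] offsets; // (offsets[i], offsets[i+1]] is section i

    public DatatypeLookup(String[] datatypes, long[] offsets) {
        this.datatypes = datatypes;
        this.offsets = offsets;
    }

    // Datatype of a global object ID: find the section whose range contains it.
    public String datatypeOfId(long globalId) {
        for (int i = 0; i < datatypes.length; i++) {
            if (globalId > offsets[i] && globalId <= offsets[i + 1]) {
                return datatypes[i];
            }
        }
        return null;
    }

    // Inclusive [first, last] global-ID range of a datatype's subsection.
    public long[] rangeOfDatatype(String datatype) {
        for (int i = 0; i < datatypes.length; i++) {
            if (datatypes[i].equals(datatype)) {
                return new long[] { offsets[i] + 1, offsets[i + 1] };
            }
        }
        return null;
    }
}
```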
A set of tests has been added as well to cover the new dictionary functionality.