
Literals dictionary - splitting objects into subsections by datatype #139

Merged (2 commits) Jan 13, 2022
Conversation

AlyHdr commented Jan 11, 2022

In this pull request we introduce a new structure for the dictionary's objects section. The proposed model is introduced and explained in the thesis of Javier D. Fernández, attached here: memoriaImpresion4Diciembre.pdf

The main idea of this new implementation is to split the objects section into multiple sections, one per datatype (e.g. integers, dates, floats, strings, etc.). The main purpose of this approach is to get information about literals natively and in near-constant time: filtering by datatype, string, isLiteral, language, etc. can be done directly on the dictionary, whereas in the previous implementation we had to iterate through all objects to filter, which can be catastrophic for performance.

The figure below shows the global view of the dictionary with split sections:

[Figure: global view of the dictionary with the objects section split into per-datatype subsections]

To create the sections, HDT first builds a temporary dictionary by parsing the RDF file into hash-based sections (s, p, o, and shared) that store the strings and their assigned IDs. The subsections must be stored sorted, so we kept the same implementation but reused the HashDictionarySection class with an extra hashmap that keeps the count of literals per datatype (<key: datatype, value: count>). We then iterate over this section to create the subsections, using the stored counts to detect where each subsection ends.
The sections are, of course, sorted lexicographically. We follow the same logic as proposed in the thesis, which shows that there must be a mapping table keeping a pointer to each datatype's section so that a specific section can be fetched at query time. To fit our needs, we use a sorted hashmap that keeps the sections in lexicographic order and lets us retrieve a specific section and search over it easily via the two main operations (idToString and stringToId). This mapping is written to the dictionary before the objects sections, so it can be read back when loading or mapping the generated HDT file.
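The counting step described above can be sketched roughly as follows. This is an illustrative, self-contained sketch, not the actual HDT classes: the class and method names are made up, and a plain TreeMap stands in for the sorted mapping of subsections.

```java
import java.util.*;

// Illustrative sketch: while parsing, literals land in one temporary section
// and a side map counts the entries per datatype; afterwards the counts tell
// us where each sorted subsection ends. Names are hypothetical.
public class LiteralCounts {
    // datatype -> number of literals seen with that datatype
    private final TreeMap<String, Integer> countsPerDatatype = new TreeMap<>();
    // temporary, unsorted storage: [datatype, lexicalForm] pairs
    private final List<String[]> entries = new ArrayList<>();

    public void add(String datatype, String lexicalForm) {
        countsPerDatatype.merge(datatype, 1, Integer::sum);
        entries.add(new String[]{datatype, lexicalForm});
    }

    public Map<String, Integer> getCounts() {
        return countsPerDatatype;
    }

    // Build one lexicographically sorted subsection per datatype; the TreeMap
    // keeps the subsections themselves in datatype order.
    public TreeMap<String, List<String>> buildSubsections() {
        TreeMap<String, List<String>> sections = new TreeMap<>();
        for (String[] e : entries) {
            sections.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }
        for (List<String> section : sections.values()) {
            Collections.sort(section);
        }
        return sections;
    }
}
```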

The second part concerns the local-to-global and global-to-local ID conversions; these operations also had to be adapted to the objects subsections architecture:

  • global-to-local: we iterate over the subsections, accumulating their entry counts until the total covers the given global ID; at that point we know we are in the right subsection and can extract the string for that ID in the idToString method.
  • local-to-global: we first fetch the subsection from the map using the datatype as key; the global ID of the object is then the local ID plus the accumulated counts of the preceding sections plus the shared section count.
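The two conversions above can be sketched as follows. This is a simplified, self-contained illustration under assumed names (IdMapping, toLocal, toGlobal are not the actual HDT API); it assumes the global ID already points past the shared section into the object subsections.

```java
import java.util.*;

// Illustrative sketch of the two ID conversions over per-datatype subsections.
// sectionSizes maps datatype -> subsection size, iterated in sorted order.
public class IdMapping {
    private final long sharedCount;
    private final TreeMap<String, Long> sectionSizes;

    public IdMapping(long sharedCount, TreeMap<String, Long> sectionSizes) {
        this.sharedCount = sharedCount;
        this.sectionSizes = sectionSizes;
    }

    // global-to-local: accumulate subsection sizes until the accumulated
    // count covers the global id; the remainder is the local id.
    public long toLocal(long globalId) {
        long offset = sharedCount;
        for (Map.Entry<String, Long> e : sectionSizes.entrySet()) {
            if (globalId <= offset + e.getValue()) {
                return globalId - offset; // local id within this subsection
            }
            offset += e.getValue();
        }
        throw new IllegalArgumentException("id out of range: " + globalId);
    }

    // local-to-global: sum the sizes of all subsections that precede the
    // given datatype, add the shared count, then add the local id.
    public long toGlobal(String datatype, long localId) {
        long offset = sharedCount;
        for (Map.Entry<String, Long> e : sectionSizes.entrySet()) {
            if (e.getKey().equals(datatype)) {
                return offset + localId;
            }
            offset += e.getValue();
        }
        throw new IllegalArgumentException("unknown datatype: " + datatype);
    }
}
```

Note that the two operations are inverses: converting a local ID to global and back yields the same local ID within the same subsection.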

We also introduced functionalities to get the datatype of a given ID and the ID range of a given datatype:

  • dataTypeOfId(long id)
  • getDataTypeRange(String dataType)
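A rough sketch of how these two helpers can work over the per-datatype layout, using hypothetical, hard-coded section sizes for illustration (the datatype IRIs and sizes below are assumptions, not the real dictionary contents):

```java
import java.util.*;

// Illustrative sketch: both helpers reduce to a walk over the sorted
// subsection sizes, the same accumulation used for ID conversion.
public class DatatypeLookup {
    private final long sharedCount = 2; // assumed shared-section size
    // datatype -> subsection size, kept in lexicographic order (assumed data)
    private final TreeMap<String, Long> sectionSizes = new TreeMap<>(Map.of(
            "http://www.w3.org/2001/XMLSchema#date", 3L,
            "http://www.w3.org/2001/XMLSchema#integer", 5L));

    // dataTypeOfId: find the subsection whose global-id range contains id.
    public String dataTypeOfId(long id) {
        long offset = sharedCount;
        for (Map.Entry<String, Long> e : sectionSizes.entrySet()) {
            if (id <= offset + e.getValue()) {
                return e.getKey();
            }
            offset += e.getValue();
        }
        return null; // not an object literal id
    }

    // getDataTypeRange: [first, last] global ids of a datatype's subsection.
    public long[] getDataTypeRange(String dataType) {
        long offset = sharedCount;
        for (Map.Entry<String, Long> e : sectionSizes.entrySet()) {
            if (e.getKey().equals(dataType)) {
                return new long[]{offset + 1, offset + e.getValue()};
            }
            offset += e.getValue();
        }
        return null; // unknown datatype
    }
}
```

Because a datatype's literals occupy one contiguous ID range, a datatype filter becomes a single range lookup instead of a scan over all objects.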

A set of tests has been added as well to cover the new dictionary functionalities.

@AlyHdr AlyHdr changed the title Literals dictionary - splitting objects into subsections per datatype Literals dictionary - splitting objects into subsections by datatype Jan 11, 2022
mielvds (Member) commented Jan 12, 2022

Hi @AlyHdr, thanks for this interesting work! These changes introduce a version change in the index. Could you elaborate on the changes you made to the version numbers and the header?

AlyHdr (Author) commented Jan 12, 2022

The version change distinguishes the new underlying dictionary: if you have an old version of the repository and try to load such a file, we want to throw an error, because the version of the HDT file is no longer compatible with the running code (one has to upgrade to the latest version).

In any case, to generate an HDT file with this version, one has to pass extra parameters to rdf2hdt or to the HDTSpecification to specify which dictionary to use, like so:

From cli:

./bin/rdf2hdt.sh ../data/test.nt ../data/temp.hdt -options "tempDictionary.impl=multHash;dictionary.type=dictionaryMultiObj"

From java:

ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
String file1 = classLoader.getResource("example1.nt").getFile();
HDTSpecification spec = new HDTSpecification();
spec.setOptions("tempDictionary.impl=multHash;dictionary.type=dictionaryMultiObj;");
HDT hdt1 = HDTManager.generateHDT(new File(file1).getAbsolutePath(), "uri", RDFNotation.NTRIPLES, spec, null);

@mielvds mielvds added this to the 3.0.0 milestone Jan 12, 2022
AlyHdr (Author) commented Jan 12, 2022

I think there is a problem with this PR: it includes the commits of the previous one (#138) by mistake :| but it should merge without conflicts.

D063520 (Contributor) commented Jan 12, 2022

Hi, we have some changes that build on top of this (basically hdtCat on top of this data structure). Would it be possible to merge this into the code base, maybe into a separate branch (like 3.0.0)? That way we can continue. Sorry for doing this all in one shot...

@mielvds mielvds changed the base branch from master to 3.0.0 January 13, 2022 09:36
@mielvds mielvds merged commit 0ffcb43 into rdfhdt:3.0.0 Jan 13, 2022
mielvds (Member) commented Jan 13, 2022

@D063520 done.

D063520 (Contributor) commented Jan 13, 2022

Thank you!
