Performance improvement in creating the extra index #140

AlyHdr · 2022-01-12T12:25:46Z

We are introducing a major improvement in creating the extra predicates index. The main fix is in Bitmap375, which represents the bitmaps of the triples. Its select1(long n) operation finds the position at which n ones have appeared in the bitmap. The idea behind this operation is to get the number of distinct subtrees in the SPO forest (i.e. to get the position of the parent), as arranged in the figure below:

At its core, this operation uses a binary search over the array of longs representing the bits of the bitmap to find the block in which the n-th bit set to 1 is located.
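To make the idea concrete, here is a minimal sketch of a select1 built on a binary search over per-word cumulative counts. It is not the Bitmap375 code: the class name SelectSketch, the onesBefore array, the 0-based positions and the -1 return value are all assumptions chosen for illustration.

```java
/** Illustrative sketch only; not the actual Bitmap375 implementation. */
public class SelectSketch {
    private final long[] words;      // the bitmap, 64 bits per long
    private final long[] onesBefore; // onesBefore[i] = number of 1 bits in words[0..i-1]

    public SelectSketch(long[] words) {
        this.words = words;
        this.onesBefore = new long[words.length + 1];
        for (int i = 0; i < words.length; i++) {
            onesBefore[i + 1] = onesBefore[i] + Long.bitCount(words[i]);
        }
    }

    /** 0-based position of the n-th set bit (n >= 1), or -1 if the bitmap has fewer than n ones. */
    public long select1(long n) {
        if (n < 1 || n > onesBefore[words.length]) {
            return -1;
        }
        // Binary search for the FIRST word whose cumulative count reaches n.
        // Words without any set bit repeat the previous count, so the search
        // must keep narrowing to the left instead of stopping on any hit.
        int lo = 0;
        int hi = words.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (onesBefore[mid + 1] >= n) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        // Locate the remaining-th set bit inside the word that was found.
        long remaining = n - onesBefore[lo];
        long word = words[lo];
        for (long i = 1; i < remaining; i++) {
            word &= word - 1; // clear the lowest set bit
        }
        return (long) lo * 64 + Long.numberOfTrailingZeros(word);
    }
}
```

In this sketch, words that contain no set bit produce repeated values in onesBefore, which is the situation where a search that stops on an arbitrary hit needs extra work to find the right block.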

We first removed the SortUtils binary search because it had a bug with bigger indexes. We also noticed that the search stops on the first hit and then moves linearly through the array until it finds the position of the first match. This can be catastrophic for large arrays, especially ones with many repeated values. Take the example below:

arr = [ 1 ,2 ,2 ,2 ,2 ,2 ,2 ,2, 3 ,3 ,4 ]
key = 2
task: binary search to find the first position of 2
solution: keep doing the binary search until we hit the first position of 2, instead of stopping on the first hit and going back iteratively.
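The example above can be sketched as a leftmost binary search. This is a generic illustration rather than the code from this pull request; the class name FirstOccurrence and the method firstIndexOf are hypothetical.

```java
/** Illustrative sketch only; names are hypothetical and not taken from hdt-java. */
public final class FirstOccurrence {
    private FirstOccurrence() {}

    /** Index of the first element equal to key in the sorted array, or -1 if key is absent. */
    public static int firstIndexOf(long[] arr, long key) {
        int lo = 0;
        int hi = arr.length; // search the half-open range [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (arr[mid] < key) {
                lo = mid + 1; // first occurrence is strictly to the right
            } else {
                hi = mid;     // arr[mid] >= key: keep narrowing, even on a hit
            }
        }
        return (lo < arr.length && arr[lo] == key) ? lo : -1;
    }
}
```

For the array in the example, firstIndexOf(arr, 2) returns 1. The number of steps stays logarithmic in the array length regardless of how many times the key is repeated, whereas stopping on the first hit and walking back is linear in the length of the run.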

This solution speeds up the generation of the index tremendously. We tested it on a 5.6 GB HDT file with 8 GB of RAM and got the following results:

Current implementation:

[INFO] Scanning for projects...
[INFO] 
[INFO] ----------------------< org.rdfhdt:hdt-java-cli >-----------------------
[INFO] Building HDT Java Command line Tools 2.1.3-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ hdt-java-cli ---
Predicate Bitmap in 6 sec 87 ms 544 us
Count predicates in 36 min 39 sec 442 ms 61 us
Count Objects in 15 sec 208 ms 517 us Max was: 34075063
Bitmap in 434 ms 194 us
Object references in 22 min 38 sec 636 ms 255 us
Sort object sublists in 51 sec 242 ms 391 us
Count predicates in 4 sec 874 ms 823 us
Index generated in 23 min 50 sec 400 ms 526 us
Index generated and saved in 1 hour 15 min 4 sec 576 ms 469 us
>> exit
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:21 h
[INFO] Finished at: 2022-01-12T13:14:10+01:00
[INFO] ------------------------------------------------------------------------

With the new solution:

[INFO] Scanning for projects...
[INFO] 
[INFO] ----------------------< org.rdfhdt:hdt-java-cli >-----------------------
[INFO] Building HDT Java Command line Tools 2.1.3-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ hdt-java-cli ---
Predicate Bitmap in 4 sec 973 ms 900 us
Count predicates in 26 sec 991 ms 822 us
Count Objects in 8 sec 520 ms 664 us Max was: 34075063
Bitmap in 307 ms 68 us
Object references in 55 sec 125 ms 906 us
Sort object sublists in 50 sec 222 ms 213 us
Count predicates in 4 sec 548 ms 744 us
Index generated in 1 min 58 sec 725 ms 100 us
Index generated and saved in 2 min 38 sec 255 ms 791 us
>> exit
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:02 min
[INFO] Finished at: 2022-01-12T13:19:46+01:00
[INFO] ------------------------------------------------------------------------

We can see that the time it takes to generate and save the extra index goes down from 1 hour 15 min to 2 min 38 sec!

We also added a buffered output stream for writing the file to disk at the end of the index creation, which speeds up the writing to disk as well.
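As a rough illustration of the buffering idea, the sketch below wraps a file stream in a BufferedOutputStream so that many small writes are grouped into fewer, larger ones. The class BufferedWriteSketch, the method writeLongs and the little-endian byte layout are assumptions made for this example and do not reflect how the index is actually serialized.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative sketch only; not the hdt-java serialization code. */
public final class BufferedWriteSketch {
    private BufferedWriteSketch() {}

    /** Writes each long as 8 bytes, least significant byte first (an arbitrary choice here). */
    public static void writeLongs(Path target, long[] values) throws IOException {
        // Without the BufferedOutputStream wrapper, every single-byte write
        // below would go straight to the underlying file stream.
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            for (long v : values) {
                for (int shift = 0; shift < 64; shift += 8) {
                    out.write((int) (v >>> shift) & 0xFF);
                }
            }
        } // try-with-resources flushes the buffer and closes the file
    }
}
```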


D063520 · 2022-01-12T12:34:33Z

I worked on this too. I'm for these changes. To summarize:

We add a buffered stream to write to the file.
It looks like many changes, but most of them are just tab alignment changes. We replaced the binary search with a slightly modified one (I restate the above):

arr = [ 1 ,2 ,2 ,2 ,2 ,2 ,2 ,2, 3 ,3 ,4 ]
key = 2
task: binary search to find the first position of 2
problem: a classical binary search would give back the 2 in the middle, but we are searching for the one at the beginning.
solution: keep doing the binary search until we hit the first position of 2, instead of stopping on the first hit and going back iteratively.

With these changes we achieve index creation performance similar to that of the C++ version.

mielvds · 2022-01-12T13:47:25Z

Cool! If I understand correctly, this doesn't change anything to the HDT format itself, right?

AlyHdr · 2022-01-12T13:49:06Z

Cool! If I understand correctly, this doesn't change anything to the HDT format itself, right?

Yes it's just a performance improvement...

using a full binary search in Bitmap375 and a buffered output stream to write the HDT (dd164e0)

AlyHdr changed the title ~~Performance improvement in creating the predicates index~~ Performance improvement in creating the extra index Jan 12, 2022

mielvds added this to the 2.1.3 milestone Jan 12, 2022

mielvds merged commit 71284fa into rdfhdt:master Jan 12, 2022

ate47 mentioned this pull request Jan 16, 2023

Disk co-indexing with KWay merge #187

Merged
