Performance improvement in creating the extra index #140

Merged
merged 1 commit into from
Jan 12, 2022
AlyHdr
@AlyHdr AlyHdr commented Jan 12, 2022

This PR introduces a major performance improvement in creating the extra predicates index. The main fix is in Bitmap375, which represents the bitmaps of the triples. Its select1(long n) operation finds the position at which n ones have appeared, i.e. the position of the n-th set bit. The idea behind this operation is to get the number of distinct subtrees in the forest of the SPO (i.e. to get the position of the parent), as arranged in the figure below:

[Figure: Screenshot 2022-01-11 at 16 45 06]

At its core, this operation uses a binary search over the array of longs representing the bits in the bitmap, to find the block in which the n-th bit set to 1 falls.
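To make the operation concrete, here is a minimal sketch of a select1 built on per-word cumulative popcounts. It is not the actual Bitmap375 code: the class name `SelectSketch`, the `onesBefore` array, the one-word block size, and the 0-based position convention are all assumptions made for illustration.

```java
/** Illustrative select1 over a bit array. Hypothetical class, not the Bitmap375 implementation. */
final class SelectSketch {
    private final long[] words;      // 64 bits per word; bit i lives in words[i / 64]
    private final long[] onesBefore; // onesBefore[b] = number of 1 bits in words[0..b-1]

    SelectSketch(long[] words) {
        this.words = words.clone();
        this.onesBefore = new long[words.length + 1];
        for (int b = 0; b < words.length; b++) {
            onesBefore[b + 1] = onesBefore[b] + Long.bitCount(words[b]);
        }
    }

    /** Returns the 0-based position of the n-th set bit (n >= 1), or -1 if fewer than n bits are set. */
    long select1(long n) {
        if (n < 1 || n > onesBefore[words.length]) {
            return -1;
        }
        // Binary search for the FIRST word whose cumulative count reaches n.
        // Empty words repeat the same cumulative value, so the leftmost match matters.
        int lo = 0;
        int hi = words.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (onesBefore[mid + 1] >= n) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        // Inside the word, drop the lowest set bit until the wanted one is the lowest.
        long remaining = n - onesBefore[lo];
        long w = words[lo];
        for (long i = 1; i < remaining; i++) {
            w &= w - 1;
        }
        return 64L * lo + Long.numberOfTrailingZeros(w);
    }
}
```

Note that words containing no set bits produce runs of equal values in the cumulative array, which is exactly the repeated-values situation discussed in this PR.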

We first removed the SortUtils binary search because it had a bug with bigger indexes. We also noticed that the search stops on the first hit and then moves linearly through the array until it reaches the position of the first match. This can be catastrophic for large arrays, especially ones with many repeated values. Take the example below:

arr = [ 1 ,2 ,2 ,2 ,2 ,2 ,2 ,2, 3 ,3 ,4 ]
key = 2
task: binary search to find the first position of 2
solution: keep doing the binary search until it converges on the first position of 2, instead of stopping on the first hit and walking back linearly.
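The idea in the example above can be sketched as a leftmost (lower-bound) binary search. This is an illustration only, not the code merged in this PR; the class and method names are hypothetical.

```java
/** Leftmost binary search sketch. Hypothetical helper, not the code from this PR. */
final class LeftmostBinarySearch {

    /** Returns the index of the first occurrence of key in the sorted array, or -1 if absent. */
    static int firstIndexOf(long[] arr, long key) {
        int lo = 0;
        int hi = arr.length; // search window is [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (arr[mid] < key) {
                lo = mid + 1; // first occurrence must be to the right of mid
            } else {
                hi = mid;     // arr[mid] >= key: keep mid as a candidate, do not stop here
            }
        }
        return (lo < arr.length && arr[lo] == key) ? lo : -1;
    }

    public static void main(String[] args) {
        long[] arr = {1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4};
        System.out.println(firstIndexOf(arr, 2)); // prints 1
    }
}
```

Because the loop never exits early on a hit, the cost stays logarithmic in the array length even when the key is repeated many times, whereas stopping at the first hit and stepping back degrades to linear time on long runs of equal values.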

This solution speeds up the generation of the index tremendously. We tested it on a 5.6 GB HDT file on a machine with 8 GB of RAM and got the following results:

Current implementation:

[INFO] Scanning for projects...
[INFO] 
[INFO] ----------------------< org.rdfhdt:hdt-java-cli >-----------------------
[INFO] Building HDT Java Command line Tools 2.1.3-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ hdt-java-cli ---
Predicate Bitmap in 6 sec 87 ms 544 us
Count predicates in 36 min 39 sec 442 ms 61 us
Count Objects in 15 sec 208 ms 517 us Max was: 34075063
Bitmap in 434 ms 194 us
Object references in 22 min 38 sec 636 ms 255 us
Sort object sublists in 51 sec 242 ms 391 us
Count predicates in 4 sec 874 ms 823 us
Index generated in 23 min 50 sec 400 ms 526 us
Index generated and saved in 1 hour 15 min 4 sec 576 ms 469 us
>> exit
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:21 h
[INFO] Finished at: 2022-01-12T13:14:10+01:00
[INFO] ------------------------------------------------------------------------

With the new solution:

[INFO] Scanning for projects...
[INFO] 
[INFO] ----------------------< org.rdfhdt:hdt-java-cli >-----------------------
[INFO] Building HDT Java Command line Tools 2.1.3-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ hdt-java-cli ---
Predicate Bitmap in 4 sec 973 ms 900 us
Count predicates in 26 sec 991 ms 822 us
Count Objects in 8 sec 520 ms 664 us Max was: 34075063
Bitmap in 307 ms 68 us
Object references in 55 sec 125 ms 906 us
Sort object sublists in 50 sec 222 ms 213 us
Count predicates in 4 sec 548 ms 744 us
Index generated in 1 min 58 sec 725 ms 100 us
Index generated and saved in 2 min 38 sec 255 ms 791 us
>> exit
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:02 min
[INFO] Finished at: 2022-01-12T13:19:46+01:00
[INFO] ------------------------------------------------------------------------

We can see that the time it takes to generate and save the extra index goes down from 1 hour 15 min to 2 min 38 sec, roughly a 28x speedup!

We also added a buffered output stream for writing the file to disk at the end of the index creation, which speeds up the write to disk as well.
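As a rough illustration of the buffering change, the sketch below wraps a file stream in a `BufferedOutputStream` so that many small writes are batched into fewer system calls. The class name, method name, and 64 KiB buffer size are assumptions for the example and are not taken from the PR.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Buffered write sketch. Hypothetical helper, not the code from this PR. */
final class BufferedIndexWriter {

    /** Writes each value as 8 big-endian bytes through a 64 KiB buffer (assumed size). */
    static void writeLongs(Path target, long[] values) throws IOException {
        try (OutputStream raw = Files.newOutputStream(target);
             DataOutputStream out = new DataOutputStream(new BufferedOutputStream(raw, 1 << 16))) {
            for (long v : values) {
                out.writeLong(v); // lands in the in-memory buffer, not directly on disk
            }
        } // closing the outer stream flushes the buffer and closes the file
    }
}
```

Without the buffer, every `writeLong` call would go to the underlying file stream as its own small write.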

@AlyHdr AlyHdr changed the title from "Performance improvement in creating the predicates index" to "Performance improvement in creating the extra index" Jan 12, 2022
@D063520
Contributor

D063520 commented Jan 12, 2022

I worked on this too, and I'm in favor of these changes. To summarize:

  1. We added a buffered stream to write to the file.
  2. It looks like many changes, but most of them are just tab alignment changes. We replaced the binary search with a slightly modified one (restating the above):

arr = [ 1 ,2 ,2 ,2 ,2 ,2 ,2 ,2, 3 ,3 ,4 ]
key = 2
task: binary search to find the first position of 2
problem: a classical binary search would return a 2 somewhere in the middle of the run, but we are searching for the one at the beginning.
solution: keep doing the binary search until it converges on the first position of 2, instead of stopping on the first hit and walking back linearly.

With these changes, index creation achieves performance similar to the C++ version.

@mielvds
Member

mielvds commented Jan 12, 2022

Cool! If I understand correctly, this doesn't change anything in the HDT format itself, right?

@AlyHdr
Author

AlyHdr commented Jan 12, 2022

> Cool! If I understand correctly, this doesn't change anything in the HDT format itself, right?

Yes, it's just a performance improvement.

@mielvds mielvds added this to the 2.1.3 milestone Jan 12, 2022
@mielvds mielvds merged commit 71284fa into rdfhdt:master Jan 12, 2022
3 participants