Performance improvement in creating the extra index #140
This PR introduces a major performance improvement in creating the extra predicates index. The main fix is in Bitmap375, which represents the bitmaps of the triples. Its select1(long n) operation finds the position at which the n-th 1 bit occurs. We use this operation to get the number of distinct subtrees in the SPO forest (i.e. to get the position of the parent), as arranged in the figure below:
At its core, this operation runs a binary search over the array of longs that represents the bits of the bitmap, to find the block in which the n-th bit set to 1 is located.
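A minimal sketch of how such a select1 can work, for illustration only. The class `Select1Sketch`, its `onesBefore` array of cumulative bit counts, and the 0-based return convention are assumptions made for this example and are not taken from Bitmap375, whose internal layout may differ.

```java
// Hypothetical sketch, not the Bitmap375 implementation.
final class Select1Sketch {
    private final long[] words;      // the bitmap, 64 bits per long
    private final long[] onesBefore; // onesBefore[i] = number of 1 bits in words[0..i)

    Select1Sketch(long[] words) {
        this.words = words;
        this.onesBefore = new long[words.length + 1];
        for (int i = 0; i < words.length; i++) {
            onesBefore[i + 1] = onesBefore[i] + Long.bitCount(words[i]);
        }
    }

    /** Returns the 0-based position of the n-th 1 bit (n >= 1), or -1 if fewer than n bits are set. */
    long select1(long n) {
        if (n < 1 || n > onesBefore[words.length]) {
            return -1;
        }
        // Binary search for the first word w with onesBefore[w + 1] >= n.
        // Empty words produce repeated counts, so the search must keep
        // narrowing to the leftmost candidate instead of stopping on any hit.
        int lo = 0;
        int hi = words.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (onesBefore[mid + 1] < n) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // Locate the bit inside the word by clearing lower set bits.
        long remaining = n - onesBefore[lo];
        long word = words[lo];
        while (remaining > 1) {
            word &= word - 1; // clears the lowest set bit
            remaining--;
        }
        return (long) lo * 64 + Long.numberOfTrailingZeros(word);
    }
}
```

Note that runs of empty blocks make the cumulative counts repeat, which is exactly the situation where the behaviour of the binary search on duplicate values matters.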
We first removed the SortUtils binary search because it had a bug with bigger indexes. We also noticed that the search stops at the first hit and then moves linearly through the array until it reaches the position of the first match. This can be catastrophic on large arrays, especially ones with many repeated values, because the linear walk can degrade the lookup from logarithmic to linear time. Take the example below:
arr = [1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4]
key = 2
task: binary search to find the first position of 2
solution: keep narrowing the binary search until it lands on the first position of 2, instead of stopping at the first hit and walking back linearly.
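The idea can be sketched as a leftmost ("lower bound") binary search. This is an illustrative sketch, not the code from this PR; the class and method names are hypothetical.

```java
// Hypothetical helper illustrating the leftmost binary search.
final class FirstOccurrence {
    /**
     * Returns the index of the first element equal to key in a sorted
     * (ascending) array, or -1 if key is not present.
     */
    static int firstIndexOf(long[] arr, long key) {
        int lo = 0;
        int hi = arr.length; // search the half-open range [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (arr[mid] < key) {
                lo = mid + 1; // first match, if any, is to the right of mid
            } else {
                hi = mid;     // arr[mid] >= key: keep mid as a candidate, go left
            }
        }
        return (lo < arr.length && arr[lo] == key) ? lo : -1;
    }
}
```

For the array above and key = 2, `firstIndexOf` returns index 1 after a logarithmic number of comparisons, with no linear walk back over the repeated values.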
This solution speeds up the generation of the index tremendously. We tested it on a 5.6 GB HDT file with 8 GB of RAM and got the following results:
Current implementation:
With the new solution:
The time it takes to generate the extra index goes down from 1 hour 15 minutes to 2 minutes 38 seconds.
We also added a buffered output stream for writing the file to disk at the end of the index creation, which speeds up the write to disk as well.
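As a rough illustration of the buffering change, here is a sketch that wraps the file stream in a `BufferedOutputStream`, so that many small writes are batched into fewer system calls. The class name, method name, and the 64 KiB buffer size are assumptions for this example and are not taken from the PR.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not the writer used in this PR.
final class BufferedIndexWriter {
    /** Writes each value as 8 big-endian bytes through a buffered stream. */
    static void writeLongs(Path file, long[] values) {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file), 1 << 16))) {
            for (long v : values) {
                out.writeLong(v); // lands in the buffer, flushed in large chunks
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Closing the outermost stream via try-with-resources flushes the buffer, so no explicit `flush()` call is needed.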