This PR contains a version of the first generateIndex method for the co-index (.hdt.index* file), but implemented using the k-way merge sort method of #162. I was trying to index the Wikidata NT file (17.5B triples) on my machine with only 16GB of RAM and noticed that even with the optimized implementation of #140 and the disk version of #178, it wasn't able to handle such a large number of triples (too many random accesses).

With this version I was able to create the co-index in 5h17min (the other implementation was stopped after 8h).

A lot of changes were made; I tried to optimize the previous implementation before working on the new one.
API changes

Add default implementations for HDTOptions and add options to configure the indexing (see the example after this list):

- `bitmaptriples.indexmethod` - Indexing method for the bitmap triples, can be used with:
  - `recommended` - Recommended implementation, default value
  - `optimized` - Memory-optimized option (current default for `recommended`)
  - `legacy` - Legacy implementation, fast, but memory inefficient
  - `disk` - Disk option, handles the indexing on disk to reduce memory usage
- `bitmaptriples.indexmethod.disk.compressWorker` - Number of cores used to index the HDT with the `disk` index method
- `bitmaptriples.indexmethod.disk.chunkSize` - Maximum size of a chunk for the `disk` index method
- `bitmaptriples.indexmethod.disk.fileBufferSize` - Size of the file buffers for the `disk` index method
- `bitmaptriples.indexmethod.disk.maxFileOpen` - Maximum number of files the `disk` index method can open at the same time
- `bitmaptriples.indexmethod.disk.kway` - log of the number of ways the system can merge in the `disk` index method
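As a minimal sketch of how these options could be used (the `mapIndexedHDT` overload taking an `HDTOptions` spec, the file name, and the tuning values are assumptions here, check the API you are on):

```java
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTOptions;
import org.rdfhdt.hdt.options.HDTSpecification;

public class DiskIndexExample {
    public static void main(String[] args) throws Exception {
        HDTOptions spec = new HDTSpecification();
        // select the disk-based co-index generation
        spec.set("bitmaptriples.indexmethod", "disk");
        // tuning values are placeholders, adjust them to your hardware
        spec.set("bitmaptriples.indexmethod.disk.compressWorker", "4");
        spec.set("bitmaptriples.indexmethod.disk.kway", "3"); // 2^3 = 8-way merge

        // map the HDT and build the .hdt.index* co-index if it is missing
        try (HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", spec, null)) {
            System.out.println(hdt.getTriples().getNumberOfElements() + " triples");
        }
    }
}
```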
CLI changes

Add `-options [opt]`, `-config [file]` and `-color` to hdtSearch.
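For example (a sketch; the `key=value;key=value` syntax inside `-options` is an assumption, check the CLI help for the exact form):

```
hdtSearch -options "bitmaptriples.indexmethod=disk" -color wikidata.hdt
```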
CORE changes

BitmapXBig

Implementation of Bitmap375Big and Bitmap64Big, working with both disk and memory bitmaps and using LargeArrays if required, removing the limit of 128B elements and allowing Bitmap375 to have a disk co-index. The previous implementations are now deprecated and were replaced in the library with the new ones.
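A rough usage sketch of the dual-backend design (the `memory`/`disk` factory names and their signatures are assumptions, not confirmed API):

```java
import java.nio.file.Path;

import org.rdfhdt.hdt.compact.bitmap.Bitmap375Big;

public class BitmapBigExample {
    public static void main(String[] args) {
        // sizes above the old 128B-element limit are now possible too
        long size = 1L << 20;

        // in-memory variant, backed by LargeArrays when needed (assumed factory)
        Bitmap375Big memory = Bitmap375Big.memory(size);
        memory.set(42, true);
        System.out.println(memory.rank1(42)); // number of 1-bits up to position 42

        // disk variant, so the co-index can live outside the heap (assumed factory)
        Bitmap375Big disk = Bitmap375Big.disk(Path.of("bitmap.bin"), size);
        disk.set(42, true);
    }
}
```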
LongLargeArray bug

A bug was found in LongLargeArray preventing a fast zero-fill of the arrays; given how rarely this library is updated, a fix was added to IOUtil instead.
LongArrayDisk bug

Another bug was found in LongArrayDisk; a fix is included in this PR.
KWay sort to create index

The legacy code was using ArrayLists to sort the seqZ/bitmapZ/seqY IDs to create the object index, but it was taking a lot of memory. The optimized implementation was using Bitmap375 rank/select operations to reduce the memory usage, but it still needed a lot of memory, and the accesses were too random for a disk version to work. So I used the k-way merger of the disk generation method to sort the IDs as the legacy version did.

To store the chunks during the sort, I applied a basic compression, storing the IDs as deltas instead of plain values; it saved 120GB for a 17.5B-triple seqZ, leading to a maximum chunk size of 80GB, but maybe it can be optimized to reduce the time/disk usage. A sketch of the idea is shown below.
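The delta encoding itself is simple; this is a minimal sketch of the idea (not the classes used in the PR), assuming each chunk is already sorted so the deltas are non-negative and can be VByte-encoded:

```java
import java.io.IOException;
import java.io.OutputStream;

public final class DeltaChunkWriter {
    /**
     * Writes a sorted chunk of ids as VByte-encoded deltas: small gaps
     * between consecutive ids take one byte instead of eight.
     */
    public static void writeChunk(long[] sortedIds, OutputStream out) throws IOException {
        long previous = 0;
        for (long id : sortedIds) {
            long delta = id - previous; // >= 0 because the chunk is sorted
            previous = id;
            // VByte: 7 payload bits per byte, the high bit marks the last byte
            while (delta >= 0x80) {
                out.write((int) (delta & 0x7F));
                delta >>>= 7;
            }
            out.write((int) (delta | 0x80));
        }
    }
}
```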
Tests
Obviously everything is tested with unit tests.
EDIT

After reindexing the Wikidata HDT, I have these results, more accurate than the 5h described previously: