Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disk co-indexing with KWay merge #187

Merged
merged 1 commit into from
Jan 16, 2023
Merged

Conversation

ate47
Copy link
Contributor

@ate47 ate47 commented Jan 16, 2023

This PR contains a version of the first generateIndex method for the co-index (.hdt.index* file), but implemented using the k-ways merge sort method of #162, I was trying to index Wikidata nt file (17.5B triples) with my machine with only 16GB of RAM and noticed that even with the optimized implementation of #140 with the disk version #178, it wasn't able to handle the large amount of triples (too many random accesses)

With this version I was able to create the co-index in 5h17min. (the other implementation was stopped after 8h)

A lot of changes were made, I tried to optimize the previous implementation before working on the new one.

API changes

Add default implementations for HDTOptions and add the options to config the indexing:

  • bitmaptriples.indexmethod - Indexing method for the bitmap triples, can be used with:
    • recommended - Recommended implementation, default value
    • optimized - Memory optimized option (current default for recommended)
    • legacy - Legacy implementation, fast, but memory inefficient
    • disk - Disk option, handle the indexing on disk to reduce usage
  • bitmaptriples.indexmethod.disk.compressWorker - Number of core used to index the HDT with disk index method.
  • bitmaptriples.indexmethod.disk.chunkSize - Maximum size of a chunk for the disk index method
  • bitmaptriples.indexmethod.disk.fileBufferSize - Size of the file buffers for the disk index method
  • bitmaptriples.indexmethod.disk.maxFileOpen - Maximum number of file disk index method can open at the same time
  • bitmaptriples.indexmethod.disk.kway - log of the number of way the system can merge in disk index method

CLI changes

add -options [opt], -config [file] and -color to hdtSearch.

CORE changes

BitmapXBig

Implementation of Bitmap375Big and Bitmap64Big, working with both disk and memory bitmaps, use LargeArrays if required, removing the limit of 128B elements and allowing the Bitmap375 to have a disk co-index. The previous implementations are now deprecated and were replaced in the library with the new implementation.

LongLargeArray bug

A bug was found in LongLargeArray preventing a fast 0 set of the arrays, knowing how not updated this library is, a fix was added to IOUtil.

LongArrayDisk bug

Another bug to the LongArrayDisk was found, a fix is contained in this method.

// LongArrayDisk#192
mapping.position(startBuffer); // before
mapping.position(startBuffer + shift); // after

KWay sort to create index

The legacy code was using ArrayLists to sort the seqZ/bitmapZ/seqY ids to create the object index, but it was taking a lot of memory, the optimized implementation was using bitmap375 to use the rank/select operations to reduce the memory usage, but it was still a lot of memory and the accesses were more random so a disk version can't work. So I used the kway merger of the disk generation method to sort like with the legacy version.

To store the chunks during the sort, I've applied a basic compression to store the ids with the delta instead of the plain values, it saved 120GB for a 17.5B seqZ, leading to max chunk of 80GB, but maybe it can be optimized to reduce the time/disk usage.

Tests

Obviously everything is tested with unit tests.

EDIT

After reindeing the wikidata HDT, I have these results, more accurate than the 5h described previously:

2023-01-18 19:33:43.583  INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples  : Pair sorted in 2 hour 36 min 13 sec 277 ms 753 us
2023-01-18 20:18:30.249  INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples  : indexZ/bitmapIndexZ completed in 44 min 46 sec 656 ms 818 us
2023-01-18 20:24:59.594  INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples  : Predicate count completed in 6 min 29 sec 327 ms 128 us
2023-01-18 20:24:59.604  INFO 3308 --- [nio-1234-exec-1] o.rdfhdt.hdt.triples.impl.BitmapTriples  : Index generated in 6 min 29 sec 340 ms 405 us
2023-01-18 20:52:32.084  INFO 3308 --- [nio-1234-exec-1] o.r.h.triples.impl.PredicateIndexArray   : Predicate Bitmap in 27 min 32 sec 364 ms 128 us
2023-01-18 21:49:38.413  INFO 3308 --- [nio-1234-exec-1] o.r.h.triples.impl.PredicateIndexArray   : Count predicates in 57 min 6 sec 304 ms 620 us
Index generated and saved in 5 hour 30 min 34 sec 753 ms 598 us

@D063520
Copy link
Contributor

D063520 commented Jan 16, 2023

great, thank you for this contribution!!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants