This PR contains a version of the first generateIndex method for the co-index (.hdt.index* file), but implemented using the k-way merge sort method of #162. I was trying to index the Wikidata NT file (17.5B triples) on my machine with only 16GB of RAM and noticed that even with the optimized implementation of #140 and the disk version of #178, it wasn't able to handle such a large number of triples (too many random accesses).

With this version I was able to create the co-index in 5h17min (the other implementation was stopped after 8h).

A lot of changes were made; I tried to optimize the previous implementation before working on the new one.
API changes

Add default implementations for HDTOptions and add options to configure the indexing (see the example after this list):

- `bitmaptriples.indexmethod` - Indexing method for the bitmap triples, can be used with:
  - `recommended` - Recommended implementation, default value
  - `optimized` - Memory-optimized option (current default for `recommended`)
  - `legacy` - Legacy implementation, fast, but memory inefficient
  - `disk` - Disk option, handles the indexing on disk to reduce memory usage
- `bitmaptriples.indexmethod.disk.compressWorker` - Number of cores used to index the HDT with the `disk` index method
- `bitmaptriples.indexmethod.disk.chunkSize` - Maximum size of a chunk for the `disk` index method
- `bitmaptriples.indexmethod.disk.fileBufferSize` - Size of the file buffers for the `disk` index method
- `bitmaptriples.indexmethod.disk.maxFileOpen` - Maximum number of files the `disk` index method can open at the same time
- `bitmaptriples.indexmethod.disk.kway` - log of the number of ways the system can merge in the `disk` index method
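As a minimal sketch of how these options could be used (the `mapIndexedHDT` overload taking an `HDTOptions` spec, the file name, and the tuning values are assumptions here, check the API you are on):

```java
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTOptions;
import org.rdfhdt.hdt.options.HDTSpecification;

public class DiskIndexExample {
    public static void main(String[] args) throws Exception {
        HDTOptions spec = new HDTSpecification();
        // select the disk-based co-index generation
        spec.set("bitmaptriples.indexmethod", "disk");
        // tuning values are placeholders, adjust them to your hardware
        spec.set("bitmaptriples.indexmethod.disk.compressWorker", "4");
        spec.set("bitmaptriples.indexmethod.disk.kway", "3"); // 2^3 = 8-way merge

        // map the HDT and build the .hdt.index* co-index if it is missing
        try (HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", spec, null)) {
            System.out.println(hdt.getTriples().getNumberOfElements() + " triples");
        }
    }
}
```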
CLI changes

Add `-options [opt]`, `-config [file]` and `-color` to hdtSearch.
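For example (a sketch; the `key=value;key=value` syntax inside `-options` is an assumption, check the CLI help for the exact form):

```
hdtSearch -options "bitmaptriples.indexmethod=disk" -color wikidata.hdt
```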
CORE changes

BitmapXBig

Implementation of Bitmap375Big and Bitmap64Big, working with both disk and memory bitmaps and using LargeArrays if required, removing the limit of 128B elements and allowing Bitmap375 to have a disk co-index. The previous implementations are now deprecated and were replaced in the library with the new ones.
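A rough usage sketch of the dual-backend design (the `memory`/`disk` factory names and their signatures are assumptions, not confirmed API):

```java
import java.nio.file.Path;

import org.rdfhdt.hdt.compact.bitmap.Bitmap375Big;

public class BitmapBigExample {
    public static void main(String[] args) {
        // sizes above the old 128B-element limit are now possible too
        long size = 1L << 20;

        // in-memory variant, backed by LargeArrays when needed (assumed factory)
        Bitmap375Big memory = Bitmap375Big.memory(size);
        memory.set(42, true);
        System.out.println(memory.rank1(42)); // number of 1-bits up to position 42

        // disk variant, so the co-index can live outside the heap (assumed factory)
        Bitmap375Big disk = Bitmap375Big.disk(Path.of("bitmap.bin"), size);
        disk.set(42, true);
    }
}
```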
LongLargeArray bug

A bug was found in LongLargeArray preventing a fast zero-fill of the arrays; given how rarely this library is updated, a fix was added to IOUtil instead.
LongArrayDisk bug

Another bug was found in LongArrayDisk; a fix is included in this PR.
KWay sort to create index

The legacy code was using ArrayLists to sort the seqZ/bitmapZ/seqY IDs to create the object index, but it was taking a lot of memory. The optimized implementation was using Bitmap375 rank/select operations to reduce the memory usage, but it still needed a lot of memory, and the accesses were too random for a disk version to work. So I used the k-way merger of the disk generation method to sort the IDs as the legacy version did.

To store the chunks during the sort, I applied a basic compression, storing the IDs as deltas instead of plain values; it saved 120GB for a 17.5B-triple seqZ, leading to a maximum chunk size of 80GB, but maybe it can be optimized to reduce the time/disk usage. A sketch of the idea is shown below.
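The delta encoding itself is simple; this is a minimal sketch of the idea (not the classes used in the PR), assuming each chunk is already sorted so the deltas are non-negative and can be VByte-encoded:

```java
import java.io.IOException;
import java.io.OutputStream;

public final class DeltaChunkWriter {
    /**
     * Writes a sorted chunk of ids as VByte-encoded deltas: small gaps
     * between consecutive ids take one byte instead of eight.
     */
    public static void writeChunk(long[] sortedIds, OutputStream out) throws IOException {
        long previous = 0;
        for (long id : sortedIds) {
            long delta = id - previous; // >= 0 because the chunk is sorted
            previous = id;
            // VByte: 7 payload bits per byte, the high bit marks the last byte
            while (delta >= 0x80) {
                out.write((int) (delta & 0x7F));
                delta >>>= 7;
            }
            out.write((int) (delta | 0x80));
        }
    }
}
```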
Tests
Obviously everything is tested with unit tests.
EDIT

After reindexing the Wikidata HDT, I have these results, more accurate than the 5h described previously: