# generateHDTDisk: a merge sort on disk to create HDTs #162
This pull request adds a disk-based HDT generation method: `generateHDTDisk`. This method uses a merge sort to merge the sections and the triples. It is only available to create `FourSectionDictionary`-based HDTs. It allows creating an HDT without having enough memory to load the dataset into memory.
## API Changes
### HDTManager
In the `HDTManager` abstract class, we now have 4 new methods to implement. A method consuming an `InputStream` was asked for in #86, so I've added it for both the Disk and the Memory generate. These methods also allow the usage of compressed files with the addition of `org.rdfhdt.hdt.enums.CompressionType`, to uncompress without relying on the file name. Added to that, multiple overloads to use these implementations with `HDTManager.generateHDTDisk(...)`.
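A minimal usage sketch, assuming an overload taking a file name, base URI, notation, compression type, options and listener (the exact parameter order and overload set may differ from the final API; the paths are hypothetical):

```java
import org.rdfhdt.hdt.enums.CompressionType;
import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTOptionsKeys;
import org.rdfhdt.hdt.options.HDTSpecification;

public class GenerateDiskExample {
    public static void main(String[] args) throws Exception {
        HDTSpecification spec = new HDTSpecification();
        // working directory for the on-disk merge sort (see the Options section)
        spec.set(HDTOptionsKeys.LOADER_DISK_LOCATION_KEY, "/tmp/hdt-work");

        // disk-based generation from an N-Triples file; a compressed input
        // would pass the matching CompressionType instead of NONE
        try (HDT hdt = HDTManager.generateHDTDisk(
                "dataset.nt",            // hypothetical input file
                "http://example.org/#",  // base URI
                RDFNotation.NTRIPLES,
                CompressionType.NONE,
                spec,
                null /* ProgressListener */)) {
            hdt.saveToHDT("dataset.hdt", null);
        }
    }
}
```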
### org.rdfhdt.hdt.options.HDTOptionsKeys
To specify the specs of `HDTOptions`, users previously had to copy plain strings from the documentation; instead, I've created this utility class to expose the key names. I've also added some keys/values from the `generateHDT` method.
### UnicodeEscape fix
A small fix was made in this commit to the `UnicodeEscape#escapeString(String, Appendable)` method for the case where the Unicode delimiter isn't specified (no `""` or `<>`).
### org.rdfhdt.hdt.listener.MultiThreadListener
The current implementation of `ProgressListener` wasn't taking multi-threaded computations into account. To fix this issue, I've added a new `ProgressListener` type, the multi-thread listener. It works like a progress listener, but also carries the originating thread.
An implementation was created in the HDT Java Command line Tools module.
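A minimal console sketch, assuming the interface extends `ProgressListener` with a thread-name parameter (the exact method shape is an assumption):

```java
import org.rdfhdt.hdt.listener.MultiThreadListener;

public class ConsoleMultiThreadListener implements MultiThreadListener {
    // assumed 3-argument variant carrying the originating thread's name
    @Override
    public void notifyProgress(String thread, float level, String message) {
        System.out.printf("[%s] %5.1f%% %s%n", thread, level, message);
    }

    // plain ProgressListener calls are attributed to the current thread
    @Override
    public void notifyProgress(float level, String message) {
        notifyProgress(Thread.currentThread().getName(), level, message);
    }
}
```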
## Core Changes
### PlainHeader
This pull request contains a fix for loaded/mapped HDTs: the header didn't contain the baseUri.
### Generate Disk
This method is split into multiple phases. The parser reads the RDF source only once, so the implementation is the same for a File (String), an InputStream, or an Iterator of TripleString.
#### Write triples/merge sections
For each triple, we assign a new id to each node: ((s, sid), (p, pid), (o, oid)). We attach these ids to the components and, at the same time, sort the components into 3 section files with a merge sort. The ids are the index of the triple, so we don't need to store the triples themselves. At the end, we have 3 sorted compressed section files with an id attached to each string (node, node_id), for subject, predicate and object.
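A simplified sketch of the subject side of this phase, with hypothetical types (the real implementation writes compressed section files and merges the sorted runs afterwards):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical pair: a node string plus the id of the triple it came from.
record NodeId(String node, long id) {}

class SubjectChunkWriter {
    private final List<NodeId> chunk = new ArrayList<>();
    private final int chunkSize;
    private long tripleId = 0;

    SubjectChunkWriter(int chunkSize) { this.chunkSize = chunkSize; }

    // Called once per parsed triple: the triple index becomes the node id,
    // so the triple itself never has to be stored.
    void addSubject(String subject) {
        chunk.add(new NodeId(subject, ++tripleId));
        if (chunk.size() >= chunkSize) flushChunk();
    }

    // Sort the in-memory chunk by node and write it as one sorted run;
    // the runs are later k-way merged into a single sorted section file.
    private void flushChunk() {
        chunk.sort(Comparator.comparing(NodeId::node));
        // writeRunToDisk(chunk); // hypothetical: serialize the sorted run
        chunk.clear();
    }
}
```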
#### Create sections/id map files
With the raw triples, we create the 4 sections, removing duplicates and extracting the shared elements. At the same time, we fill 3 map files (`SequenceLog64BigDisk`) to be able to map an initial node id (sid, pid or oid) to a position in one of the 4 sections; we use a Sequence to reduce the disk usage of the maps. We mark duplicates with the 1st bit and shared elements with the 2nd bit; the other bits are the id in the section for non-duplicates, or the id of the original for duplicates.
So for example, if we have:

- `0b1100` -> non-shared element with index 3 (`0b11`)
- `0b1110` -> shared element with index 3 (`0b11`)
- `0b1101` -> duplicate element; the section index is in the map at index 3 (`0b11`)

These flag values can be decoded as in the sketch below. The dictionary is completed.
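A minimal sketch of this encoding, assuming the two flag bits sit in the lowest positions (class and method names are hypothetical):

```java
// Decode a map entry following the scheme above: bit 1 = duplicate,
// bit 2 = shared, remaining bits = index.
final class SectionIdFlags {
    static final long DUPLICATE_BIT = 0b01;
    static final long SHARED_BIT = 0b10;

    static boolean isDuplicate(long value) { return (value & DUPLICATE_BIT) != 0; }
    static boolean isShared(long value)    { return (value & SHARED_BIT) != 0; }
    static long index(long value)          { return value >>> 2; }

    public static void main(String[] args) {
        // 0b1100 -> not duplicate, not shared, index 3
        // 0b1110 -> shared, index 3
        // 0b1101 -> duplicate, original at map index 3
        for (long v : new long[]{0b1100, 0b1110, 0b1101}) {
            System.out.printf("0b%s -> duplicate=%b shared=%b index=%d%n",
                    Long.toBinaryString(v), isDuplicate(v), isShared(v), index(v));
        }
    }
}
```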
#### Map triples with section IDs/merge triples
During the first step, we created nodes in the sections with incremental ids 1..numTriples, so we simply map them using the maps created during the second step, and sort them with a merge sort.
#### Create triples
With the triples sorted, we can create the bitmap of the triples.
The triples are completed.
#### Create header
Simply create the header with the Dictionary/Triples parts. The original size isn't computed the same way as in the generateHDT memory method, so the value can differ. The Header and the HDT are completed.
## Options
Findable in `HDTOptionsKeys`, the generate method can be configured with multiple option keys (a configuration sketch follows the list):
- `LOADER_DISK_COMPRESSION_MODE_KEY`: change the sort method; it can take 2 values (complete or partial sort).
- `LOADER_DISK_COMPRESSION_WORKER_KEY`: (for complete sort only) the maximum count of workers to merge the files.
- `LOADER_DISK_CHUNK_SIZE_KEY`: the maximum size of a chunk to merge-sort; by default it is 85% of one third of the allocated RAM.
- `LOADER_DISK_LOCATION_KEY`: set the working directory; by default it is a temporary folder, created (mkdirs) before and deleted after usage.
- `LOADER_DISK_FUTURE_HDT_LOCATION_KEY`: set the future HDT location; if this value is set, the method generates the HDT file and maps it, which reduces RAM usage. By default, the method loads the HDT into memory without creating a file.
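A minimal configuration sketch using these keys (the paths, the chunk-size value and its unit are assumptions):

```java
import org.rdfhdt.hdt.options.HDTOptionsKeys;
import org.rdfhdt.hdt.options.HDTSpecification;

public class DiskGenerationOptions {
    public static HDTSpecification build() {
        HDTSpecification spec = new HDTSpecification();
        // working directory for the temporary sort files (hypothetical path)
        spec.set(HDTOptionsKeys.LOADER_DISK_LOCATION_KEY, "/tmp/hdt-work");
        // write the final HDT to disk and map it instead of loading it in memory
        spec.set(HDTOptionsKeys.LOADER_DISK_FUTURE_HDT_LOCATION_KEY, "/tmp/dataset.hdt");
        // cap the merge-sort chunk size; value and unit (bytes) are assumptions
        spec.set(HDTOptionsKeys.LOADER_DISK_CHUNK_SIZE_KEY, String.valueOf(64L << 20));
        return spec;
    }
}
```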
## Tests
To test this method, I'm generating 2 HDTs, one with generateHDT and one with generateHDTDisk, with map/load and partial/complete sort, and checking the equality of the 2 HDTs (see the sketch after this paragraph). Some other tests are also present to test the writer/reader of the compression files and the mapping.
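The core of the equality check can be pictured like this simplified sketch (the real tests also cover the map/load and partial/complete sort combinations; file name and base URI are hypothetical):

```java
import org.rdfhdt.hdt.enums.CompressionType;
import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;
import org.rdfhdt.hdt.triples.IteratorTripleString;

public class DiskVsMemoryCheck {
    public static void main(String[] args) throws Exception {
        HDTSpecification spec = new HDTSpecification();
        String file = "dataset.nt";             // hypothetical input
        String base = "http://example.org/#";
        try (HDT memory = HDTManager.generateHDT(file, base,
                     RDFNotation.NTRIPLES, spec, null);
             HDT disk = HDTManager.generateHDTDisk(file, base,
                     RDFNotation.NTRIPLES, CompressionType.NONE, spec, null)) {
            // iterate both HDTs in SPO order and compare the triples one by one
            IteratorTripleString a = memory.search("", "", "");
            IteratorTripleString b = disk.search("", "", "");
            while (a.hasNext() && b.hasNext()) {
                String ta = a.next().toString();
                String tb = b.next().toString();
                if (!ta.equals(tb)) throw new AssertionError(ta + " != " + tb);
            }
            if (a.hasNext() || b.hasNext()) throw new AssertionError("size mismatch");
        }
    }
}
```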
## HDT Java Command line Tools Changes
Two new parameters were added to the rdf2hdt tool:
- `-disk`: specify that we want to use the generateHDTDisk version instead of generateHDT to create the HDT.
- `-disklocation [workingLocation]`: specify the working directory, a shortcut for the `LOADER_DISK_LOCATION_KEY` option.

For the disk generation, the new `MultiThreadListener` is used.

For HDTVerify, it also checks for duplicated elements and prints the current section.
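For example, a possible invocation (the argument order follows the existing rdf2hdt usage and is an assumption):

```
rdf2hdt -disk -disklocation /tmp/hdt-work dataset.nt dataset.hdt
```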