# generateHDTDisk: a merge sort on disk to create HDTs #162
This pull request adds a disk-based HDT generation method: `generateHDTDisk`. This method uses a merge sort to merge the sections and the triples. It is only available to create `FourSectionDictionary`-based HDTs. It allows creating an HDT without having enough memory to load the dataset into memory.
## API Changes
### HDTManager
In the `HDTManager` abstract class, we now have 4 new methods to implement. A method consuming an `InputStream` was asked for in #86, so I've added it for both the Disk and the Memory generate. These methods also allow the usage of compressed files with the addition of `org.rdfhdt.hdt.enums.CompressionType`, to uncompress without relying on the file name. Added to that, multiple overloads to use these implementations with `HDTManager.generateHDTDisk(...)`.
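A minimal usage sketch, assuming an overload taking a file name, base URI, notation, compression type, options and listener (the exact parameter order and overload set may differ from the final API; the paths are hypothetical):

```java
import org.rdfhdt.hdt.enums.CompressionType;
import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTOptionsKeys;
import org.rdfhdt.hdt.options.HDTSpecification;

public class GenerateDiskExample {
    public static void main(String[] args) throws Exception {
        HDTSpecification spec = new HDTSpecification();
        // working directory for the on-disk merge sort (see the Options section)
        spec.set(HDTOptionsKeys.LOADER_DISK_LOCATION_KEY, "/tmp/hdt-work");

        // disk-based generation from an N-Triples file; a compressed input
        // would pass the matching CompressionType instead of NONE
        try (HDT hdt = HDTManager.generateHDTDisk(
                "dataset.nt",            // hypothetical input file
                "http://example.org/#",  // base URI
                RDFNotation.NTRIPLES,
                CompressionType.NONE,
                spec,
                null /* ProgressListener */)) {
            hdt.saveToHDT("dataset.hdt", null);
        }
    }
}
```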
### org.rdfhdt.hdt.options.HDTOptionsKeys
To specify the specs of `HDTOptions`, users previously had to copy plain strings from the documentation; instead, I've created this utility class to expose the key names. I've also added some keys/values from the `generateHDT` method.
### UnicodeEscape fix
A small fix was made in this commit to the `UnicodeEscape#escapeString(String, Appendable)` method for the case where the Unicode delimiter isn't specified (no `""` or `<>`).
### org.rdfhdt.hdt.listener.MultiThreadListener
The current implementation of `ProgressListener` wasn't taking multi-threaded computations into account. To fix this issue, I've added a new `ProgressListener` type, the multi-thread listener. It works like a progress listener, but also carries the originating thread.
An implementation was created in the HDT Java Command line Tools module.
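A minimal console sketch, assuming the interface extends `ProgressListener` with a thread-name parameter (the exact method shape is an assumption):

```java
import org.rdfhdt.hdt.listener.MultiThreadListener;

public class ConsoleMultiThreadListener implements MultiThreadListener {
    // assumed 3-argument variant carrying the originating thread's name
    @Override
    public void notifyProgress(String thread, float level, String message) {
        System.out.printf("[%s] %5.1f%% %s%n", thread, level, message);
    }

    // plain ProgressListener calls are attributed to the current thread
    @Override
    public void notifyProgress(float level, String message) {
        notifyProgress(Thread.currentThread().getName(), level, message);
    }
}
```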
## Core Changes
### PlainHeader
This pull request contains a fix for loaded/mapped HDTs: the header didn't contain the baseUri.
### Generate Disk
This method is split into multiple phases. The parser reads the RDF source only once, so the implementation is the same for a File (String), an InputStream, or an Iterator of TripleString.
#### Write triples/merge sections
For each triple, we assign a new id to each node: ((s, sid), (p, pid), (o, oid)). We attach these ids to the components and, at the same time, sort the components into 3 section files with a merge sort. The ids are the index of the triple, so we don't need to store the triples themselves. At the end, we have 3 sorted compressed section files with an id attached to each string (node, node_id), for subject, predicate and object.
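A simplified sketch of the subject side of this phase, with hypothetical types (the real implementation writes compressed section files and merges the sorted runs afterwards):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical pair: a node string plus the id of the triple it came from.
record NodeId(String node, long id) {}

class SubjectChunkWriter {
    private final List<NodeId> chunk = new ArrayList<>();
    private final int chunkSize;
    private long tripleId = 0;

    SubjectChunkWriter(int chunkSize) { this.chunkSize = chunkSize; }

    // Called once per parsed triple: the triple index becomes the node id,
    // so the triple itself never has to be stored.
    void addSubject(String subject) {
        chunk.add(new NodeId(subject, ++tripleId));
        if (chunk.size() >= chunkSize) flushChunk();
    }

    // Sort the in-memory chunk by node and write it as one sorted run;
    // the runs are later k-way merged into a single sorted section file.
    private void flushChunk() {
        chunk.sort(Comparator.comparing(NodeId::node));
        // writeRunToDisk(chunk); // hypothetical: serialize the sorted run
        chunk.clear();
    }
}
```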
#### Create sections/id map files
With the raw triples, we create the 4 sections, removing duplicates and extracting the shared elements. At the same time, we fill 3 map files (`SequenceLog64BigDisk`) to be able to map an initial node id (sid, pid or oid) to a position in one of the 4 sections; we use a Sequence to reduce the disk usage of the maps. We mark duplicates with the 1st bit and shared elements with the 2nd bit; the other bits are the id in the section for non-duplicates, or the id of the original for duplicates.
So for example, if we have:

- `0b1100` -> non-shared element with index 3 (`0b11`)
- `0b1110` -> shared element with index 3 (`0b11`)
- `0b1101` -> duplicate element; the section index is in the map at index 3 (`0b11`)

These flag values can be decoded as in the sketch below. The dictionary is completed.
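A minimal sketch of this encoding, assuming the two flag bits sit in the lowest positions (class and method names are hypothetical):

```java
// Decode a map entry following the scheme above: bit 1 = duplicate,
// bit 2 = shared, remaining bits = index.
final class SectionIdFlags {
    static final long DUPLICATE_BIT = 0b01;
    static final long SHARED_BIT = 0b10;

    static boolean isDuplicate(long value) { return (value & DUPLICATE_BIT) != 0; }
    static boolean isShared(long value)    { return (value & SHARED_BIT) != 0; }
    static long index(long value)          { return value >>> 2; }

    public static void main(String[] args) {
        // 0b1100 -> not duplicate, not shared, index 3
        // 0b1110 -> shared, index 3
        // 0b1101 -> duplicate, original at map index 3
        for (long v : new long[]{0b1100, 0b1110, 0b1101}) {
            System.out.printf("0b%s -> duplicate=%b shared=%b index=%d%n",
                    Long.toBinaryString(v), isDuplicate(v), isShared(v), index(v));
        }
    }
}
```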
#### Map triples with section IDs/merge triples
During the first step, we created nodes in the sections with incremental ids 1..numTriples, so we simply map them using the maps created during the second step, and sort them with a merge sort.
#### Create triples
With the triples sorted, we can create the bitmap of the triples.
The triples are completed.
#### Create header
Simply create the header with the Dictionary/Triples parts. The original size isn't computed the same way as in the generateHDT memory method, so the value can differ. The Header and the HDT are completed.
## Options
Findable in `HDTOptionsKeys`, the generate method can be configured with multiple option keys (a configuration sketch follows the list):
- `LOADER_DISK_COMPRESSION_MODE_KEY`: change the sort method; it can take 2 values (complete or partial sort).
- `LOADER_DISK_COMPRESSION_WORKER_KEY`: (for complete sort only) the maximum count of workers to merge the files.
- `LOADER_DISK_CHUNK_SIZE_KEY`: the maximum size of a chunk to merge-sort; by default it is 85% of one third of the allocated RAM.
- `LOADER_DISK_LOCATION_KEY`: set the working directory; by default it is a temporary folder, created (mkdirs) before and deleted after usage.
- `LOADER_DISK_FUTURE_HDT_LOCATION_KEY`: set the future HDT location; if this value is set, the method generates the HDT file and maps it, which reduces RAM usage. By default, the method loads the HDT into memory without creating a file.
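A minimal configuration sketch using these keys (the paths, the chunk-size value and its unit are assumptions):

```java
import org.rdfhdt.hdt.options.HDTOptionsKeys;
import org.rdfhdt.hdt.options.HDTSpecification;

public class DiskGenerationOptions {
    public static HDTSpecification build() {
        HDTSpecification spec = new HDTSpecification();
        // working directory for the temporary sort files (hypothetical path)
        spec.set(HDTOptionsKeys.LOADER_DISK_LOCATION_KEY, "/tmp/hdt-work");
        // write the final HDT to disk and map it instead of loading it in memory
        spec.set(HDTOptionsKeys.LOADER_DISK_FUTURE_HDT_LOCATION_KEY, "/tmp/dataset.hdt");
        // cap the merge-sort chunk size; value and unit (bytes) are assumptions
        spec.set(HDTOptionsKeys.LOADER_DISK_CHUNK_SIZE_KEY, String.valueOf(64L << 20));
        return spec;
    }
}
```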
## Tests
To test this method, I'm generating 2 HDTs, one with generateHDT and one with generateHDTDisk, with map/load and partial/complete sort, and checking the equality of the 2 HDTs (see the sketch after this paragraph). Some other tests are also present to test the writer/reader of the compression files and the mapping.
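The core of the equality check can be pictured like this simplified sketch (the real tests also cover the map/load and partial/complete sort combinations; file name and base URI are hypothetical):

```java
import org.rdfhdt.hdt.enums.CompressionType;
import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;
import org.rdfhdt.hdt.triples.IteratorTripleString;

public class DiskVsMemoryCheck {
    public static void main(String[] args) throws Exception {
        HDTSpecification spec = new HDTSpecification();
        String file = "dataset.nt";             // hypothetical input
        String base = "http://example.org/#";
        try (HDT memory = HDTManager.generateHDT(file, base,
                     RDFNotation.NTRIPLES, spec, null);
             HDT disk = HDTManager.generateHDTDisk(file, base,
                     RDFNotation.NTRIPLES, CompressionType.NONE, spec, null)) {
            // iterate both HDTs in SPO order and compare the triples one by one
            IteratorTripleString a = memory.search("", "", "");
            IteratorTripleString b = disk.search("", "", "");
            while (a.hasNext() && b.hasNext()) {
                String ta = a.next().toString();
                String tb = b.next().toString();
                if (!ta.equals(tb)) throw new AssertionError(ta + " != " + tb);
            }
            if (a.hasNext() || b.hasNext()) throw new AssertionError("size mismatch");
        }
    }
}
```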
## HDT Java Command line Tools Changes
Two new parameters were added to the rdf2hdt tool:
- `-disk`: specify that we want to use the generateHDTDisk version instead of generateHDT to create the HDT.
- `-disklocation [workingLocation]`: specify the working directory, a shortcut for the `LOADER_DISK_LOCATION_KEY` option.

For the disk generation, the new `MultiThreadListener` is used.

For HDTVerify, it also checks for duplicated elements and prints the current section.
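For example, a possible invocation (the argument order follows the existing rdf2hdt usage and is an assumption):

```
rdf2hdt -disk -disklocation /tmp/hdt-work dataset.nt dataset.hdt
```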