I've merged PRs rdfhdt#172 and rdfhdt#162 because the 2 algorithms are better together.
This pull request creates 2 new methods to generate an HDT: catTree and disk.
catTree creates small HDTs using the generateHDT method and merges them with HDTCat, which reduces memory usage and makes it possible to create an HDT without having the memory to store it.
disk uses merge sort to merge the sections and the triples. It is only available for HDTs based on FourSectionDictionary or MultiSectionDictionary. It makes it possible to create an HDT without having enough memory to load it.
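The catTree idea can be illustrated without the HDT library. In this minimal sketch, a sorted set stands in for a small HDT and a set union stands in for HDTCat; the class name, method names and the fixed chunk size are all hypothetical, and the merge order is simplified to a left fold, which is an assumption rather than a description of the implementation in this PR.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

final class CatTreeSketch {
    /** Stand-in for a small HDT: a sorted, deduplicated set built from at most chunkSize triples. */
    static TreeSet<String> buildChunk(Iterator<String> triples, int chunkSize) {
        TreeSet<String> chunk = new TreeSet<>();
        int read = 0;
        while (read < chunkSize && triples.hasNext()) {
            chunk.add(triples.next());
            read++;
        }
        return chunk;
    }

    /** Stand-in for HDTCat: merges two sorted sets into a new one. */
    static TreeSet<String> cat(TreeSet<String> a, TreeSet<String> b) {
        TreeSet<String> merged = new TreeSet<>(a);
        merged.addAll(b);
        return merged;
    }

    /** Reads the stream chunk by chunk and cats each new chunk into the running result. */
    static List<String> generate(Iterator<String> triples, int chunkSize) {
        TreeSet<String> result = new TreeSet<>();
        while (triples.hasNext()) {
            result = cat(result, buildChunk(triples, chunkSize));
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<String> input = List.of("c", "a", "b", "a", "d");
        System.out.println(generate(input.iterator(), 2));
    }
}
```

Only one chunk is being built at any time, which is where the memory saving comes from. In the sketch the running result also stays in memory; that is a simplification of the sketch only.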
API Changes
It adds 10 new methods to HDTManager and 7 to implement.
It also adds 2 new classes: HDTSupplier, to specify how to build the HDT, and RDFFluxStop, to specify when to stop the RDF stream.
Both HDTSupplier and RDFFluxStop have methods to quickly create instances.
It's also possible to use multiple limit with the
methods.
The loader type can be set to disk, cat or cat-disk to use the other methods with the base method.
org.rdfhdt.hdt.options.HDTOptionsKeys
Until now, the keys of HDTOptions had to be written as plain strings copied from the doc. Instead, I've created this utility class to get the key names. I've also added some keys/values from the generateHDT method.
UnicodeEscape fix
A small fix was made in this commit to the UnicodeEscape#escapeString(String, Appendable) method for the case where the unicode delimiter isn't specified (no "" or <>).
org.rdfhdt.hdt.listener.MultiThreadListener
The current implementation of ProgressListener didn't take multi-threaded computations into account. To fix this issue, I've added a new ProgressListener type, the multi-thread listener. It works like a progress listener, but also receives the origin thread.
An implementation was created in the HDT Java Command line Tools module.
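A listener that also carries the origin thread could look like the sketch below. The interfaces ProgressSketch and ThreadedProgressSketch are hypothetical stand-ins written for this example; they are not the signatures of ProgressListener or MultiThreadListener, which are not spelled out here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a single-threaded progress listener. */
interface ProgressSketch {
    void notifyProgress(float level, String message);
}

/** Hypothetical multi-thread variant: the same notification, plus the origin thread name. */
interface ThreadedProgressSketch extends ProgressSketch {
    void notifyProgress(String thread, float level, String message);

    /** Plain notifications are forwarded with the name of the calling thread. */
    @Override
    default void notifyProgress(float level, String message) {
        notifyProgress(Thread.currentThread().getName(), level, message);
    }
}

final class ListenerDemo {
    static List<String> run() {
        // Synchronized list because several worker threads may report at once.
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        ThreadedProgressSketch listener =
                (thread, level, message) -> log.add("[" + thread + "] " + level + " " + message);

        Thread worker = new Thread(() -> listener.notifyProgress(50f, "merging sections"), "worker-1");
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Because the plain two-argument call is a default method, code written against the single-threaded interface keeps working and the thread name is filled in automatically.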
Core changes
Implementation of HDTCatTree with tests.
Some fixes on the header part with HDTCat.
Removal of the System.out.println calls during HDTCat, in favor of the ProgressListener.
PlainHeader
This pull request contains a fix for loaded/mapped HDTs: the header didn't contain the baseUri.
Generate Disk
This method is split into multiple phases. The parser reads the RDF input only once, so the implementation is the same for a File (String), an InputStream or an Iterator of TripleString.
Write triples/merge sections
For each triple, we assign a new id to each node ((s, sid), (p, pid), (o, oid)). We attach these ids to the components while sorting the components into 3 section files with a merge sort. The ids are the number of the triple, so we don't need to store the triples.
At the end, we have 3 sorted, compressed section files (subject, predicate, object), with an id attached to each string (node, node_id).
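As a rough illustration of this phase for one component (for example the subjects), the sketch below tags each node with its triple number, sorts bounded chunks, and merges the sorted runs with a k-way merge. Everything is kept in memory for brevity, whereas the method described here writes the runs to compressed files; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class SectionSortSketch {
    /** A node string tagged with the id it received when its triple was read. */
    static final class IndexedNode {
        final String node;
        final long id;

        IndexedNode(String node, long id) {
            this.node = node;
            this.id = id;
        }

        @Override
        public String toString() {
            return "(" + node + ", " + id + ")";
        }
    }

    static final Comparator<IndexedNode> BY_NODE =
            Comparator.comparing((IndexedNode n) -> n.node).thenComparingLong(n -> n.id);

    /** Cuts the input into chunks of at most chunkSize nodes and sorts each chunk. */
    static List<List<IndexedNode>> sortedRuns(List<String> nodes, int chunkSize) {
        List<List<IndexedNode>> runs = new ArrayList<>();
        for (int start = 0; start < nodes.size(); start += chunkSize) {
            List<IndexedNode> run = new ArrayList<>();
            int end = Math.min(start + chunkSize, nodes.size());
            for (int i = start; i < end; i++) {
                run.add(new IndexedNode(nodes.get(i), i + 1L)); // id = 1-based triple number
            }
            run.sort(BY_NODE);
            runs.add(run);
        }
        return runs;
    }

    /** K-way merge: repeatedly takes the smallest head among all runs. */
    static List<IndexedNode> merge(List<List<IndexedNode>> runs) {
        // Each queue entry is {run index, position inside that run}.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> BY_NODE.compare(runs.get(a[0]).get(a[1]), runs.get(b[0]).get(b[1])));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heads.add(new int[] {r, 0});
            }
        }
        List<IndexedNode> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<IndexedNode> run = runs.get(head[0]);
            out.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> subjects = List.of("b", "a", "b", "c");
        System.out.println(merge(sortedRuns(subjects, 2)));
    }
}
```

Equal strings end up next to each other in the merged output, each still carrying its own id, which is what makes the later duplicate detection a single sequential pass.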
Create sections/id map files
With the raw triples, we create the 4 sections, removing the duplicates and extracting the shared elements.
At the same time, we fill 3 map files (SequenceLog64BigDisk) to map the initial node id (sid, pid or oid) to the position in one of the 4 sections. We use a Sequence to reduce the disk usage of the maps.
We mark duplicates with the 1st bit and shared elements with the 2nd bit. The other bits are the id in the section for non-duplicates, or the id of the original for duplicates.
So for example, if we have:
0b1100 -> Non shared element with index 3 (0b11)
0b1110 -> Shared element with index 3 (0b11)
0b1101 -> Duplicate element, the section index is in the map at the index 3 (0b11)
The dictionary is completed.
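The flag layout can be sketched as a pair of encode/decode helpers. This assumes that "1st bit" means the least significant bit and "2nd bit" the next one, following the order given in the prose; the constant and method names are hypothetical and may not match the ones used in the code of this PR.

```java
final class IdMapEncoding {
    static final long DUPLICATE_MASK = 1L;   // assumed: 1st (lowest) bit
    static final long SHARED_MASK = 1L << 1; // assumed: 2nd bit
    static final int INDEX_SHIFT = 2;        // remaining bits hold the index

    /** Packs an index and its two flags into a single map value. */
    static long encode(long index, boolean shared, boolean duplicate) {
        long value = index << INDEX_SHIFT;
        if (shared) {
            value |= SHARED_MASK;
        }
        if (duplicate) {
            value |= DUPLICATE_MASK;
        }
        return value;
    }

    static boolean isDuplicate(long value) {
        return (value & DUPLICATE_MASK) != 0;
    }

    static boolean isShared(long value) {
        return (value & SHARED_MASK) != 0;
    }

    /** Section index for non-duplicates, id of the original entry for duplicates. */
    static long index(long value) {
        return value >>> INDEX_SHIFT;
    }

    public static void main(String[] args) {
        long value = encode(3, false, true);
        System.out.println(Long.toBinaryString(value)
                + " duplicate=" + isDuplicate(value)
                + " shared=" + isShared(value)
                + " index=" + index(value));
    }
}
```

Under this layout a reader of the map checks the duplicate flag first: if it is set, the index points to another map entry that holds the real section position.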
Map triples with section IDs/merge triples
During the first step, we gave the nodes of the sections incremental ids 1..numTriples, so we simply need to map them with the maps created during the second step and sort the resulting triples with a merge sort.
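A minimal sketch of this mapping step follows. It assumes the three maps have already been resolved to plain section ids (no duplicate or shared flags left) and fit in arrays, and it sorts in memory instead of using a disk merge sort; the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MapTriplesSketch {
    /**
     * Triple number t (1-based) used t as the initial id of its subject, predicate
     * and object, so entry t of each map gives the final section id. Index 0 is unused.
     */
    static List<long[]> mapAndSort(long[] subjectMap, long[] predicateMap, long[] objectMap) {
        int numTriples = subjectMap.length - 1;
        List<long[]> triples = new ArrayList<>();
        for (int t = 1; t <= numTriples; t++) {
            triples.add(new long[] {subjectMap[t], predicateMap[t], objectMap[t]});
        }
        // Order by subject, then predicate, then object.
        triples.sort(Comparator.comparingLong((long[] t) -> t[0])
                .thenComparingLong(t -> t[1])
                .thenComparingLong(t -> t[2]));
        return triples;
    }

    public static void main(String[] args) {
        long[] subjects = {0, 2, 1, 2};
        long[] predicates = {0, 1, 1, 1};
        long[] objects = {0, 3, 1, 2};
        for (long[] triple : mapAndSort(subjects, predicates, objects)) {
            System.out.println(Arrays.toString(triple));
        }
    }
}
```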
Create triples
With the triples sorted, we can create the bitmap of the triples.
The triples are completed.
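For readers unfamiliar with bitmap triples, here is a small sketch of how sorted, deduplicated triples can be turned into two sequences and two bitmaps. It assumes the common layout in which a set bit marks the last predicate of a subject (bitmapY) and the last object of a subject/predicate pair (bitmapZ); bitmaps are kept as strings of 0 and 1 for readability, and the class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BitmapTriplesSketch {
    final List<Long> seqY = new ArrayList<>();            // predicates, grouped by subject
    final List<Long> seqZ = new ArrayList<>();            // objects, grouped by (subject, predicate)
    final StringBuilder bitmapY = new StringBuilder();    // '1' = last predicate of a subject
    final StringBuilder bitmapZ = new StringBuilder();    // '1' = last object of a (subject, predicate)

    private static void closeLast(StringBuilder bitmap) {
        bitmap.setCharAt(bitmap.length() - 1, '1');
    }

    /** Expects triples sorted by subject, predicate, object, without duplicates. */
    static BitmapTriplesSketch build(List<long[]> sorted) {
        BitmapTriplesSketch out = new BitmapTriplesSketch();
        long[] prev = null;
        for (long[] t : sorted) {
            boolean newSubject = prev == null || t[0] != prev[0];
            boolean newPredicate = newSubject || t[1] != prev[1];
            if (newPredicate) {
                if (prev != null) {
                    closeLast(out.bitmapZ); // previous object ended its list
                }
                if (newSubject && prev != null) {
                    closeLast(out.bitmapY); // previous predicate ended its list
                }
                out.seqY.add(t[1]);
                out.bitmapY.append('0');
            }
            out.seqZ.add(t[2]);
            out.bitmapZ.append('0');
            prev = t;
        }
        if (prev != null) {
            closeLast(out.bitmapY);
            closeLast(out.bitmapZ);
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> sorted = List.of(
                new long[] {1, 1, 1}, new long[] {1, 1, 2}, new long[] {1, 2, 3}, new long[] {2, 1, 2});
        BitmapTriplesSketch bt = build(sorted);
        System.out.println(bt.seqY + " " + bt.bitmapY + " " + bt.seqZ + " " + bt.bitmapZ);
    }
}
```

The subjects are not stored at all: because the triples are sorted and subject ids are consecutive, the position of the set bits in bitmapY is enough to recover them.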
Create header
Simply create the header with the Dictionary/Triples parts. The original size isn't computed the same way as in the in-memory generateHDT method, so the value can differ.
The Header and the HDT are completed.
Options
The generate method can be configured with multiple option keys, which can be found in HDTOptionsKeys:
LOADER_DISK_COMPRESSION_MODE_KEY
Changes the sort method, can be 2 values:
LOADER_DISK_COMPRESSION_WORKER_KEY
(For complete sort only)
The maximum count of workers to merge the files
LOADER_DISK_CHUNK_SIZE_KEY
The maximum size of a chunk to merge sort. By default, it is 85% of one third of the allocated RAM.
LOADER_DISK_LOCATION_KEY
Sets the working directory. By default, it is a temporary folder. The directory is created (mkdirs) before usage and deleted after.
LOADER_DISK_FUTURE_HDT_LOCATION_KEY
Sets the future HDT location. If this value is set, the method generates the HDT file and maps it, which reduces the RAM usage. By default, the method loads the HDT into memory without creating a file.
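The default described for LOADER_DISK_CHUNK_SIZE_KEY (85% of one third of the allocated RAM) can be written out as a small computation. The sketch assumes that "allocated RAM" refers to the JVM's maximum heap as reported by Runtime.maxMemory(); which measure is actually used is not stated here, and the helper name is hypothetical.

```java
final class ChunkSizeSketch {
    /**
     * 85% of one third of the given amount of memory.
     * 85% is written as 17/20 and the divisions come first, so the
     * computation cannot overflow even when the input is Long.MAX_VALUE.
     */
    static long defaultChunkSize(long allocatedBytes) {
        return allocatedBytes / 3 / 20 * 17;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory(); // assumption: this is the "allocated RAM"
        System.out.println("chunk size: " + defaultChunkSize(maxHeap) + " bytes");
    }
}
```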
Tests
To test this method, I generate 2 HDTs, one with generateHDT and one with generateHDTDisk, using map/load and partial/complete sort, and check that the 2 HDTs are equal.
Some other tests cover the writer/reader of the compression files and the mapping.
HDT Java Command line Tools Changes
New parameters were added to the rdf2hdt tool:
-disk - Use the generateHDTDisk version instead of generateHDT to create the HDT.
-disklocation [workingLocation] - Specify the working directory, shortcut for the LOADER_DISK_LOCATION_KEY option.
-cattree - Use the cattree version instead of generateHDT to create the HDT.
-cattreelocation [workingLocation] - Specify the working directory, shortcut for the LOADER_DISK_LOCATION_KEY option.
-printoptions - Print all the config for the HDT.
-color - Use colors in the console.
-multithread - Use multi-thread logs.
For the disk/tree generation, the new MultiThreadListener is used.
HDTVerify now works with MSC. It also checks for duplicated elements and prints the current section. I've added the hdtVerify.bat file to use hdtVerify on Windows.