added array of arrays implementation of dictionary #5

Merged 1 commit on Jun 27, 2016

Conversation

mielvds
Member

@mielvds mielvds commented May 12, 2016

@luda171 updated java-hdt so that it can process bigger files using a 64-bit address space.
The problems with java-hdt stemmed from the fact that Java's native array types use a 32-bit int for length and indexing.
This means the maximum length of each array is 2^31 - 1, roughly 2.1 billion elements.
Other Java classes such as ByteArrayOutputStream also rely on arrays internally, so they quickly run out of capacity as well.

All changes are in hdt-java-core.
The patch updates PFCDictionarySectionBig.java, where I implemented the two missing methods, load and save. I used the same array-of-arrays approach that the other methods in this class already use.
I also added SequenceLog64Jarray.java, which ports the code from SequenceLog64.java to the LongLargeArray class from the pl.edu.icm.jlargearrays library.
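
For readers unfamiliar with the array-of-arrays idea, here is a minimal sketch of how a long-indexed byte store with load and save could look. It is not the code from PFCDictionarySectionBig.java: the class name ChunkedByteArray, the chunkSize parameter, and the method signatures are all hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical illustration of an array of arrays; not a class from hdt-java. */
public class ChunkedByteArray {
    private final byte[][] chunks; // each inner array stays below the int limit
    private final int chunkSize;
    private final long length;

    public ChunkedByteArray(long length, int chunkSize) {
        if (length < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and chunkSize > 0");
        }
        this.length = length;
        this.chunkSize = chunkSize;
        long chunkCount = (length + chunkSize - 1) / chunkSize; // ceiling division
        if (chunkCount > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunkSize too small for this length");
        }
        chunks = new byte[(int) chunkCount][];
        long remaining = length;
        for (int i = 0; i < chunks.length; i++) {
            int size = (int) Math.min(chunkSize, remaining); // last chunk may be shorter
            chunks[i] = new byte[size];
            remaining -= size;
        }
    }

    public long length() {
        return length;
    }

    // A 64-bit index is split into (chunk number, offset inside the chunk).
    public byte get(long index) {
        return chunks[(int) (index / chunkSize)][(int) (index % chunkSize)];
    }

    public void set(long index, byte value) {
        chunks[(int) (index / chunkSize)][(int) (index % chunkSize)] = value;
    }

    // Writes the chunks back to back, so no single giant buffer is needed.
    public void save(OutputStream out) throws IOException {
        for (byte[] chunk : chunks) {
            out.write(chunk);
        }
    }

    // Fills every chunk completely from the stream, in order.
    public void load(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        for (byte[] chunk : chunks) {
            data.readFully(chunk);
        }
    }

    public static void main(String[] args) throws IOException {
        ChunkedByteArray a = new ChunkedByteArray(10, 4); // chunks of 4, 4 and 2 bytes
        for (long i = 0; i < a.length(); i++) {
            a.set(i, (byte) i);
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        a.save(buffer);
        ChunkedByteArray b = new ChunkedByteArray(10, 4);
        b.load(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(b.get(9)); // prints 9
    }
}
```

The tiny chunk size in main is only there to exercise the chunk boundaries; a real store would presumably use a chunk size close to the int limit.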

Right now the default dictionary is still FourSectionDictionary.java. I added FourSectionDictionaryBig.java, which is based on the two classes discussed above.
FourSectionDictionaryBig.java can be selected with dictionary.type = dictionaryFourBig
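
As a rough illustration of how such a key/value setting can be read, the sketch below parses the line with the JDK's java.util.Properties. This is an assumption made for the example only: it does not use hdt-java's own configuration loader, and the class name DictionaryTypeConfig and its helper method are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical helper; hdt-java's real option handling is not shown here. */
public class DictionaryTypeConfig {
    static final String KEY = "dictionary.type";

    // Returns the configured dictionary type, or empty if the key is absent.
    static Optional<String> dictionaryType(String configText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(configText));
        // Properties keeps trailing whitespace in values, so trim it.
        return Optional.ofNullable(props.getProperty(KEY)).map(String::trim);
    }

    public static void main(String[] args) throws IOException {
        String config = "dictionary.type = dictionaryFourBig\n";
        System.out.println(dictionaryType(config).orElse("(not set)"));
        // prints dictionaryFourBig
    }
}
```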

I added an example config file, hdt.cfg, in hdt-java-core. After adjusting
hdt-java/hdt-java-core/src/main/java/org/rdfhdt/hdt/example/ExampleGenerate.java to your local parameters, you can test it with Maven: /usr/bin/mvn -e exec:java

Also, just for reference, to set the Java heap size for Maven:

export MAVEN_OPTS=" -Xmx100G "

@mielvds
Member Author

mielvds commented May 12, 2016

First thing that comes to mind: is this simply an improvement or does this introduce breaking changes for some cases? If the former, updating SequenceLog64 & FourSectionDictionary seems better than introducing new classes.

@luda171

luda171 commented May 12, 2016

Yes, it can be made the default behavior. I kept it separate for now so people
can compare and test. I tested it on wikipedia 3.5 and 3.8.


@mielvds
Member Author

mielvds commented Jun 27, 2016

@bendiken I'll revise and merge this if there are no objections. What's your opinion?

@artob
Contributor

artob commented Jun 27, 2016

@mielvds Go ahead. Note also the plethora of opened pulls in the last few days.

@mielvds
Member Author

mielvds commented Jun 27, 2016

Going through them as we speak

@mielvds mielvds merged commit ea3ca7b into rdfhdt:master Jun 27, 2016