Skip to content

Latest commit

 

History

History
 
 

hdt-java-package

HDT Library, Java Implementation. http://www.rdfhdt.org

Overview
=================

HDT-lib is a Java Library that implements the W3C Submission (http://www.w3.org/Submission/2011/03/) of the RDF HDT (Header-Dictionary-Triples) binary format for publishing and exchanging RDF data at large scale. Its compact representation allows storing RDF in fewer space, providing at the same time direct access to the stored information. This is achieved by depicting the RDF graph in terms of three main components: Header, Dictionary and Triples. The Header includes extensible metadata required to describe the RDF data set and details of its internals. The Dictionary organizes the vocabulary of strings present in the RDF graph by assigning numerical IDs to each different string. The Triples component comprises the internal structure of the RDF graph in a compressed form.

It provides several components:
- hdt-java-api: Abstract interface for dealing with HDT files.
- hdt-java-core: Core library for accessing HDT files programmatically from java. It allows creating HDT files from RDF and converting HDT files back to RDF. It also provides a Search interface to find triples that match a specific triple pattern.
- hdt-java-cli: Commandline tools to convert RDF to HDT and access HDT files from a terminal.
- hdt-jena: Jena integration. Provides a Jena Graph implementation that allows accessing HDT files as normal Jena Models. In turn, this can be used with Jena ARQ to provide more advanced searches, such as SPARQL, and even setting up SPARQL Endpoints with Fuseki.

Command line tools
==================

NOTE: Generating and loading big HDT Datasets can take a considerable amount of memory. In some cases you might need to increase the JVM heap memory using -Xmx flag according to your machine's memory in the bin/javaenv.{sh,bat} launch scripts.

The tool provides three main command line tools:

** The tool rdf2hdt converts an RDF file to HDT format. The format of the input file will be NTriples by default, although it can be set by using the "-f" flag.
** NOTE: rdf2hdt decompresses on the fly input files that end with ".gz".
$ Usage: rdf2hdt [options] <Input RDF> <Output HDT>
  Options:
    -base      Base URI for the dataset
    -config    Conversion config file
    -options   HDT Conversion options
    -rdftype   Type of RDF Input (ntriples, n3, rdfxml)


** The tool hdt2rdf converts an HDT file back to RDF in NTriples format.
$ hdt2rdf <HDT input file> <RDF output file>


** The tool hdtSearch allows to search triple patterns against an HDT file. For example, to list all patterns, one can use the "? ? ?" query. To search all information about <myns:subject1> one can use "<myns:subject1> ? ?"
$ hdtSearch <HDT File>

** You can also search SPARQL queries as
$ hdtsparql <HDT File> <SPARQL Query>

** The tool hdtInfo extracts the header from an HDT file.
$ hdtInfo <HDT File>


Tool Usage example
==================

After installation, run:

# To convert an RDF file to HDT
$ rdf2hdt -base "http://example.org/test" data/test.nt data/test.hdt

# To extract the Header of the generated HDT
$ hdtInfo data/test.hdt
<http://example.org/test> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "<http://purl.org/HDT/hdt#Dataset>" .
<http://example.org/test> <http://rdfs.org/ns/void#triples> "10" .
.....

# To convert back the HDT to RDF.
$ hdt2rdf data/test.hdt data/test.hdtexport.nt

# To search against an HDT file.
$ hdtSearch data/test.hdt

>> ? ? ?
http://example.org/uri3 http://example.org/predicate3 http://example.org/uri4
http://example.org/uri3 http://example.org/predicate3 http://example.org/uri5
http://example.org/uri4 http://example.org/predicate4 http://example.org/uri5
http://example.org/uri1 http://example.org/predicate1 "literal1"
http://example.org/uri1 http://example.org/predicate1 "literalA"
http://example.org/uri1 http://example.org/predicate1 "literalB"
http://example.org/uri1 http://example.org/predicate1 "literalC"
http://example.org/uri1 http://example.org/predicate2 http://example.org/uri3
http://example.org/uri1 http://example.org/predicate2 http://example.org/uriA3
http://example.org/uri2 http://example.org/predicate1 "literal1"
9 results shown.

>> http://example.org/uri3 ? ?
http://example.org/uri3 http://example.org/predicate3 http://example.org/uri4
http://example.org/uri3 http://example.org/predicate3 http://example.org/uri5
2 results shown.

>> exit

# Execute SPARQL Query against the file.
$ ./hdtsparql.sh data/test.hdt "SELECT ?s ?p ?o WHERE { ?s ?p ?o . }"

NOTE: Searching requires an additional .hdt.index that is automatically created in the same directory as your original HDT file. The first time you load an HDT it will take a while to generate it, the next times it will load much faster. If you are not using an HDT, you can safely delete its .hdt.index without losing any information. It will be automatically regenerated next time you use it.


USING THE LIBRARY
=================

Here we show how to use the library programmatically from java. This code can also be found under the examples directory.

First you need to add the hdt jars, typically hdt-api-VERSION.jar and hdt-java-core-VERSION.jar

If you use maven, you can get the packages automatically from Maven Central by adding the dependency:

  	<dependency>
  		<groupId>org.rdfhdt</groupId>
  		<artifactId>hdt-java-core</artifactId>
  		<version>1.1</version>
  	</dependency>

If you want to use a Snapshot version (with the most recent unstable version), You need to add the Sonatype snapshot repository to your ~/.m2/settings.xml as follows:

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
                          http://maven.apache.org/xsd/settings-1.0.0.xsd">

<!-- Your preferences -->

<profiles>
  <profile>
     <id>allow-snapshots</id>
        <activation><activeByDefault>true</activeByDefault></activation>
     <repositories>
       <repository>
         <id>snapshots-repo</id>
         <url>https://oss.sonatype.org/content/repositories/snapshots</url>
         <releases><enabled>false</enabled></releases>
         <snapshots><enabled>true</enabled></snapshots>
       </repository>
     </repositories>
   </profile>
</profiles>

</settings>


Once that you have the HDT libraries ready, you can use the following Java code to generate an HDT file:

// Generate an HDT File from RDF (examples/ExampleGenerate.java)
public static void main(String[] args) throws Exception {
	// Configuration variables
	String baseURI = "http://example.com/mydataset";
	String rdfInput = "/path/to/dataset.nt";
	String inputType = "ntriples";
	String hdtOutput = "/path/to/dataset.hdt";

	// Create HDT from RDF file
	HDT hdt = HDTManager.generateHDT(rdfInput, baseURI, RDFNotation.parse(inputType), new HDTSpecification(), null);

	try {
		// Save generated HDT to a file
		hdt.saveToHDT(hdtOutput, null);
	} finally {
		// IMPORTANT: Free resources
		hdt.close();
	}
}

And the following to open an HDT file and use it.

// Load an HDT and perform a search. (examples/ExampleSearch.java)
public static void main(String[] args) throws Exception {
	// Load HDT file. NOTE: Use loadHDT() if you don't need ?P?, ?PO or ??O queries
	HDT hdt = HDTManager.mapIndexedHDT("data/example.hdt", null);

	try {
		// Search pattern: Empty string means "any"
		IteratorTripleString it = hdt.search("", "", "");
		while(it.hasNext()) {
			TripleString ts = it.next();
			System.out.println(ts);
		}
	} finally {
		// IMPORTANT: Free resources
		hdt.close();
	}
}


To use an HDT file from Jena, make sure that you include hdt-jena-VERSION.jar as well as the HDT core libraries, then:

// Load HDT file using the hdt-java library
HDT hdt = HDTManager.loadIndexedHDT("path/to/file.hdt", null);

// Create Jena Model on top of HDT.
Model model = ModelFactory.createModelForGraph(graph);

// Use Jena Model as Read-Only data storage, e.g. Using Jena ARQ for SPARQL.


Getting the Sources
=================

The sources for the implementation are available in our Google Code repository: http://code.google.com/p/hdt-java


License
===============

Each module has a different License. Core is LGPL, examples and tools are Apache.

hdt-api: Apache License
hdt-java-cli (Commandline tools and examples): Apache License
hdt-java-core: Lesser General Public License
hdt-jena: Lesser General Public License
hdt-fuseki: Apache License


Authors
===============

Mario Arias <mario.arias@gmailcom>
Javier D. Fernandez <[email protected]>
Miguel A. Martinez-Prieto <[email protected]>


Acknowledgements
================

RDF/HDT is a project developed by the Insight Centre for Data Analytics (www.insight-centre.org), University of Valladolid (www.uva.es), University of Chile (www.uchile.cl). Funded by Science Foundation Ireland: Grant No. SFI/08/CE/I1380, Lion-II; the Spanish Ministry of Economy and Competitiveness (TIN2009-14009-C02-02); and Chilean Fondecyt's 1110287 and 1-110066.