RDFHive


A Distributed RDF Store Built on Top of Apache Hive

Overview: RDFHive is a simple tool for querying RDF datasets with Apache Hive. Data already loaded on the Hadoop Distributed File System (HDFS) is registered in the Hive warehouse through a single three-column table (subject, predicate, and object fields). This storage model makes the data-load stage practically instantaneous. Data can then be queried using Hive's relational query language, Hive-QL.
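To illustrate this storage model, the triple table can be sketched in Hive-QL roughly as follows; the table name, delimiter, and exact DDL are assumptions for illustration, not necessarily what the load script emits:

```sql
-- Hypothetical sketch of RDFHive's storage model (names are assumptions):
-- an external table pointing at the N-Triples file already copied to HDFS,
-- which is why the load stage costs almost nothing.
CREATE EXTERNAL TABLE triples (
  subject   STRING,
  predicate STRING,
  object    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION '/HDFS/PATH/OF/NTRIPLES';
```

Because the table is external, Hive only records metadata at load time; the triples themselves stay where they were copied on HDFS.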

Dependencies

How to use it?

In this package, we provide bash scripts to load and query RDF datasets in the Hive warehouse. We also include a test suite where two well-known RDF/SPARQL benchmarks can be run: LUBM and WatDiv. For space reasons, these two datasets contain only a few hundred thousand RDF triples.

1. Get the sources:

wget tyrex.inria.fr/rdfhive/rdfhive.1.0.tar ;
tar -xvf rdfhive.1.0.tar ;
cd rdfhive.1.0/ ;

2. Load an RDF dataset: Natively, RDFHive can only load RDF data written in the N-Triples format; however, since many datasets come in other standards (e.g. RDF/XML), we also provide a .jar file from an external developer that translates RDF data from one standard to another.

hadoop fs -copyFromLocal /LOCAL/PATH/OF/NTRIPLES /HDFS/PATH/OF/NTRIPLES ;
bash rdfhiveload.sh DATABASE_NAME /HDFS/PATH/OF/NTRIPLES ;

3. Execute a SPARQL query: To execute a SPARQL query over an RDF dataset loaded into RDFHive, users first need to translate it into Hive-QL (an SQL-like language), since Hive is a relational data management system on top of HDFS. The following command then launches the execution:

bash rdfhivequery.sh /LOCAL/PATH/HiveQL/QUERY ;
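As an example of the translation step above, a simple SPARQL basic graph pattern becomes a self-join on the triple table. The query below is a hypothetical sketch, assuming a table named triples with subject, predicate, and object columns as described in the overview:

```sql
-- SPARQL: SELECT ?name WHERE { ?x rdf:type ub:Professor . ?x ub:name ?name }
-- Hypothetical Hive-QL translation (table and column names are assumptions):
SELECT t2.object AS name
FROM triples t1
JOIN triples t2 ON (t1.subject = t2.subject)   -- shared variable ?x
WHERE t1.predicate = 'rdf:type'
  AND t1.object    = 'ub:Professor'
  AND t2.predicate = 'ub:name';
```

Each triple pattern in the SPARQL query becomes one occurrence of the triple table, and each variable shared between patterns becomes a join condition.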

4. Run the test-suite:

cd tests/ ;
bash run-lubm.sh ;
bash run-watdiv.sh ;

License

This project is under the CeCILL license.

Authors

Contributors
Damien Graux
damien DOT graux AT inria DOT fr

Pierre Genevès
Nabil Layaïda
From the following institutions:
Inria
CNRS
LIG
UGA
Partially funded by:
Datalyse Project