RDFHive
A Distributed RDF Store Built on Top of Apache Hive
Overview: RDFHive is a simple tool for querying RDF datasets with Apache Hive. Data already loaded on the Hadoop Distributed File System (HDFS) is stored in the Hive warehouse as a single three-column table (subject, predicate, and object fields, respectively). This storage model offers a virtually instantaneous data-load stage. Data can then be queried using Hive's relational language, Hive-QL.
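To illustrate the storage model, the load stage can be thought of as creating a table like the one below. This is only a sketch: the table name, column names, and field delimiter are illustrative assumptions, not necessarily what the actual load script produces.

```sql
-- Hypothetical sketch of RDFHive's single triple table
-- (names and delimiter are assumptions; the real script may differ).
CREATE TABLE triples (
  subject   STRING,
  predicate STRING,
  object    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
```

Because the table is a plain projection of the N-Triples file, loading amounts to registering the data with Hive rather than transforming it, which is why the load stage is essentially instantaneous.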
Dependencies
- Hadoop (+HDFS) version 2.6.0-cdh5.7.0
- Apache Hive version 1.1.0-cdh5.7.0
How to use it?
In this package, we provide bash scripts to load and query RDF datasets in the Hive warehouse. We also include a test suite in which two well-known RDF/SPARQL benchmarks can be run: LUBM and WatDiv. For space reasons, these two datasets contain only a few hundred thousand RDF triples.
1. Get the sources:
wget tyrex.inria.fr/rdfhive/rdfhive.1.0.tar ; tar -xvf rdfhive.1.0.tar ; cd rdfhive.1.0/ ;
2. Load an RDF dataset: Natively, RDFHive only loads RDF data written in the N-Triples format; however, since many datasets come in other standards (e.g. RDF/XML), we also provide a .jar file from an external developer that translates RDF data from one standard to another.
hadoop fs -copyFromLocal /LOCAL/PATH/OF/NTRIPLES /HDFS/PATH/OF/NTRIPLES ; bash rdfhiveload.sh DATABASE_NAME /HDFS/PATH/OF/NTRIPLES ;
3. Execute a SPARQL query: To execute a SPARQL query over an RDF dataset loaded into RDFHive, users first need to translate it into Hive-QL (an SQL-like language), since Hive is a relational data management system on top of HDFS. The following command then launches the execution:
bash rdfhivequery.sh /LOCAL/PATH/HiveQL/QUERY ;
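As an example of such a manual translation, consider a simple SPARQL query asking for the names of persons. Each triple pattern becomes one occurrence of the triple table, and shared variables become join conditions. The table and column names below are assumptions made for illustration (a single `triples` table with `subject`/`predicate`/`object` columns); adapt them to whatever the load script actually created.

```sql
-- SPARQL (for reference):
--   SELECT ?name WHERE { ?x rdf:type foaf:Person . ?x foaf:name ?name }
--
-- A possible Hive-QL translation, assuming the illustrative
-- triple table from above: one self-join per extra triple pattern.
SELECT t2.object AS name
FROM triples t1
JOIN triples t2 ON (t1.subject = t2.subject)
WHERE t1.predicate = '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>'
  AND t1.object    = '<http://xmlns.com/foaf/0.1/Person>'
  AND t2.predicate = '<http://xmlns.com/foaf/0.1/name>';
```

In general, a query with n triple patterns translates to an n-way self-join over the triple table, which is the cost of the single-table storage model.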
4. Run the test-suite:
cd tests/ ; bash run-lubm.sh ; bash run-watdiv.sh ;
License
This project is under the CeCILL license.
Authors
Contributors: Damien Graux (damien DOT graux AT inria DOT fr), Pierre Genevès, Nabil Layaïda
From the following institutions: Inria, Cnrs, LIG, UGA
Partially funded by: the Datalyse Project