SPARQLGX
A Distributed RDF Store Mapping SPARQL to Spark
Overview: SPARQL is the W3C standard query
language for querying data expressed in the Resource
Description Framework (RDF). The increasing amount of
available RDF data raises a major need, and research
interest, in building efficient and scalable distributed
SPARQL query evaluators.
In this context, we propose and share
SPARQLGX: our implementation of a distributed RDF datastore
based on Apache Spark. SPARQLGX is designed to leverage
existing Hadoop infrastructures for evaluating SPARQL
queries. SPARQLGX relies on a translation of SPARQL queries
into executable Spark code that adopts evaluation strategies
according to (1) the storage method used and (2) statistics
on the data. Thanks to its simple design, SPARQLGX already
represents an interesting alternative in several scenarios.
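To give an intuition of this translation, here is a minimal, hand-written sketch of how a single triple pattern can be mapped to Spark operations. It is not SPARQLGX's actual generated code; every IRI and path in it is a placeholder.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch of the SPARQL-to-Spark mapping idea. This is not
    // SPARQLGX's actual generated code; IRIs and paths are placeholders,
    // and the N-Triples parsing is naive (it ignores literals with spaces).
    object TranslationSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("translation-sketch"))

        // Read N-Triples lines as (subject, predicate, object).
        val triples = sc.textFile("hdfs:///hdfs_file.nt")
          .map(_.trim.stripSuffix(".").trim.split("\\s+", 3))
          .filter(_.length == 3)
          .map(t => (t(0), t(1), t(2)))

        // SPARQL:  SELECT ?s WHERE { ?s <p:type> <c:Student> . }
        // becomes a filter on predicate and object, then a projection on subject.
        val result = triples
          .filter { case (_, p, o) => p == "<p:type>" && o == "<c:Student>" }
          .map { case (s, _, _) => s }

        result.collect().foreach(println)
      }
    }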
Requirements
- Apache Hadoop (+HDFS) version 2.6.0-cdh5.7.0
- Apache Spark version 1.6.0
- OCaml version 4.02.2
How to use it?
In this package, we provide sources to load and query RDF datasets with SPARQLGX and S-DE (a SPARQL direct evaluator). We also provide a test-suite in which two well-known RDF/SPARQL benchmarks can be run: LUBM and WatDiv. For space reasons, these two datasets only contain a few hundred thousand RDF triples.
1. Get the sources and compile:
wget tyrex.inria.fr/sparqlgx/sparqlgx.1.0.tar
tar -xvf sparqlgx.1.0.tar
cd sparqlgx.1.0/
bash dependencies.sh
bash compile.sh
2. Load an RDF dataset: SPARQLGX can only load RDF data serialized in the N-Triples format; however, since many datasets come in other standards (e.g. RDF/XML), we also provide a .jar file from an external developer that can convert RDF data from one format to another.
hadoop fs -copyFromLocal local_file.nt hdfs_file.nt
hadoop fs -mkdir dataset_hdfs_dir/
spark-submit --class "Load" bin/sparqlgx-load-0.1.jar hdfs_file.nt dataset_hdfs_dir/
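Behind this command, the loader pre-processes the dataset according to SPARQLGX's storage method. As a rough illustration of one classic layout for HDFS-backed triple stores (vertical partitioning: one output per predicate, holding (subject, object) pairs), consider the sketch below; SPARQLGX's exact on-disk format may differ, and all paths are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    // Rough sketch of a vertically partitioned load: one output directory
    // per predicate, each holding (subject, object) pairs. SPARQLGX's exact
    // on-disk format may differ; the per-predicate loop is kept naive for
    // readability.
    object LoadSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("load-sketch"))

        val triples = sc.textFile("hdfs:///hdfs_file.nt")
          .map(_.trim.stripSuffix(".").trim.split("\\s+", 3))
          .filter(_.length == 3)

        // Key each (subject, object) pair by its predicate.
        val byPredicate = triples.map(t => (t(1), (t(0), t(2)))).cache()

        // Write one directory per predicate so queries only scan relevant files.
        byPredicate.keys.distinct().collect().foreach { p =>
          val dir = p.replaceAll("[^A-Za-z0-9]", "_") // crude file-name encoding
          byPredicate.filter(_._1 == p)
            .map { case (_, (s, o)) => s + "\t" + o }
            .saveAsTextFile("hdfs:///dataset_hdfs_dir/" + dir)
        }
      }
    }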
3. Execute a SPARQL query: To execute a SPARQL query over a loaded RDF dataset, users first need to translate it into executable Spark (Scala) code; the Spark engine then evaluates the generated program. The following command wraps this entire process:
bash bin/sparqlgx-eval.sh local_query.rq dataset_hdfs_dir/
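For intuition only, hand-written Spark code for a two-pattern query over per-predicate files (as in the previous sketch) could look like the following; SPARQLGX's real generator covers the general case, and all IRIs, directory names and paths are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of what generated Spark code for a two-pattern query might
    // look like when it reads per-predicate (subject, object) files.
    // Illustrative only; SPARQLGX's real generator handles arbitrary
    // SPARQL patterns.
    object QuerySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("query-sketch"))

        // Load one predicate's (subject, object) pairs.
        def pred(dir: String) =
          sc.textFile("hdfs:///dataset_hdfs_dir/" + dir)
            .map(_.split("\t", 2))
            .map(t => (t(0), t(1)))

        // SPARQL:  SELECT ?n WHERE { ?x <p:advisor> ?y . ?y <p:name> ?n . }
        // The shared variable ?y becomes the join key.
        val advisor = pred("advisor").map { case (x, y) => (y, x) } // keyed on ?y
        val name    = pred("name")                                  // keyed on ?y

        advisor.join(name)               // (?y, (?x, ?n))
          .map { case (_, (_, n)) => n } // project ?n
          .collect()
          .foreach(println)
      }
    }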
4. Use the additional tools: We also provide two supplementary modules as extensions.
- A SPARQL query rewriter based on RDF data statistics. Users first compute statistics over the dataset; the SPARQL query clauses (triple patterns in the WHERE{...} block) can then be reordered according to the data distribution (a sketch of the underlying idea follows the commands below).
- A SPARQL Direct Evaluator (S-DE), which requires no loading phase: S-DE evaluates SPARQL queries directly over RDF datasets stored on HDFS.
bash bin/generate-stat.sh hdfs_file.nt output_local_stat.txt   # Generate Statistics
bash bin/order-with-stat.sh local_query.rq local_stat.txt      # Use Statistics
bash bin/sparqlgx-direct-eval.sh local_query.rq hdfs_file.nt
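As announced above, the idea behind the statistics-based reordering can be sketched as follows: estimate the cardinality of each triple pattern (here, crudely, as the number of triples using its predicate) and evaluate the most selective patterns first. This illustrates the principle only and assumes nothing about SPARQLGX's actual statistics file format.

    // Sketch of statistics-driven reordering of triple patterns.
    object ReorderSketch {
      case class Pattern(s: String, p: String, o: String)

      def reorder(patterns: Seq[Pattern],
                  predicateCounts: Map[String, Long]): Seq[Pattern] =
        // A pattern whose predicate is unknown (e.g. a variable in predicate
        // position) is pessimistically assumed to match everything.
        patterns.sortBy(tp => predicateCounts.getOrElse(tp.p, Long.MaxValue))

      def main(args: Array[String]): Unit = {
        val stats = Map("<p:type>" -> 1000000L, "<p:advisor>" -> 200L)
        val where = Seq(Pattern("?x", "<p:type>", "<c:Student>"),
                        Pattern("?x", "<p:advisor>", "?y"))
        // The selective <p:advisor> pattern is now evaluated first.
        reorder(where, stats).foreach(println)
      }
    }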
5. Run the test-suite:
cd tests/
bash run-me-first.sh          # Creates a workspace on the HDFS
bash run-sde-watdiv.sh        # Directly evaluates L1 on watdiv.nt
bash run-sparqlgx-lubm.sh     # Loads lubm.nt and executes Q1
bash run-sparqlgx-watdiv.sh   # Loads watdiv.nt and executes C1
bash run-statistics.sh        # Generates statistic files for the two datasets
License
This project is under the CeCILL license.
Authors
Contributors:
- Damien Graux (damien DOT graux AT inria DOT fr)
- Louis Jachiet
- Pierre Genevès
- Nabil Layaïda

From the following institutions: Inria, Cnrs, LIG, UGA

Partially funded by: the Datalyse Project