Master Internship proposal:
Fast evaluation of SPARQL queries on distributed systems
In recent years, we have seen the development of various "distributed frameworks". These frameworks take care of all the low-level machinery needed to distribute computation: networking, file storage, data distribution and replication, task scheduling, etc. They offer the programmer a set of parallel primitives from which distributed programs can be composed. Notable examples of such frameworks include Apache Hadoop (based on MapReduce), Spark, and Flink.
The advantages of using a framework are twofold: users can spend more time focusing on the higher-level part of the problem they are trying to solve, and they benefit from all the optimizations performed by the distributed framework.
SPARQL is the W3C standard query language (see the SPARQL 1.1 overview below) for querying data expressed in RDF (Resource Description Framework). The increasing amount of available RDF data raises a major need for, and research interest in, efficient and scalable distributed SPARQL query evaluators.
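To make this concrete, here is a minimal, hypothetical example: an RDF dataset is a set of subject-predicate-object triples, and a SPARQL query matches triple patterns against it. All URIs below are invented for illustration.

```sparql
# Data (one RDF triple, in Turtle-like syntax):
#   <http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .

# Query: which resources know Bob?
SELECT ?person WHERE {
  ?person <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
}
```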
SPARQLGX is a distributed RDF datastore based on Apache Spark, developed in the Tyrex team (available on GitHub). SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries. It relies on a simple translation of SPARQL queries into executable Spark code, adopting evaluation strategies according to general heuristics and statistics on the data. Despite this simple design, SPARQLGX already represents an interesting alternative, outperforming its competitors on large (billions of triples) datasets (see the SPARQLGX paper below).
Internship
The goal of this internship is to improve SPARQLGX. The intern will be free to develop their own ideas, but several paths are ready to be investigated:
- On many queries, the time spent reading data from disk represents a large portion of the total query time. Since the graph nodes are URIs that contain a lot of redundant data, compressing URIs could lead to performance improvements.
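As an illustration of one such idea, the sketch below (plain Scala, not the actual SPARQLGX code) dictionary-encodes URIs: each distinct URI is mapped to a dense integer ID, so triples are stored as small fixed-size integers instead of long, highly redundant strings. The object name and the encoding scheme are illustrative assumptions.

```scala
// Illustrative sketch (not SPARQLGX code): dictionary-encode URIs so that
// triples are stored as integer IDs instead of long redundant strings.
object UriDictionary {
  // Assign each distinct URI a dense integer ID, in first-seen order,
  // and rewrite every triple with those IDs.
  def encode(triples: Seq[(String, String, String)])
      : (Map[String, Int], Seq[(Int, Int, Int)]) = {
    val dict = scala.collection.mutable.LinkedHashMap[String, Int]()
    def id(uri: String): Int = dict.getOrElseUpdate(uri, dict.size)
    val encoded = triples.map { case (s, p, o) => (id(s), id(p), id(o)) }
    (dict.toMap, encoded)
  }
}
```

Decoding only needs the inverse map, and a common prefix shared by many URIs (e.g. a namespace) is then stored once in the dictionary rather than once per triple.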
- As of today, SPARQLGX uses a simple vertical partitioning scheme (see Abadi et al. below) that splits the dataset into as many pieces as there are distinct predicates. This scheme could be refined, e.g. by grouping rare predicates together or splitting very frequent ones.
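A minimal sketch of such a refined vertical partitioning, in plain Scala (the names and the single shared "misc" partition for rare predicates are illustrative assumptions, not the actual SPARQLGX scheme):

```scala
// Illustrative sketch (not SPARQLGX code): vertical partitioning where each
// frequent predicate gets its own partition and rare predicates are merged
// into a shared "misc" partition.
object VerticalPartition {
  // Returns partition-name -> triples.
  def partition(triples: Seq[(String, String, String)], minSize: Int)
      : Map[String, Seq[(String, String, String)]] = {
    val byPred = triples.groupBy(_._2)  // _2 is the predicate
    val (frequent, rare) = byPred.partition { case (_, ts) => ts.size >= minSize }
    val misc =
      if (rare.isEmpty) Map.empty[String, Seq[(String, String, String)]]
      else Map("misc" -> rare.values.flatten.toSeq)
    frequent ++ misc
  }
}
```

In the frequent partitions the predicate column could additionally be dropped, since it is implied by the partition name.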
- For the moment, SPARQLGX translates the given query into an initial join order and then tries to optimize that order; some plans are therefore never considered. It would be better to consider all plans (when possible) and then select the best one (see CliqueSquare below, or a similar idea in the PostgreSQL planner).
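The idea can be sketched with a toy model in plain Scala (not the SPARQLGX optimizer): it enumerates every left-deep join order of the triple patterns under a naive cost model, whereas approaches like CliqueSquare also consider bushy plans; the default selectivity constant is an arbitrary assumption.

```scala
// Illustrative sketch: enumerate all left-deep join orders of the triple
// patterns and keep the cheapest under a toy cost model (sum of estimated
// intermediate result sizes).
object JoinEnumeration {
  // card(i): estimated result size of triple pattern i.
  // shared(i, j): whether patterns i and j share a variable; joining
  // connected patterns shrinks the result by `sel`, disconnected patterns
  // form a cross product.
  def bestOrder(card: Vector[Double],
                shared: (Int, Int) => Boolean,
                sel: Double = 0.1): (Seq[Int], Double) = {
    val orders = card.indices.permutations.map { perm =>
      var size = card(perm.head)
      var cost = size
      var placed = List(perm.head)
      for (i <- perm.tail) {
        val connected = placed.exists(j => shared(i, j))
        size = if (connected) size * card(i) * sel else size * card(i)
        cost += size
        placed ::= i
      }
      (perm.toSeq, cost)
    }
    orders.minBy(_._2)
  }
}
```

Exhaustive enumeration is factorial in the number of patterns, so a real optimizer would switch to heuristics beyond some query size; that is exactly the kind of trade-off the internship could explore.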
- The distributed framework Spark offers several types of join. In the current version, our translation only uses the "Hash-Shuffle-Join"; it would be interesting to detect the cases where other join types might improve SPARQLGX performance.
- The distributed framework Flink takes an approach slightly different from Spark's but offers the same basic parallel primitives. It would be interesting to add Flink as a possible backend for SPARQLGX.
Applicants should have a programming background (a good knowledge of at least one programming language) and a taste for compilation and optimization.
The current SPARQLGX codebase mixes various technologies and languages: OCaml (including ocamllex and Menhir), Scala, Bash, Spark, Hadoop, Cloudera, etc. We do not expect applicants to have experience with all of them, but knowledge of some of them is a plus.
References:
- SPARQL 1.1 Overview, W3C Recommendation (March 2013)
- SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark. Damien Graux, Louis Jachiet, Pierre Genevès and Nabil Layaïda
- CliqueSquare: Flat Plans for Massively Parallel RDF Queries. François Goasdoué, Zoi Kaoudi, Ioana Manolescu
- Scalable Semantic Web Data Management Using Vertical Partitioning. Daniel J. Abadi, Adam Marcus, Samuel R. Madden
- The PostgreSQL planner/optimizer (documentation)