Cost Models for Distributed Programming in Spark
Programming systems that manipulate large data volumes raise many challenges. As the amount of data being processed grows beyond single computer capabilities, the problem of effectively programming multiple computers, each possibly with multiple CPUs, virtual nodes and storage subsystems, becomes a major challenge [1, 2, 3, 4, 5].
The goal of this research topic is to come up with a formal characterization of the complexity of the execution of a distributed program written for a cluster-computing framework such as Spark. The characterization should reflect the resources consumed (e.g. CPU, memory, network) when executing distributed operations, in terms of the size and distribution of the input data. The overall objective is to build metrics suited to statically comparing and ranking programs according to their runtime performance. The metric should rank more highly those programs that minimize the overall consumption of cluster resources in a way that improves running time. This includes modeling and analyzing the total number of steps and stages, the maximum number of steps performed by any one worker, the balance of the number of steps executed across workers, and the total amount of data received and sent by each worker, in relation to the total data size and the initial data distribution. Contributions will draw on modeling, semantics, type systems, and compiler theory. The expected results are foundations for a language and an optimizing compiler, together with their concrete implementations. The relevance of the approach will be further assessed through an experimental evaluation. This research topic is part of the CLEAR project (http://tyrex.inria.fr/clear/), in which we investigate the synthesis of code optimized for distributed infrastructures.
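As a rough illustration of the kind of metrics envisaged, the following sketch (purely hypothetical, not part of any existing framework) computes simple cost indicators for one map-then-shuffle stage from the size and distribution of the input partitions alone. The function name, its parameters, and the `shuffle_fraction` assumption are all illustrative:

```python
# Hypothetical sketch: static cost indicators for one distributed stage,
# derived only from the size and distribution of the input partitions.

from statistics import mean

def cost_indicators(partition_sizes, shuffle_fraction=1.0):
    """Estimate simple cost metrics for one map-then-shuffle stage.

    partition_sizes  -- number of records held by each worker
    shuffle_fraction -- fraction of each partition re-sent over the
                        network during the shuffle (1.0 = full shuffle)
    """
    total_steps = sum(partition_sizes)           # total work across the cluster
    max_steps = max(partition_sizes)             # critical path of the stage
    balance = max_steps / mean(partition_sizes)  # 1.0 = perfectly balanced
    network = shuffle_fraction * total_steps     # records moved between workers
    return {"total": total_steps, "max": max_steps,
            "balance": balance, "network": network}

# A skewed 4-worker distribution: one straggler dominates the stage.
print(cost_indicators([100, 100, 100, 700]))
# → {'total': 1000, 'max': 700, 'balance': 2.8, 'network': 1000.0}
```

Even this toy model separates total work (`total`) from the critical path (`max`): two programs with the same total work can differ sharply in running time when the data distribution is skewed, which is exactly the distinction the intended metrics must capture.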
The Tyrex team is a joint research team (CNRS, Inria, UGA, Grenoble INP) located in the Inria Grenoble Rhône-Alpes research center. The team has developed methods for query analysis (see e.g. our logical solver) and a system for the distributed evaluation of queries by compilation into code executed on Apache Spark (see e.g. the SPARQLGX system). The candidate will benefit from the developments made by the team, will be encouraged to develop their own ideas, and will be provided with the means to experiment. This is a topic in which we can easily tailor the balance between theoretical and applied research.
References

[1] Umut A. Acar and Guy E. Blelloch. Algorithm Design: Parallel and Sequential.
[2] Yufei Tao, Wenqing Lin, and Xiaokui Xiao. Minimal MapReduce Algorithms.
[3] W. Fan and J.-P. Huai. Querying Big Data: Bridging Theory and Practice. J. Comput. Sci. Technol. 29: 849 (2014).
[4] Harsha Vardhan Simhadri. Program-Centric Cost Models for Locality and Parallelism. PhD thesis, CMU, September 2013.
[5] What should a theory of big data say? Informal discussion, 2013.