PhD Position (CIFRE): Relational query optimization for multidimensional data


Context

Query performance on analytic workloads is heavily influenced by the quality of the query optimizer, a dependency that has prompted several decades of research on the optimizer's different components. When it comes to query optimization and plan cost estimation, no single approach fits all cases [2]. Contributing factors include relational operators and enumeration algorithms that are often vendor-specific, the complexity of the analytic workload, and the underlying system architecture.

Multidimensional data models are very popular in scientific processing and machine learning workloads [3] as they help capture variable data. Prevalent data types include arrays and dictionaries (maps), which are also becoming an important part of financial workloads. Query evaluation on multidimensional workloads often involves multiple levels of aggregation over these data sets. A relevant work in this area [4] provides a technique for lazily evaluating aggregates by fusing the group-by and join operators. In [5], the authors propose a set of low-level plan operators for SQL-style statistical expressions that modularize aggregate implementations whenever multiple aggregates are combined.

Work in this area provides the foundation for query evaluation on multidimensional data. The goal of this research is to go further by applying more robust optimization techniques and analyzing their cost.

Research Objectives

The primary focus of this work is to develop optimization techniques for queries on multidimensional workloads. During the first phase, the candidate will conduct a thorough state-of-the-art study of the cost models and optimization techniques applicable to this domain. The next phase will investigate the query constructs that cause bottlenecks in query evaluation. This means, in particular, identifying the most important logical constructs and the combinations of them that occur frequently in practice and are the most interesting to optimize.

Candidate’s Profile

The ideal candidate must hold an MSc in Computer Science or a closely related field. The candidate must be proactive and highly motivated to carry out advanced research, with well-developed analytical and problem-solving abilities. A good understanding of functional programming in Scala (or a willingness to learn) is required.

About the Team

This research will be carried out jointly by Opensee and the Tyrex team at Inria Grenoble Rhône-Alpes.

Opensee (opensee.io) is a fintech company headquartered in Paris. It offers instant, self-service analytics to financial institutions, helping them better respond to regulatory and business requirements and turn their big data challenges into a competitive advantage. The Core Engine team at Opensee researches, experiments with, maintains, and develops features for the query engine, which is at the core of data processing within Opensee.

The Tyrex team (tyrex.inria.fr) is affiliated with CNRS LIG, Inria, UGA, and Grenoble INP, and is located in Montbonnot, near Grenoble, France. The Tyrex research group focuses on the foundations of the next generation of data analytics and data-centric programming systems, and its work has led to many results in research and industrial applications. The candidate will graduate from the University of Grenoble Alpes (UGA).

Type of position: 3-year contract.

How to apply
The candidate should send an email to Pierre Genevès (pierre.geneves@inria.fr) and Nabil Layaïda (nabil.layaida@inria.fr) with an application file composed of:

- a detailed CV

- a motivation letter

- an academic transcript (relevé de notes)

References

  1. Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., & Neumann, T. (2015). How good are query optimizers, really? _Proceedings of the VLDB Endowment_, _9_(3), 204-215.
  2. Siddiqui, T., Jindal, A., Qiao, S., Patel, H., & Le, W. (2020, June). Cost models for big data query processing: Learning, retrofitting, and our findings. In _Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data_ (pp. 99-113).
  3. Xing, H., & Agrawal, G. (2019, July). Accelerating array joining with integrated value-index. In _Proceedings of the 31st International Conference on Scientific and Statistical Database Management_ (pp. 145-156).
  4. Fent, P., & Neumann, T. (2021). A practical approach to groupjoin and nested aggregates. _Proceedings of the VLDB Endowment_, _14_(11), 2383-2396.
  5. Kohn, A., Leis, V., & Neumann, T. (2021, June). Building advanced SQL analytics from low-level plan operators. In _Proceedings of the 2021 International Conference on Management of Data_ (pp. 1001-1013).