Dissertation theses
Fast and robust data-analysis pipelines
Data analysis is typically performed by composing a series of discrete tools and libraries into a data-analysis pipeline. These pipelines are at the core of data-driven science, which has become central to most disciplines and is now seeing rapid growth in both the use of computational methods and the volume of available data. As the number of tools and the size of the data keep growing, we face problems with the scalability of the pipelines and the trustworthiness of their results.
The goal of this work is to research ways to make data-analysis pipelines scalable (able to accommodate growing data and computational needs) and trustworthy (amenable to auditing of the analysis results). The research will proceed along two axes. The first will focus on extending the R programming language with transparent horizontal and vertical scaling. The second will study a combination of static and dynamic program analysis techniques to gain insight into the nature and severity of programming errors in the code of data-analysis pipelines, and will propose algorithms for their detection and possible automated repair.
Seeing Through the FFI: Analyzing Native Code in Dynamic Languages
Specialist supervisor: Mgr. Tomáš Petříček, Ph.D.
Dynamic programming languages such as R, Python, or Ruby frequently rely on extensive collections of built-in functions implemented in native languages like C or C++. For example, R's base package alone has close to a thousand built-in functions written in C and Fortran. This reliance arises from a persistent performance gap between high-level dynamic languages and low-level compiled ones. To mitigate this gap, language implementers and library authors turn to the foreign-function interface (FFI), which allows performance-critical functionality to be implemented natively and invoked from the dynamic language runtime.
While FFIs are essential for achieving acceptable performance, they also introduce a fundamental barrier to program analysis and optimization. Native code, typically written in C or C++, is opaque to the compiler and analysis tools of the host dynamic language. As a result, compilers cannot reason about the semantics, side effects, or data dependencies of these native functions. This severely limits the scope of optimizations and hinders any attempt at whole-program reasoning across the boundary between the high-level language and its native extensions.
This PhD project aims to systematically analyze and model built-in and FFI-based native functions to bridge this semantic gap. The first objective is to develop techniques for extracting or approximating behavioral information about such functions, including their purity, control effects, and data transformations. This knowledge can then be leveraged to enhance static and dynamic analyses, improve optimization pipelines, and increase the safety and predictability of native interoperation.
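As a rough illustration of how such behavioral information might be used, the sketch below folds calls to built-ins at compile time when a summary marks them as pure. It is only a toy under stated assumptions, not the project's method: Python stands in for the host language, Python's own `math.sqrt` and `len` stand in for native built-ins, and the `EffectSummary` type, the `SUMMARIES` table, and the `fold` helper are hypothetical names introduced here. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectSummary:
    """Hypothetical behavioral summary of one native built-in."""
    pure: bool       # result depends only on the arguments, no side effects
    may_raise: bool  # control effect: the call can signal an error


# Hand-written summaries (an assumption for this sketch; the project aims
# to extract or approximate such information from the native code itself).
SUMMARIES = {
    "sqrt": (math.sqrt, EffectSummary(pure=True, may_raise=True)),
    "len": (len, EffectSummary(pure=True, may_raise=True)),
    "print": (print, EffectSummary(pure=False, may_raise=False)),
}


class FoldPureCalls(ast.NodeTransformer):
    """Replace calls to pure built-ins on constant arguments by their value."""

    def visit_Call(self, node):
        self.generic_visit(node)  # fold nested calls first
        if not isinstance(node.func, ast.Name) or node.keywords:
            return node
        entry = SUMMARIES.get(node.func.id)
        if entry is None:
            return node  # unknown function: stays opaque
        impl, summary = entry
        if not summary.pure:
            return node  # side effects must be preserved
        if not all(isinstance(arg, ast.Constant) for arg in node.args):
            return node
        try:
            value = impl(*(arg.value for arg in node.args))
        except Exception:
            return node  # keep the call so the error surfaces at run time
        return ast.copy_location(ast.Constant(value=value), node)


def fold(source: str) -> str:
    """Return the source text with foldable built-in calls evaluated."""
    tree = FoldPureCalls().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

For example, `fold("print(sqrt(16.0))")` rewrites the inner call to a constant but leaves `print` alone, because its summary marks it as impure. Without a summary, every call would have to be treated like the unknown case and left untouched.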
Building on this foundation, the research will explore approaches for raising the level of abstraction of built-ins, that is, expressing their semantics in a form that can be understood and optimized by the host compiler. Ultimately, the goal is to bring native functionality closer to the source language, enabling whole-program analysis and optimization.