prof. Jan Vitek, MSc., Ph.D.

Projects

Big Code: Scalable Analysis of Massive Code Bases

Program
Operational Programme – Research, Development and Education – Structural Funds EU
Provider
European Commission
Code
EF15_003/0000421, CZ.02.1.01/0.0/0.0/15_003/0000421
Period
2019 - 2022
Description
The project aims to establish at FIT CTU the Institute of Scalable Code Analytics (ISCA), the first research centre in the Czech Republic focused on the analysis of the large code bases available on the Internet. Software systems are written in source code; "big code" refers to these massive online codebases. Combining techniques from programming languages and statistical machine learning will allow these codebases to be mined for crucial insights. Part of the requested funds will provide FIT with its first hardware and software infrastructure for big code analysis; the rest will be used to attract internationally renowned researchers in the field of programming languages. The new team is synergistic with existing research capacities at FIT in software and knowledge engineering, data mining, and parallel computing. It is well connected internationally and will bring investment from leading industrial partners, including Google and Oracle.

Evolving Language Ecosystems

Program
Horizon 2020
Provider
European Commission
Code
695412
Period
2016 - 2022
Description
The Evolving Language Ecosystems project explores the fundamental techniques and algorithms for evolving programming languages and their ecosystems. Our purpose is to reduce the cost of wide-ranging language changes and obviate the need for devising entirely new languages. Our findings will grant both researchers and practitioners a greater degree of freedom when experimenting with new ideas on how to express computation.

Reproducible Data Analysis for All

Program
Horizon Europe
Provider
European Commission
Code
101081989-R4R
Period
2024 - 2025
Description
Creating a reproducible environment for data-analysis pipelines is hard. The current practice is to assemble it manually, which is labor-intensive, error-prone, and requires skills and knowledge that data analysts do not usually have. While tools exist that try to simplify this process, they all rely on metadata that has to be provided by the user, and gathering this metadata is not trivial. Not only does one have to include all directly imported libraries together with their transitive dependencies, but each of these libraries can depend on native libraries and tools, which in turn have their own dependencies and configurations. Versions have to be pinned appropriately, as libraries frequently update and change their behavior. There is no ready-made description of these dependencies, so gathering the metadata is mostly a matter of experience and trial and error. The challenge we are addressing is to build an automated system that tracks all of a pipeline's dependencies, data inputs, and other sources of non-determinism in order to prepare an environment in which data-analysis pipelines can be run repeatedly, producing identical results.
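To make the metadata problem concrete, here is a minimal sketch, in Python and purely for illustration (it is not the project's tooling): it statically collects a pipeline script's direct imports and pins the installed version of each. The helper names direct_imports and pin_environment are hypothetical, and even this step covers only direct Python dependencies; the transitive, native, and configuration dependencies described above are precisely what an automated system must additionally track.

import ast
import sys
from importlib import metadata  # packages_distributions() requires Python 3.10+
from pathlib import Path

def direct_imports(pipeline_path: str) -> set[str]:
    # Top-level module names imported by the pipeline script.
    tree = ast.parse(Path(pipeline_path).read_text())
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names

def pin_environment(pipeline_path: str) -> list[str]:
    # Emit "package==version" pins for imports that map to installed distributions.
    dists_by_module = metadata.packages_distributions()
    pins = set()
    for mod in direct_imports(pipeline_path):
        for dist in dists_by_module.get(mod, []):
            pins.add(f"{dist}=={metadata.version(dist)}")
    return sorted(pins)

if __name__ == "__main__":
    print("\n".join(pin_environment(sys.argv[1])))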

Rigorous Engineering of Data Analysis Pipelines (RiGiD)

Program
Grant Projects of Excellence in Basic Research (EXPRO)
Provider
Czech Science Foundation
Code
GX23-07580X
Period
2023 - 2027
Description
The RiGiD project lays the groundwork for a research programme aimed at developing a methodology for the rigorous engineering of data analysis pipelines that can be adopted in practice. Our approach is pragmatic: rather than chasing functional correctness, we aim to substantially reduce the incidence of errors in the wild. The research is structured in three overlapping parts. The first identifies the problem by carrying out user studies and large-scale program analysis of a corpus of over 100,000 data science pipelines; the outcome will be a catalog of error patterns as well as a labeled dataset to be shared with other researchers. The technical advances here focus on combining dynamic and static program analysis to approximate the behavior of partial programs and of programs written in highly dynamic languages. The second part proposes a methodology and tooling for developing data science code with reduced error rates. Its technical contributions focus on lightweight specification techniques and, in particular, on a novel gradual typing system that deals with common programming idioms found in our corpus, including various forms of object orientation, data frames, and rich value specifications. These specifications are complemented by an automated test generation technique that combines test and input synthesis with fuzzing and test minimization. Finally, the execution environment is extended to support automatic reproducibility and result audits through data lineage. The third and last part evaluates the proposal by conducting user studies and developing tools for automating deployment. Its contribution will be a qualitative and quantitative assessment of the RiGiD methodology and tooling, together with tools that leverage program analysis to infer approximate specifications to assist deployment and adoption. Our tools target R, a language for data analytics with 2 million users.
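As an illustration of the lightweight, value-level specifications mentioned above, the following is a minimal sketch of a runtime contract for data-frame-valued functions. It is written in Python with pandas purely for readability (the project's tools target R), and the decorator name expects_columns is hypothetical rather than part of the RiGiD tooling.

from functools import wraps
import pandas as pd

def expects_columns(**column_types):
    # Check that the first argument is a DataFrame with the given columns and dtypes.
    def decorator(fn):
        @wraps(fn)
        def wrapper(df, *args, **kwargs):
            if not isinstance(df, pd.DataFrame):
                raise TypeError(f"{fn.__name__}: expected a DataFrame, got {type(df).__name__}")
            for col, dtype in column_types.items():
                if col not in df.columns:
                    raise ValueError(f"{fn.__name__}: missing column '{col}'")
                if not pd.api.types.is_dtype_equal(df[col].dtype, dtype):
                    raise TypeError(f"{fn.__name__}: column '{col}' has dtype "
                                    f"{df[col].dtype}, expected {dtype}")
            return fn(df, *args, **kwargs)
        return wrapper
    return decorator

@expects_columns(age="int64", income="float64")
def mean_income_by_age(df):
    return df.groupby("age")["income"].mean()

A gradual typing system in this spirit would discharge such checks statically where the code is typed and fall back to runtime checks only at the boundary with untyped code.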