Prof. Jan Vitek, MSc., Ph.D.

Projects

Big Code: Scalable Analysis of Large Code Bases

Program
OPVVV - Operational Programme Research, Development and Education - EU Structural Funds
Provider
European Commission
Code
EF15_003/0000421, CZ.02.1.01/0.0/0.0/15_003/0000421
Period
2019 - 2022
Description
The BigCode project will establish the Institute for Scalable Code Analysis (ISCA) at the Faculty of Information Technology (FIT) of the Czech Technical University in Prague, the first research center in the Czech Republic devoted to the analysis of the large bodies of program code available on the internet, which hold an enormous knowledge potential that we are so far unable to exploit. BigCode aims to analyze this code base using programming-language techniques and statistical machine learning and to make the extracted information intelligible. The requested funds will be used to build the center's hardware and software infrastructure and to cover the salaries of internationally recognized researchers. The project's research agenda is in line with existing research at FIT: software and knowledge engineering, data mining, and parallel computing. Thanks to the international contacts of the BigCode team members, the center subsequently expects to attract support from leading software companies such as Google and Oracle.

Evolving Language Ecosystems

Program
Horizon 2020
Provider
European Commission
Code
695412
Period
2016 - 2022
Description
The Evolving Language Ecosystems project explores the fundamental techniques and algorithms for evolving programming languages and their ecosystems. Our purpose is to reduce the cost of wide-ranging language changes and obviate the need for devising entirely new languages. Our findings will grant both researchers and practitioners a greater degree of freedom when experimenting with new ideas on how to express computation.

Reproducible Data Analysis for All

Program
Horizon Europe
Provider
European Commission
Code
101081989-R4R
Period
2024 - 2025
Description
Creating a reproducible environment for data-analysis pipelines is hard. The current practice is to assemble it manually, which is labor-intensive, error-prone, and requires skills and knowledge that data analysts do not usually have. Tools that try to simplify the process do exist, but they all rely on metadata that the user has to provide, and obtaining that metadata is not trivial. Not only must one include all directly imported libraries together with their transitive dependencies, but each of these libraries can depend on native libraries and tools, which in turn have their own dependencies and configurations. Versions have to be pinned appropriately, as libraries frequently update and change their behavior. These dependencies are not described anywhere, so gathering the metadata is largely a matter of experience and trial and error. The challenge we are addressing is to build an automated system that tracks all of a pipeline's dependencies, data inputs, and other sources of non-determinism in order to prepare an environment in which data-analysis pipelines can run repeatedly and produce identical results.
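
As a minimal sketch of why gathering this metadata by hand is tedious, the following R snippet (R being the language the RiGiD project below targets) enumerates the transitive package dependencies of a pipeline's direct imports and pins the versions currently installed. The two imports are placeholders, and the snippet deliberately stops where the hard part begins: native libraries and system tools carry no such machine-readable metadata.

    # Sketch: list the transitive R package dependencies of a pipeline's
    # direct imports and pin installed versions. The imports are placeholders.
    direct <- c("ggplot2", "dplyr")
    db <- available.packages()                      # CRAN metadata snapshot
    deps <- tools::package_dependencies(direct, db = db,
                                        which = c("Depends", "Imports"),
                                        recursive = TRUE)
    pkgs <- sort(unique(c(direct, unlist(deps))))
    # Pin whatever version happens to be installed right now.
    vers <- vapply(pkgs, function(p)
      tryCatch(as.character(packageVersion(p)),
               error = function(e) NA_character_),
      character(1))
    writeLines(sprintf("%s==%s", pkgs, vers))
    # Native libraries (e.g., libcurl behind the curl package) and system
    # tools are invisible here -- exactly the gap described above.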

Rigorous Engineering of Data Analysis Pipelines (RiGiD)

Program
EXPRO - Grant Projects of Excellence in Basic Research
Poskytovatel
Czech Science Foundation
Code
GX23-07580X
Period
2023 - 2027
Description
The RiGiD project lays the groundwork for a research programme whose aim is a methodology for the rigorous engineering of data analysis pipelines that can be adopted in practice. Our approach is pragmatic: rather than chasing functional correctness, we hope to substantially reduce the incidence of errors in the wild. The research is structured in three overlapping parts. The first identifies the problem by carrying out user studies and large-scale program analysis of a corpus of over 100,000 data science pipelines. The outcome will be a catalog of error patterns as well as a labeled dataset to be shared with other researchers. The technical advances will focus on combining dynamic and static program analysis to approximate the behavior of partial programs and of programs written in highly dynamic languages. The second part of our effort proposes a methodology and tooling for developing data science code with reduced error rates. The technical contributions of this part focus on lightweight specification techniques and, in particular, the development of a novel gradual typing system that handles common programming idioms found in our corpus, including various forms of object orientation, data frames, and rich value specifications. These specifications are complemented by an automated test generation technique that combines test and input synthesis with fuzzing and test minimization. Finally, the execution environment is extended to support automatic reproducibility and result audits through data lineage. The third and last part of the work evaluates the proposal by conducting user studies and developing tools that automate deployment. The contribution will be a qualitative and quantitative assessment of the RiGiD methodology and tooling; the technical contribution will be tools that leverage program analysis to infer approximate specifications to assist deployment and adoption. Our tools target R, a language for data analytics with 2 million users.
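
To make "lightweight specification" concrete, here is a hypothetical R sketch of the kind of contract on a data-frame argument that the description alludes to. The function names and the checked properties are illustrative assumptions, not RiGiD's actual API.

    # Hypothetical sketch, not RiGiD's tooling: a runtime contract that
    # specifies the shape and value constraints of a data-frame argument.
    check_measurements <- function(df) {
      stopifnot(
        is.data.frame(df),
        all(c("id", "age", "dose") %in% names(df)),  # required columns
        is.numeric(df$age), all(df$age >= 0),        # rich value constraint
        is.numeric(df$dose), !anyNA(df$dose)         # no missing doses
      )
      df
    }

    mean_dose_by_adult <- function(df) {
      df <- check_measurements(df)   # contract enforced at the boundary
      tapply(df$dose, df$age >= 18, mean)
    }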