Evolving Language Ecosystems
2016 - 2022
The Evolving Language Ecosystems project explores fundamental techniques and algorithms for evolving programming languages and their ecosystems. Our purpose is to reduce the cost of wide-ranging language changes and obviate the need to devise entirely new languages. Our findings will give both researchers and practitioners greater freedom to experiment with new ways of expressing computation.
Rigorous Engineering of Data Analysis Pipelines (RiGiD)
2023 - 2027
The RiGiD project lays the groundwork for this research programme and aims to develop a methodology for rigorous engineering of data analysis pipelines that can be adopted in practice. Our approach is pragmatic: rather than chasing functional correctness, we hope to substantially reduce the incidence of errors in the wild. The research is structured in three overlapping parts. The first part identifies the problem through user studies and large-scale program analysis of a corpus of over 100,000 data science pipelines. The outcome will be a catalog of error patterns as well as a labeled dataset to be shared with other researchers. The technical advances of this part will focus on combining dynamic and static program analysis to approximate the behavior of partial programs and programs written in highly dynamic languages. The second part proposes a methodology and tooling for developing data science code with reduced error rates. Its technical contributions focus on lightweight specification techniques and, in particular, on a novel gradual typing system that handles the common programming idioms found in our corpus, including various forms of object orientation, data frames, and rich value specifications. These specifications are complemented by an automated test generation technique that combines test and input synthesis with fuzzing and test minimization. Finally, the execution environment is extended to support automatic reproducibility and result audits through data lineage. The third and last part evaluates the proposal by conducting user studies and developing tools for automating deployment. The contribution will be a qualitative and quantitative assessment of the RiGiD methodology and tooling. The technical contribution will be tools that leverage program analysis to infer approximate specifications to assist deployment and adoption. Our tools target R, a language for data analytics with 2 milli