prof. Ing. Jan Holub, Ph.D.

Projects

Algorithms for Next-Generation Sequencers

Program
Studentská grantová soutěž ČVUT
Code
SGS11/082/OHK3/1T/18
Period
2011
Description
Sequencing technology has come a long way since the time when traditional sequencing techniques required many labs around the world to cooperate for over a decade, in order to sequence the human genome for the first time. Nowadays, with the advent of next-generation sequencing (NGS) technologies, sequencing has been reduced to a matter of days or hours, and the cost has decreased by many orders of magnitude, making it an accessible experimental procedure to many laboratories. The data resulting from a single sequencing experiment can be quite large and it is not uncommon to have data from multiple experiments. This trend of increasing availability of sequencing will continue as projects even more ambitious than the 1000 Genomes Project start to materialize. This can be achieved only if computer scientists are designing new efficient algorithms for assembling millions of short sequences (reads), and for mapping a new set of reads to an existing genome sequence. The need for more efficie

Efficient String Matching for Bioinformatics

Program
Standard projects
Provider
Czech Science Foundation
Code
GA19-20759S
Period
2019 - 2020
Description
Index is a way to rapidly increase the speed of searching in data given in advance. The constructed index then allows search time proportional to the pattern size and the number of its occurrences. The aim of the project is to develop new algorithms and data structures for areas managing large data collections (DNA/RNA sequences) supporting not only exact pattern matching but also more complex tasks like degenerate or elastic pattern matching. There are many advanced techniques for indexing strings combining both data compression and stringology. However, there are still challenging new tasks for special cases like indexing highly similar texts where general purpose indexing methods are not efficient. This is for instance the case of genomes of the same species. Some on-line methods for elastic pattern matching will also be developed.

Natural Language and Formal Language Data Compression

Program
Studentská grantová soutěž ČVUT
Code
SGS10/306/OHK3/3T/18
Period
2010 - 2012
Description
The main goal of this project is design and implementation of novel methods of data compression. The contextual data compression methods are a part of lossless data compression methods. Lossless means that the compression process is fully reversible and decompressed data is identical to the original data. These methods are based on similarities in the input data. The main sight of nowadays research is aimed to word-based contextual data compression and preprocessing transformations. Word-based text compression is more adaptive to the input data. This approach takes advantage of the strictly defined structures of formal languages and natural languages and achieves better compression ratio than equivalent character-based data compression methods. The goal of this project is to use introduced possibilities to design new natural language compression methods with compression ratio comparable or better than the compression ratio of the best nowadays methods of data compression. This proje

Processing Tree Structures and Data Compression

Program
Studentská grantová soutěž ČVUT
Code
SGS13/097/OHK3/1T/18
Period
2013
Description
With the vast amount of data needed to be archived, indexed and procesed, special data structures are required. The tree is a typical data structure which is used very often for hierarchically storing data. Specialized algorithms are needed for indexing tree data structures and also accessing, extracting and analyzing data stored in them. The aim of this research is to design efficient yet simple to understand algorithms dealing with tree pattern matching (both exact and approximate) and tree indexing, and provide a toolkit implementation. Another goal of this project is design and implementation of novel methods of data compression in two areas: first, music score compression; second, natural language compression.

String and Tree Analysis and Processing

Program
Standard projects
Provider
Czech Science Foundation
Code
GA201/09/0807
Period
2009 - 2011
Description
Information society uses results of pattern matching every day and its importance keeps rising. The pattern matching is no longer limited to ordinary texts. Searching in more complex structures is required like searching in trees (XML data structures), in 2D images, or in compressed data. The proposed project aims not only to extend our research results in Stringology, but also apply our knowledge in quite new topic dealing with pattern matching in trees that we call Arborology. Our strong background in parsing seems to be very efficiently utilized in Arborology. In Stringology we would like to continue on topics like multidimensional pattern matching, searching for regularities in strings, generalized string matching, and parallel approaches to pattern matching. In Data Compression we developed algorithms for exact pattern matching in compressed data. We want to improve our results and expand to approximate pattern matching.

Text and Tree Structures Processing and Their Applications

Program
Standard projects
Provider
Czech Science Foundation
Code
GA13-03253S
Period
2013 - 2015
Description
The project deals with four topics which are closely related: Arbology, Data Compression for natural languages, and selected topics of Stringology and Bioinformatics. In Arbology we research new indexing and pattern matching algorithms on trees. In Bioinformatics we work on problems of mapping millions of short reads to genomic sequences and their indexing. In Data Compression we focus on efficient algorithms for natural languages based on knowledge of the source language and on algorithms allowing fast compression and decompression as well as efficient search. In Stringology we work on 2D text indexing and on algorithms for identifying cribbed texts and source codes, which may be compressed.