prof. Ing. Jan Holub, Ph.D.

Theses

Dissertation theses

Data Indexing for Bioinformatics

Level
Topic of dissertation thesis
Topic description

Cheap technology for genome sequencing started a huge expansion of data to be stored and processed. Such a huge volume of data is needed for personalized medicine, for pangenome research, or for finding biomarkers for various diseases. Traditional indices allowing efficient search fail this time due to exceeding the available memory. They also do not exploit the important property of high similarity of human genome sequences.

In the last few years, new developments in stringology brought various compressed indices based on the Burrows-Wheeler transform. The goal of the project is to find more efficient methods for storing and indexing data for various tasks in Bioinformatics.

Natural Language Text Compression

Level
Topic of dissertation thesis
Topic description

Claude Shannon made an experiment in 1951 concluding that the entropy of English text is 0.6-1.3 bits per character. The model of such experiment includes knowledge of English grammar, English vocabulary, and the most common English phrases.

The goal of the dissertation is to increase the efficiency of compression of natural language texts using the relaxation of the above model. Several phases of natural language analysis will be used both to understand the text and to compress the text more efficiently.

XML Data Compression

Level
Topic of dissertation thesis
Topic description

The XML data format was introduced many years ago and it is widely used as a defacto standard for data representation and exchange over the WWW now. There are many XML compression techniques developed. Some do not allow queries without decompression of whole XML document, some do. However, the latest developments in stringology brought us fast and compressed data structures to store data based on the Burrows-Wheeler transform. The goal of the dissertation is to find more efficient methods for storing and querying data in XML format. Those methods should keep easy and fast access to the stored data while improving memory complexity.

Master theses

Music sheet compression

Author
Jan Baier
Year
2012
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Miroslav Balík, Ph.D.

Incremental Clustering-Based Compression

Author
Luboš Krčál
Year
2014
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Jan Trávníček, Ph.D.

Approximate String Matching for Self-Indexes

Author
Lukáš Hrbek
Year
2015
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Summary
The work focuses on the approximate string matching with no more than k differences. Differences are defined by Levenshtein distance. I designed for the solution of this task filtering algorithm based on the pigeonhole principle. The text is assumed to be large, implemented algorithm therefore uses FM-index. The program was in the field of searching in DNA compared with the BLAST tool. Experiments have shown that in some aspects is my implementation better.

Compressing and Indexing Highly Similar Strings using LZW

Author
Ondřej Perutka
Year
2015
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Summary
A new compression method based on LZW and sequence alignment is presented in this thesis. The algorithm is called ALZW and it is designed for compression of Highly Similar Strings. Strings in a given set are compressed relatively to a given reference sequence. Compared to similarly targeted RLZ and general purpose GZip, the algorithm offers very fast compression and it achieves good compression ratios for similar genomic sequences. It is even able to outperform the RLZ algorithm in case of human chromosome 20.

Parallel Lossless Data Compression

Author
Ondřej Fiedler
Year
2016
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
doc. Ing. Ivan Šimeček, Ph.D.
Summary
This thesis deals with the parallel lossless data compression, focusing on the usage of a graphics processing unit (GPU). It gives an introduction to data compression and parallel architectures and technologies. It contains a survey of parallel lossless data compressors for multiple platforms, including GPUs. The main contribution of the thesis is an improved implementation of PMLZSS, which is a dictionary data compression method for GPUs based on LZSS by Bin Zhou, Hai Jin and Ran Zheng from 2015. All the technological and algorithmic improvements are described and evaluated. Experiments are performed to compare the implementation with a sequential implementation of LZSS and with existing parallel solutions for multi-core processors and GPUs. The implementation is 9.4 to 22 times faster than the sequential solution while achieving comparable quality of compression. It is the fastest method for compressing binaries, documents or spreadsheets among measured solutions.

Compression of natural Czech text

Author
Jan Navara
Year
2017
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Summary
In scope of this thesis, 4 algorithms for lossless compression of natural Czech text have been designed and implemented. The algorithms utilize the lemmatiser, morphological tagger and morphological generator contained in open-source software MorphoDiTa. First of the algorithms is based on storing lemmas of individual words and generating the word forms, the second one uses stored part-of-speech tags to estimate probability of words, the third one estimates probability of a word using part-of-speech tag of the previous word and the fourth one is an extension of the first algorithm, storing some parts of the tag alongside the lemma. The achieved compression ratios have been compared to compression ratios achieved by a basic word-based reference algorithm. The comparison has shown that the second and the third algorithm are not better than the reference algorithm in terms of compression ratio, while the first and the fourth algorithm are able to achieve better compression ratio than the reference algorithm when using an appropriate configuration (they typically improve the compression ratio by several tenths of percent). This thesis thus proved that using linguistic tools in compression of natural Czech texts may be beneficial in terms of compression ratio. An important by-product of this thesis is a highly universal implementation of adaptive PPM compression algorithm, which has been used as the core element of each of the above-mentioned algorithms.

Searching for CRISPR segments using self-index

Author
Ondřej Cvacho
Year
2016
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Summary
The work focuses on the succinct data structures and their use for optimizing search for CRISPR segments using self-index. The search for segments uses approximate string matching by generating all possible segments with m or less mismatches and searching for each of them. I have proposed the solution to reduce the number of possible segments using the De Bruijn graphs as a filter. The experiments have shown that we can reduce the number of the segments almost four times.

PPM Data Compression Method using de Bruijn Graphs

Author
Jakub Kulík
Year
2019
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Summary
This thesis presents a slightly different approach to the PPM compression algorithm based on the succinct de Bruijn graphs. It uses dynamic bit vectors, and wavelet trees merged into a single tree to create a high performance dynamic succinct data structure capable of representing graphs used by the compressor. Even though the compressor is slow compared to other widely used compressors, it achieves good compression ratios while using much less memory.

Bioinformatics index tool for elastic degenerate string matching

Author
Dominika Bohuslavová
Year
2023
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Luboš Krčál, Ph.D.
Summary
The ability to search efficiently over large amount of data is an important part not only the field of bioinformatics but throughout the modern time. Every organism is unique, and to create a complex structure appropriately representing genomes and their variants while at the same time being able to work with them efficiently seems to be the main pan-genomic research direction. This text deals with the discussions about the implementation of a relatively new algorithm called BIO-FMI with the potential to efficiently compress and search over a set of highly repetitive strings, such as the DNA sequences. The storage principles of variants are currently insufficient due to their position against one reference genome and some alternative solutions are being sought now. This thesis specializes in a modification of algorithm BIO-FMI to the format of elastic-degenerate strings which become candidate representatives in terms of storing variants. The thesis shows promising results in index construction time, setting variability while comparing them with other algorithms in this field, namely LZ-RLBWT and r-index.