Dissertation theses
Data Indexing for Bioinformatics
Supervisor
Level
Topic of dissertation thesis
Topic description
Cheap technology for genome sequencing started a huge expansion of data to be stored and processed. Such a huge volume of data is needed for personalized medicine, for pangenome research, or for finding biomarkers for various diseases. Traditional indices allowing efficient search fail this time due to exceeding the available memory. They also do not exploit the important property of high similarity of human genome sequences.
In the last few years, new developments in stringology brought various compressed indices based on the Burrows-Wheeler transform. The goal of the project is to find more efficient methods for storing and indexing data for various tasks in Bioinformatics.
Master theses
Music sheet compression
Author
Jan Baier
Year
2012
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Miroslav Balík, Ph.D.
Department
Incremental Clustering-Based Compression
Author
Luboš Krčál
Year
2014
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Jan Trávníček, Ph.D.
Department
Approximate String Matching for Self-Indexes
Author
Lukáš Hrbek
Year
2015
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Department
Summary
The work focuses on the approximate string matching with no more than k differences. Differences are defined by Levenshtein distance. I designed for the solution of this task filtering algorithm based on the pigeonhole principle. The text is assumed to be large, implemented algorithm therefore uses FM-index. The program was in the field of searching in DNA compared with the BLAST tool. Experiments have shown that in some aspects is my implementation better.
Compressing and Indexing Highly Similar Strings using LZW
Author
Ondřej Perutka
Year
2015
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Department
Summary
A new compression method based on LZW and sequence alignment is
presented in this thesis. The algorithm is called ALZW and it is designed for
compression of Highly Similar Strings. Strings in a given set are
compressed relatively to a given reference sequence. Compared to similarly
targeted RLZ and general purpose GZip, the algorithm offers very fast
compression and it achieves good compression ratios for similar genomic
sequences. It is even able to outperform the RLZ algorithm in case of human
chromosome 20.
Parallel Lossless Data Compression
Author
Ondřej Fiedler
Year
2016
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
doc. Ing. Ivan Šimeček, Ph.D.
Department
Summary
This thesis deals with the parallel lossless data compression, focusing on the usage of a graphics processing unit (GPU). It gives an introduction to data compression and parallel architectures and technologies. It contains a survey of parallel lossless data compressors for multiple platforms, including GPUs.
The main contribution of the thesis is an improved implementation of PMLZSS, which is a dictionary data compression method for GPUs based on LZSS by Bin Zhou, Hai Jin and Ran Zheng from 2015. All the technological and algorithmic improvements are described and evaluated.
Experiments are performed to compare the implementation with a sequential implementation of LZSS and with existing parallel solutions for multi-core processors and GPUs. The implementation is 9.4 to 22 times faster than the sequential solution while achieving comparable quality of compression. It is the fastest method for compressing binaries, documents or spreadsheets among measured solutions.
Compression of natural Czech text
Author
Jan Navara
Year
2017
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Department
Summary
In scope of this thesis, 4 algorithms for lossless compression of natural Czech text have been designed and implemented. The algorithms utilize the lemmatiser, morphological tagger and morphological generator contained in open-source software MorphoDiTa. First of the algorithms is based on storing lemmas of individual words and generating the word forms, the second one uses stored part-of-speech tags to estimate probability of words, the third one estimates probability of a word using part-of-speech tag of the previous word and the fourth one is an extension of the first algorithm, storing some parts of the tag alongside the lemma. The achieved compression ratios have been compared to compression ratios achieved by a basic word-based reference algorithm. The comparison has shown that the second and the third algorithm are not better than the reference algorithm in terms of compression ratio, while the first and the fourth algorithm are able to achieve better compression ratio than the reference algorithm when using an appropriate configuration (they typically improve the compression ratio by several tenths of percent). This thesis thus proved that using linguistic tools in compression of natural Czech texts may be beneficial in terms of compression ratio. An important by-product of this thesis is a highly universal implementation of adaptive PPM compression algorithm, which has been used as the core element of each of the above-mentioned algorithms.
Searching for CRISPR segments using self-index
Author
Ondřej Cvacho
Year
2016
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Department
Summary
The work focuses on the succinct data structures and their use for optimizing search for CRISPR segments using self-index. The search for segments uses approximate string matching by generating all possible segments with m or less mismatches and searching for each of them. I have proposed the solution to reduce the number of possible segments using the De Bruijn graphs as a filter. The experiments have shown that we can reduce the number of the segments almost four times.
PPM Data Compression Method using de Bruijn Graphs
Author
Jakub Kulík
Year
2019
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Petr Procházka, Ph.D.
Department
Summary
This thesis presents a slightly different approach to the PPM compression algorithm based on the succinct de Bruijn graphs. It uses dynamic bit vectors, and wavelet trees merged into a single tree to create a high performance dynamic succinct data structure capable of representing graphs used by the compressor. Even though the compressor is slow compared to other widely used compressors, it achieves good compression ratios while using much less memory.
Bioinformatics index tool for elastic degenerate string matching
Author
Dominika Bohuslavová
Year
2023
Type
Master thesis
Supervisor
prof. Ing. Jan Holub, Ph.D.
Reviewers
Ing. Luboš Krčál, Ph.D.
Department
Summary
The ability to search efficiently over large amount of data is an important part not only the field of bioinformatics but throughout the modern time.
Every organism is unique, and to create a complex structure appropriately representing genomes and their variants while at the same time being able to work with them efficiently seems to be the main pan-genomic research direction. This text deals with the discussions about the implementation of a relatively new algorithm called BIO-FMI with the potential to efficiently compress and search over a set of highly repetitive strings, such as the DNA sequences. The storage principles of variants are currently insufficient due to their position against one reference genome and some alternative solutions are being sought now. This thesis specializes in a modification of algorithm BIO-FMI to the format of elastic-degenerate strings which become candidate representatives in terms of storing variants. The thesis shows promising results in index construction time, setting variability while comparing them with other algorithms in this field, namely LZ-RLBWT and r-index.