prof. Ing. Jan Holub, Ph.D.

Profile
Publications
Projects
Teaching
Theses

Theses

Topics of theses
Sample theses

Dissertation theses

Data Indexing for Bioinformatics

Supervisor

prof. Ing. Jan Holub, Ph.D.

Level

Topic of dissertation thesis

Topic description

Cheap technology for genome sequencing started a huge expansion of data to be stored and processed. Such a huge volume of data is needed for personalized medicine, for pangenome research, or for finding biomarkers for various diseases. Traditional indices allowing efficient search fail this time due to exceeding the available memory. They also do not exploit the important property of high similarity of human genome sequences.

In the last few years, new developments in stringology brought various compressed indices based on the Burrows-Wheeler transform. The goal of the project is to find more efficient methods for storing and indexing data for various tasks in Bioinformatics.

Master theses

Music sheet compression

Author

Jan Baier

Year

2012

Type

Master thesis

Supervisor

prof. Ing. Jan Holub, Ph.D.

Reviewers

Ing. Miroslav Balík, Ph.D.

Department

Department of Theoretical Computer Science

Incremental Clustering-Based Compression

Author

Luboš Krčál

Year

2014

Type

Master thesis

Supervisor

prof. Ing. Jan Holub, Ph.D.

Reviewers

Ing. Jan Trávníček, Ph.D.

Department

Department of Theoretical Computer Science

Approximate String Matching for Self-Indexes

Author

Lukáš Hrbek

Year

2015

Type

Master thesis

Supervisor

prof. Ing. Jan Holub, Ph.D.

Reviewers

Ing. Petr Procházka, Ph.D.

Department

Department of Theoretical Computer Science

Summary

The work focuses on the approximate string matching with no more than k differences. Differences are defined by Levenshtein distance. I designed for the solution of this task filtering algorithm based on the pigeonhole principle. The text is assumed to be large, implemented algorithm therefore uses FM-index. The program was in the field of searching in DNA compared with the BLAST tool. Experiments have shown that in some aspects is my implementation better.

Thesis on DSpace

Compressing and Indexing Highly Similar Strings using LZW

Author

Ondřej Perutka

Year

2015

Type

Master thesis

Supervisor

prof. Ing. Jan Holub, Ph.D.

Reviewers

Ing. Petr Procházka, Ph.D.

Department

Department of Theoretical Computer Science

Summary

A new compression method based on LZW and sequence alignment is presented in this thesis. The algorithm is called ALZW and it is designed for compression of Highly Similar Strings. Strings in a given set are compressed relatively to a given reference sequence. Compared to similarly targeted RLZ and general purpose GZip, the algorithm offers very fast compression and it achieves good compression ratios for similar genomic sequences. It is even able to outperform the RLZ algorithm in case of human chromosome 20.

Thesis on DSpace

Parallel Lossless Data Compression

Author

Ondřej Fiedler

Year

2016

Type

Master thesis

Supervisor

prof. Ing. Jan Holub, Ph.D.

Reviewers

doc. Ing. Ivan Šimeček, Ph.D.

Department

Department of Theoretical Computer Science

Summary

This thesis deals with the parallel lossless data compression, focusing on the usage of a graphics processing unit (GPU). It gives an introduction to data compression and parallel architectures and technologies. It contains a survey of parallel lossless data compressors for multiple platforms, including GPUs. The main contribution of the thesis is an improved implementation of PMLZSS, which is a dictionary data compression method for GPUs based on LZSS by Bin Zhou, Hai Jin and Ran Zheng from 2015. All the technological and algorithmic improvements are described and evaluated. Experiments are performed to compare the implementation with a sequential implementation of LZSS and with existing parallel solutions for multi-core processors and GPUs. The implementation is 9.4 to 22 times faster than the sequential solution while achieving comparable quality of compression. It is the fastest method for compressing binaries, documents or spreadsheets among measured solutions.

Compression of natural Czech text

Author

Jan Navara

Year

2017

Type

Master thesis

Supervisor

prof. Ing. Jan Holub, Ph.D.

Reviewers

Ing. Petr Procházka, Ph.D.

Department

Department of Theoretical Computer Science

Summary

In scope of this thesis, 4 algorithms for lossless compression of natural Czech text have been designed and implemented. The algorithms utilize the lemmatiser, morphological tagger and morphological generator contained in open-source software MorphoDiTa. First of the algorithms is based on storing lemmas of individual words and generating the word forms, the second one uses stored part-of-speech tags to estimate probability of words, the third one estimates probability of a word using part-of-speech tag of the previous word and the fourth one is an extension of the first algorithm, storing some parts of the tag alongside the lemma. The achieved compression ratios have been compared to compression ratios achieved by a basic word-based reference algorithm. The comparison has shown that the second and the third algorithm are not better than the reference algorithm in terms of compression ratio, while the first and the fourth algorithm are able to achieve better compression ratio than the reference algorithm when using an appropriate configuration (they typically improve the compression ratio by several tenths of percent). This thesis thus proved that using linguistic tools in compression of natural Czech texts may be beneficial in terms of compression ratio. An important by-product of this thesis is a highly universal implementation of adaptive PPM compression algorithm, which has been used as the core element of each of the above-mentioned algorithms.

Thesis on DSpace

Searching for CRISPR segments using self-index

Author

Ondřej Cvacho

Year

2016

Type

Master thesis

Supervisor

prof. Ing. Jan Holub, Ph.D.

Reviewers

Ing. Petr Procházka, Ph.D.

Department

Department of Theoretical Computer Science

Summary

The work focuses on the succinct data structures and their use for optimizing search for CRISPR segments using self-index. The search for segments uses approximate string matching by generating all possible segments with m or less mismatches and searching for each of them. I have proposed the solution to reduce the number of possible segments using the De Bruijn graphs as a filter. The experiments have shown that we can reduce the number of the segments almost four times.

Thesis on DSpace

PPM Data Compression Method using de Bruijn Graphs

Author

Jakub Kulík

Year

2019

Type

Master thesis

Supervisor

prof. Ing. Jan Holub, Ph.D.

Reviewers

Ing. Petr Procházka, Ph.D.

Department

Department of Theoretical Computer Science

Summary

This thesis presents a slightly different approach to the PPM compression algorithm based on the succinct de Bruijn graphs. It uses dynamic bit vectors, and wavelet trees merged into a single tree to create a high performance dynamic succinct data structure capable of representing graphs used by the compressor. Even though the compressor is slow compared to other widely used compressors, it achieves good compression ratios while using much less memory.

Thesis on DSpace

Bioinformatics index tool for elastic degenerate string matching

Author

Dominika Bohuslavová

Year

2023

Type

Master thesis

Supervisor

prof. Ing. Jan Holub, Ph.D.

Reviewers

Ing. Luboš Krčál, Ph.D.

Department

Department of Theoretical Computer Science

Summary

The ability to search efficiently over large amount of data is an important part not only the field of bioinformatics but throughout the modern time. Every organism is unique, and to create a complex structure appropriately representing genomes and their variants while at the same time being able to work with them efficiently seems to be the main pan-genomic research direction. This text deals with the discussions about the implementation of a relatively new algorithm called BIO-FMI with the potential to efficiently compress and search over a set of highly repetitive strings, such as the DNA sequences. The storage principles of variants are currently insufficient due to their position against one reference genome and some alternative solutions are being sought now. This thesis specializes in a modification of algorithm BIO-FMI to the format of elastic-degenerate strings which become candidate representatives in terms of storing variants. The thesis shows promising results in index construction time, setting variability while comparing them with other algorithms in this field, namely LZ-RLBWT and r-index.

Thesis on DSpace