Dissertation theses
Data Indexing for Bioinformatics
Cheap technology for genome sequencing started a huge expansion of data to be stored and processed. Such a huge volume of data is needed for personalized medicine, for pangenome research, or for finding biomarkers for various diseases. Traditional indices allowing efficient search fail this time due to exceeding the available memory. They also do not exploit the important property of high similarity of human genome sequences.
In the last few years, new developments in stringology brought various compressed indices based on the Burrows-Wheeler transform. The goal of the project is to find more efficient methods for storing and indexing data for various tasks in Bioinformatics.
Natural Language Text Compression
Claude Shannon made an experiment in 1951 concluding that the entropy of English text is 0.6-1.3 bits per character. The model of such experiment includes knowledge of English grammar, English vocabulary, and the most common English phrases.
The goal of the dissertation is to increase the efficiency of compression of natural language texts using the relaxation of the above model. Several phases of natural language analysis will be used both to understand the text and to compress the text more efficiently.
XML Data Compression
The XML data format was introduced many years ago and it is widely used as a defacto standard for data representation and exchange over the WWW now. There are many XML compression techniques developed. Some do not allow queries without decompression of whole XML document, some do. However, the latest developments in stringology brought us fast and compressed data structures to store data based on the Burrows-Wheeler transform. The goal of the dissertation is to find more efficient methods for storing and querying data in XML format. Those methods should keep easy and fast access to the stored data while improving memory complexity.