Efficient String Matching for Bioinformatics

Program: Standard projects
Provider: Czech Science Foundation
Code: GA19-20759S
Period: 2019 - 2020
Description:
An index is a data structure that substantially speeds up searching in data that is known in advance. Once constructed, the index allows search time proportional to the pattern size and the number of its occurrences. The aim of the project is to develop new algorithms and data structures for areas that manage large data collections (such as DNA/RNA sequences), supporting not only exact pattern matching but also more complex tasks such as degenerate or elastic pattern matching. Many advanced techniques for indexing strings already combine data compression and stringology. However, challenging tasks remain for special cases such as indexing highly similar texts, where general-purpose indexing methods are not efficient. This is, for instance, the case for genomes of the same species. Some on-line methods for elastic pattern matching will also be developed.
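
To illustrate the idea of an index built in advance, here is a minimal sketch using a suffix array. It is not one of the project's data structures. The function names `build_suffix_array` and `find_occurrences` are hypothetical, the construction is deliberately naive, and a lookup costs roughly O(m log n) character comparisons plus the number of occurrences, which only approximates the bound mentioned above.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction for illustration only; practical indexes use
    linear-time or compressed constructions.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of every exact occurrence of pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in [first, lo) start with pattern.
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    dna = "ACGTACGTAC"          # toy sequence, assumed for the example
    index = build_suffix_array(dna)
    print(find_occurrences(dna, index, "ACG"))
```

The index is built once for the text, and each later query only performs two binary searches over it, so the text itself is never scanned from start to end.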

Text and Tree Structures Processing and Their Applications

Program: Standard projects
Provider: Czech Science Foundation
Code: GA13-03253S
Period: 2013 - 2015
Description:
The project deals with four closely related topics: Arbology, Data Compression for natural languages, and selected topics in Stringology and Bioinformatics. In Arbology we research new indexing and pattern matching algorithms on trees. In Bioinformatics we work on the problems of mapping millions of short reads to genomic sequences and of indexing them. In Data Compression we focus on efficient algorithms for natural languages based on knowledge of the source language, and on algorithms allowing fast compression and decompression as well as efficient search. In Stringology we work on 2D text indexing and on algorithms for identifying plagiarized texts and source code, which may be compressed.

String and Tree Analysis and Processing

Program: Standard projects
Provider: Czech Science Foundation
Code: GA201/09/0807
Period: 2009 - 2011
Description:
The information society uses the results of pattern matching every day, and their importance keeps rising. Pattern matching is no longer limited to ordinary texts. Searching in more complex structures is also required, such as searching in trees (XML data structures), in 2D images, or in compressed data. The proposed project aims not only to extend our research results in Stringology, but also to apply our knowledge to a relatively new topic, pattern matching in trees, which we call Arbology. Our strong background in parsing appears to be very useful in Arbology. In Stringology we would like to continue with topics such as multidimensional pattern matching, searching for regularities in strings, generalized string matching, and parallel approaches to pattern matching. In Data Compression we have developed algorithms for exact pattern matching in compressed data. We want to improve these results and extend them to approximate pattern matching.

The person responsible for the content of this page: doc. Ing. Štěpán Starosta, Ph.D.