The Prague Stringology Club (PSC)

Efficient String Matching for Bioinformatics

Program

Standard projects

Provider

Czech Science Foundation

Departments

Department of Theoretical Computer Science

Investigators

prof. Ing. Jan Holub, Ph.D.

Code

GA19-20759S

Period

2019 - 2020

Description

An index is a way to greatly speed up searching in data that is known in advance. Once constructed, the index allows searching in time proportional to the pattern size and the number of its occurrences. The aim of the project is to develop new algorithms and data structures for areas that manage large data collections (DNA/RNA sequences), supporting not only exact pattern matching but also more complex tasks such as degenerate or elastic pattern matching. Many advanced string indexing techniques combine data compression and stringology. However, challenging tasks remain in special cases such as indexing highly similar texts, where general-purpose indexing methods are not efficient. This is the case, for instance, for genomes of the same species. Some on-line methods for elastic pattern matching will also be developed.
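The search-time claim above can be illustrated with a toy index. The sketch below is not one of the project's data structures. It is a naive suffix trie, assumed here only for illustration, and the names `build_suffix_trie` and `find_occurrences` are hypothetical. Construction takes quadratic time and space, so it is unsuitable for genome-scale data, but a query walks one node per pattern character and then reports the stored positions, so the lookup cost depends on the pattern length and the number of occurrences rather than on the text length.

```python
def build_suffix_trie(text):
    """Build a naive suffix trie over text (toy index, O(n^2) time and space).

    Each node stores the start positions of every occurrence of the
    substring spelled by the path from the root to that node.
    """
    root = {"children": {}, "positions": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            children = node["children"]
            if ch not in children:
                children[ch] = {"children": {}, "positions": []}
            node = children[ch]
            node["positions"].append(start)
    return root


def find_occurrences(index, pattern):
    """Return start positions of a non-empty pattern.

    The walk costs one step per pattern character; copying the result
    costs one step per occurrence.
    """
    node = index
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # pattern does not occur in the indexed text
    return list(node["positions"])


if __name__ == "__main__":
    index = build_suffix_trie("GATTACATTA")
    print(find_occurrences(index, "TTA"))  # [2, 7]
    print(find_occurrences(index, "GG"))   # []
```

Practical indexes for DNA/RNA collections reach similar query behaviour in far less space, which is where the combination of data compression and stringology mentioned in the description comes in.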

Text and Tree Structures Processing and Their Applications

Program

Standard projects

Provider

Czech Science Foundation

Departments

Department of Theoretical Computer Science

Investigators

prof. Ing. Jan Holub, Ph.D.

Code

GA13-03253S

Period

2013 - 2015

Description

The project deals with four closely related topics: Arbology, Data Compression for natural languages, and selected topics of Stringology and Bioinformatics. In Arbology we research new indexing and pattern matching algorithms on trees. In Bioinformatics we work on the problems of mapping millions of short reads to genomic sequences and of indexing them. In Data Compression we focus on efficient algorithms for natural languages based on knowledge of the source language, and on algorithms allowing fast compression and decompression as well as efficient search. In Stringology we work on 2D text indexing and on algorithms for identifying cribbed texts and source codes, which may be compressed.

String and Tree Analysis and Processing

Program

Standard projects

Provider

Czech Science Foundation

Departments

Department of Theoretical Computer Science

Investigators

prof. Ing. Jan Holub, Ph.D.

Code

GA201/09/0807

Period

2009 - 2011

Description

The information society uses the results of pattern matching every day, and its importance keeps rising. Pattern matching is no longer limited to ordinary texts. Searching in more complex structures is also required, such as searching in trees (XML data structures), in 2D images, or in compressed data. The proposed project aims not only to extend our research results in Stringology, but also to apply our knowledge to a quite new topic dealing with pattern matching in trees, which we call Arborology. Our strong background in parsing appears to be very well suited to Arborology. In Stringology we would like to continue with topics such as multidimensional pattern matching, searching for regularities in strings, generalized string matching, and parallel approaches to pattern matching. In Data Compression we have developed algorithms for exact pattern matching in compressed data. We want to improve our results and extend them to approximate pattern matching.
