Ing. Petr Procházka, Ph.D.

Backward Pattern Matching on Elastic Degenerate Strings

Authors

Procházka, P.; Cvacho, O.; Krčál, L.; Holub, J.

Year

2021

Published in

Proceedings of 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021). Lisboa: SCITEPRESS – Science and Technology Publications, Lda, 2021. p. 50-59. vol. 3. ISSN 2184-4305. ISBN 978-989-758-490-9.

Type

Conference paper

Department

Department of Theoretical Computer Science

Abstract

Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms have been presented so far, and some of them provide a practical implementation. We propose a new on-line EDSM algorithm, BNDM-EDS. Our algorithm combines two traditional algorithms, BNDM and Shift-And, adapted to the specifics of Elastic Degenerate Strings. BNDM-EDS runs in $\mathcal{O}(Nm\lceil m/w \rceil)$ worst-case time. This implies $\mathcal{O}(Nm)$ time for small patterns (those with $m \le w$, so that $\lceil m/w \rceil = 1$), where $m$ is the length of the searched pattern, $N$ is the size of the EDS, and $w$ is the size of the computer word. The algorithm uses $\mathcal{O}(N + n)$ space, where $n$ is the length of the EDS. BNDM-EDS requires a simple preprocessing step with $\mathcal{O}(m)$ time and space. Experimental results on real genomic data show the superiority of BNDM-EDS over state-of-the-art algorithms.
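As a point of reference for the bit-parallel building blocks named in the abstract, the following is a minimal sketch of the classic Shift-And algorithm on an ordinary (non-degenerate) string. It is not BNDM-EDS: the function name `shift_and` and the example strings are illustrative assumptions, and the handling of elastic degenerate segments is not shown.

```python
def shift_and(pattern: str, text: str) -> list[int]:
    """Return start positions of all occurrences of pattern in text.

    Classic Shift-And on a plain string (illustrative only, not BNDM-EDS).
    """
    m = len(pattern)
    if m == 0:
        return []
    # Preprocessing, O(m): masks[c] has bit i set iff pattern[i] == c.
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that signals a full match
    state = 0              # bit i set iff pattern[:i+1] ends at the current position
    hits = []
    for j, c in enumerate(text):
        # Extend every active prefix by one symbol and start a new one.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


print(shift_and("ACA", "ACACAGACA"))  # [0, 2, 6]
```

Python integers are unbounded, so the sketch hides the machine word size $w$; in a fixed-width implementation a pattern longer than $w$ needs $\lceil m/w \rceil$ words per state update, which is where that factor in the worst-case bound comes from.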

On-line Searching in IUPAC nucleotide sequences

Authors

Procházka, P.; Holub, J.

Year

2019

Published in

Proceedings of 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019). Lisboa: SCITEPRESS – Science and Technology Publications, Lda, 2019. p. 66-77. vol. 3. ISSN 2184-4305. ISBN 978-989-758-353-7.

Type

Conference paper

DOI

10.5220/0007382900660077

Department

Department of Theoretical Computer Science

Abstract

We propose a novel pattern matching algorithm for consensus nucleotide sequences over the IUPAC alphabet, called BADPM (Byte-Aligned Degenerate Pattern Matching). Consensus nucleotide sequences represent a consensus obtained by sequencing a population of the same species, and they are treated as so-called degenerate strings. BADPM works at the level of single bytes and achieves sublinear search time on average. The algorithm is based on tabulating all possible factors of the searched pattern. It needs an $\mathcal{O}(m + m\alpha^2 \log m)$-space data structure and $\mathcal{O}(m\alpha^2)$ time for preprocessing, where $m$ is the length of the pattern and $\alpha$ is the maximum number of variants implied by a 4-gram over the IUPAC alphabet. The worst-case locate time of BADPM is bounded by $\mathcal{O}(nm^2\alpha^4)$, where $n$ is the length of the input text. However, experiments performed on real genomic data confirmed the sublinear search time. BADPM can easily cooperate with the block q-gram inverted index and thus achieve an even better locate time. We implemented two other pattern matching algorithms for IUPAC nucleotide sequences as a baseline: Boyer-Moore-Horspool (BMH) and Parallel Naive Search (PNS). PNS in particular proves efficient regardless of the length $m$ of the searched pattern. BADPM clearly outperformed both baselines when searching for medium and long patterns.
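To make the notion of matching in a degenerate IUPAC string concrete, here is a minimal sketch of a naive search that encodes every IUPAC symbol as a 4-bit set of nucleotides and treats two symbols as matching when their sets intersect. This is only a simple baseline in the spirit of a naive search, not BADPM or the authors' PNS; the names `IUPAC` and `naive_iupac_search` and the matching-by-intersection convention are assumptions made for illustration.

```python
# Each IUPAC nucleotide code as a bit set: A=1, C=2, G=4, T=8.
IUPAC = {
    "A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000,
    "R": 0b0101, "Y": 0b1010, "S": 0b0110, "W": 0b1001,
    "K": 0b1100, "M": 0b0011, "B": 0b1110, "D": 0b1101,
    "H": 0b1011, "V": 0b0111, "N": 0b1111,
}


def naive_iupac_search(pattern: str, text: str) -> list[int]:
    """Return start positions where pattern matches the degenerate text.

    Two symbols match when their nucleotide sets share at least one base.
    Runs in O(n * m) time; illustrative baseline only.
    """
    m = len(pattern)
    if m == 0:
        return []
    p = [IUPAC[c] for c in pattern]
    t = [IUPAC[c] for c in text]
    return [
        i
        for i in range(len(t) - m + 1)
        if all(p[k] & t[i + k] for k in range(m))
    ]


print(naive_iupac_search("ACG", "ANGRCGT"))  # [0, 3]
```

In the example, `N` stands for any base and `R` for A or G, so the pattern `ACG` matches at positions 0 and 3.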

Byte-Aligned Pattern Matching in Encoded Genomic Sequences

Authors

Procházka, P.; Holub, J.

Year

2017

Published in

Proceedings of 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Saarbrücken: Dagstuhl Publishing, 2017. p. 20:1-20:13. ISSN 1868-8969. ISBN 978-3-95977-050-7.

Type

Conference paper

DOI

10.4230/LIPIcs.WABI.2017.20

Department

Department of Theoretical Computer Science

Abstract

In this article, we propose a novel pattern matching algorithm, called BAPM, that searches in encoded genomic sequences. The algorithm works at the level of single bytes and achieves sublinear performance on average. The preprocessing phase of the algorithm is linear with respect to the size of the searched pattern $m$. A simple $\mathcal{O}(m)$-space data structure is used to store all factors (of a defined length) of the searched pattern. These factors are then looked up during the searching phase, which ensures sublinear time on average. Our algorithm significantly outperforms state-of-the-art pattern matching algorithms in locate time on medium and long patterns. Furthermore, it cooperates very easily with the block $q$-gram inverted index. The block $q$-gram inverted index together with our pattern matching algorithm achieves better locate time than current index data structures for less frequent patterns. We present experimental results using real genomic data. These results demonstrate the efficiency of our algorithm.
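The idea of tabulating fixed-length factors of the pattern and probing the text only at sampled positions can be sketched as follows. This is a generic $q$-gram sampling filter on an unencoded string, written under assumptions of my own (the function name `factor_filter_search`, the default `q=4`, and the sampling step $m - q + 1$); BAPM itself operates on byte-encoded sequences, and its actual data structure is not described in the abstract.

```python
def factor_filter_search(pattern: str, text: str, q: int = 4) -> list[int]:
    """Find all occurrences of pattern using a table of its q-length factors.

    Generic q-gram sampling filter (illustrative, not the BAPM algorithm).
    """
    m, n = len(pattern), len(text)
    if q < 1 or m < q or n < m:
        return []
    # Preprocessing: map every q-length factor of the pattern to its offsets.
    table: dict[str, list[int]] = {}
    for k in range(m - q + 1):
        table.setdefault(pattern[k:k + q], []).append(k)
    # Any occurrence covers m - q + 1 consecutive q-gram start positions,
    # so sampling the text with this step hits exactly one of them.
    step = m - q + 1
    hits = set()
    for j in range(0, n - q + 1, step):
        for k in table.get(text[j:j + q], ()):
            s = j - k  # candidate start of the whole pattern
            if 0 <= s <= n - m and text[s:s + m] == pattern:
                hits.add(s)
    return sorted(hits)


print(factor_filter_search("ACGTACGT", "TTACGTACGTACGTTT"))  # [2, 6]
```

Only about $n / (m - q + 1)$ text positions are inspected before verification, which is why longer patterns tend to benefit most from this kind of filtering.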

Towards Efficient Positional Inverted Index

Authors

Procházka, P.; Holub, J.

Year

2017

Published in

Algorithms. 2017, 10(1), ISSN 1999-4893.

Type

Journal article

DOI

10.3390/a10010030

Department

Department of Theoretical Computer Science

Abstract

We address the problem of positional indexing in the natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. Inverted lists of single terms are combined into one inverted list that represents the backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stop words, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The space complexity of the PISI inverted list is $\mathcal{O}\big((N-n)\lceil \log_{2^b} N \rceil + (\lfloor \frac{N-n}{\alpha} \rfloor + n) \times (\lceil \log_{2^b} n \rceil + 1)\big)$, where $N$ is the number of stems, $n$ is the number of unique stems, $\alpha$ is the step/period of the back pointers in the inverted list and $b$ is the size of the word of computer memory given in bits. The space complexity of the presentation layer is $\mathcal{O}\big(-\sum_{i=1}^{N} \lceil \log_2 p_i^{n(i)} \rceil - \sum_{j=1}^{N'} \lceil \log_2 p'_j \rceil + N\big)$, with $p_i^{n(i)}$ as the probability of a stem variant at position $i$, $p'_j$ as the probability of a separator or stop word at position $j$, and $N'$ as the number of separators and stop words.
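The phrase "word position gaps encoded by variable byte code" can be illustrated with a minimal sketch. It assumes one common variable-byte convention (7 payload bits per byte, low-order group first, high bit set on every byte except the last); the exact byte layout used by PISI is not stated in the abstract, and the function names below are hypothetical.

```python
def vbyte_encode(value: int) -> bytes:
    """Encode a non-negative integer with a variable byte code."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while value >= 128:
        out.append((value & 0x7F) | 0x80)  # continuation bit set
        value >>= 7
    out.append(value)                       # last byte: high bit clear
    return bytes(out)


def vbyte_decode_all(data: bytes) -> list[int]:
    """Decode a concatenation of variable-byte codewords."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_positions(positions: list[int]) -> bytes:
    """Store sorted word positions as variable-byte encoded gaps."""
    out, previous = bytearray(), 0
    for position in positions:
        out += vbyte_encode(position - previous)
        previous = position
    return bytes(out)


def decode_positions(data: bytes) -> list[int]:
    """Rebuild absolute positions by summing the decoded gaps."""
    positions, previous = [], 0
    for gap in vbyte_decode_all(data):
        previous += gap
        positions.append(previous)
    return positions


encoded = encode_positions([3, 130, 131, 70000])
print(len(encoded), decode_positions(encoded))  # 6 [3, 130, 131, 70000]
```

Storing gaps rather than absolute positions keeps most values small, so frequent words with closely spaced occurrences usually cost a single byte per position.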

Positional Inverted Self-index

Authors

Procházka, P.; Holub, J.

Year

2016

Published in

Proceedings of Data Compression Conference 2016. New York: IEEE Computer Society Press, 2016. p. 627. ISSN 1068-0314. ISBN 978-1-5090-1853-6.

Type

Conference paper

Department

Department of Theoretical Computer Science

Abstract

We address the problem of positional indexing in the natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. The inverted lists of single terms are combined into one inverted list that represents the backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stop words, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The trade-off between compression effectiveness and decompression speed can be tuned by different parameters of the index algorithm.

Towards Efficient Positional Inverted Index

Authors

Procházka, P.; Holub, J.

Year

2016

Published in

Proceedings of International Conference on Data Compression, Communication, Processing and Security 2016. Salerno: Universita di Salerno, 2016.

Type

Conference paper

Department

Department of Theoretical Computer Science

Abstract

We address the problem of positional indexing in the natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. Inverted lists of single terms are combined into one inverted list that represents the backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stop words, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The trade-off between compression effectiveness and decompression speed can be tuned by different parameters of the index algorithm. The space complexity of the PISI inverted list is $\mathcal{O}\big((N-n)\lceil \log_{2^b} N \rceil + (\lfloor \frac{N-n}{\alpha} \rfloor + n) \times (\lceil \log_{2^b} n \rceil + 1)\big)$, where $N$ is the number of stems, $n$ is the number of unique stems, $\alpha$ is the step/period of the back pointers in the inverted list and $b$ is the size of the word of computer memory given in bits. The space complexity of the presentation layer is $\mathcal{O}\big(-\sum_{i=1}^{N} \lceil \log_2 p_i^{n(i)} \rceil - \sum_{j=1}^{N'} \lceil \log_2 p'_j \rceil + N\big)$, with $p_i^{n(i)}$ as the probability of a stem variant at position $i$, $p'_j$ as the probability of a separator or stop word at position $j$, and $N'$ as the number of separators and stop words. Our experiments show that PISI is close to the standard positional inverted index in terms of search speed and, at the same time, is more efficient in memory consumption. PISI also proved to be significantly faster than its close competitor FWCSA [17] in terms of search speed at the same level of memory consumption.

Compression of a Set of Files with Natural Language Content

Authors

Procházka, P.; Holub, J.

Year

2015

Published in

Computer Journal. 2015, 58(5), 1169-1185. ISSN 0010-4620.

Type

Journal article

DOI

10.1093/comjnl/bxu046

Department

Department of Theoretical Computer Science

Abstract

An algorithm for very efficient compression of a set of natural language text files is presented. Not only is a very good compression ratio reached, but the compression method also allows fast pattern matching in the compressed text, which is an attractive property especially for search engines. Much information is stored in the form of large collections of text files. Web search engines can store web pages in raw text form to build so-called snippets or to evaluate so-called positional ranking functions on them. Furthermore, there exist many other similar contexts, such as the storage of emails, application logs or databases of text files (literary works or technical reports). In this paper, we address the problem of compressing a large collection of text files distributed in a cluster of computers, where single files need to be randomly accessed in a very short time. The compression algorithm is based on a word-based approach and on the combination of two statistical models: a global model (common to all the files of the set) and a local model. The latter is built as a set of changes that transform the global model into the proper model of the single compressed file.

Compressing Similar Biological Sequences using FM-index

Authors

Procházka, P.; Holub, J.

Year

2014

Published in

Proceedings of Data Compression Conference 2014. Los Alamitos: IEEE Computer Society, 2014. p. 312-321. ISSN 1068-0314. ISBN 978-1-4799-3882-7.

Type

Conference paper

DOI

10.1109/DCC.2014.47

Department

Department of Theoretical Computer Science

Abstract

Nowadays, decreasing cost and better accessibility of sequencing methods have enabled studies of genetic variation between individuals of the same species and also between two related species. This has led to a rapid increase in biological data consisting of sequences that are very similar to each other, these sequences usually being stored together in one database. We propose a compression method based on Wavelet Tree FM-index optimized for compression of a set of similar biological sequences. The compression method is based on tracking single changes (together with their context) between every single sequence and the chosen reference sequence. We call our compression method BIO-FMI. It gives very promising results in compression ratio and in locate time when performed on an extremely repetitive data set (less than 0.5 % mutations) and when the searched patterns are of smaller lengths (less than 20 bases). BIO-FMI is competitive in extraction speed and it seems to be superior in time needed to build the index, especially in the case when the alignments of single sequences are given in advance.

Natural Language Compression Optimized for Large Set of Files

Authors

Procházka, P.; Holub, J.

Year

2013

Published in

Proceedings of Data Compression Conference 2013. Los Alamitos, CA: IEEE Computer Soc., 2013. p. 514. ISSN 1068-0314. ISBN 978-0-7695-4965-1.

Type

Conference paper

Department

Department of Theoretical Computer Science

Abstract

The paper describes one of the typical scenarios of data compression for large collections of files. We address the problem of compressing a large collection of text files distributed in a cluster of computers, where single files need to be randomly accessed in a very short time.

ODC: Frame for definition of Dense codes

Authors

Procházka, P.; Holub, J.

Year

2013

Published in

European Journal of Combinatorics. 2013, 2013(1), 52-68. ISSN 0195-6698.

Type

Journal article

DOI

10.1016/j.ejc.2012.07.014

Department

Department of Theoretical Computer Science

Abstract

Natural language compression has made great progress in the last two decades. The main step in this evolution was the introduction of word-based compression by Moffat. Another improvement came with the so-called Dense codes, which proved to be very fast in compression and decompression while keeping a good compression ratio and direct search capability. Many variants of the Dense codes have been described, each of them using its own definition. In this paper, we present a generalized concept of dense coding called Open Dense Code (ODC), which aims to be a frame for the definition of many other dense code schemas. ODC underlines the common features of the dense code schemas while still allowing the differences between them to be expressed. Using the frame of ODC, we present two new word-based statistical compression algorithms based on the dense coding idea: Two Byte Dense Code (TBDC) and Self-Tuning Dense Code (STDC). Our algorithms improve the compression ratio and are considerate of smaller files, which are very often neglected.
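For readers unfamiliar with dense codes, the following is a minimal sketch of one well-known member of the family, the End-Tagged Dense Code (ETDC), in which words are ranked by frequency and the rank is written in bytes whose high bit marks the end of a codeword. It is not ODC, TBDC or STDC; the function names and the tiny example text are illustrative assumptions.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Encode a 0-based frequency rank as an End-Tagged Dense Code word."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + rank % 128]       # last byte carries the end tag (high bit)
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)     # earlier bytes have the high bit clear
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode_stream(data: bytes) -> list[int]:
    """Decode a concatenation of ETDC codewords back into ranks."""
    ranks, acc = [], 0
    for byte in data:
        if byte < 128:
            acc = (acc + byte + 1) * 128
        else:
            ranks.append(acc + byte - 128)
            acc = 0
    return ranks


def compress(words: list[str]) -> tuple[list[str], bytes]:
    """Semi-static word-based compression: rank by frequency, then encode."""
    freq = Counter(words)
    vocab = sorted(freq, key=lambda w: (-freq[w], w))
    rank = {w: i for i, w in enumerate(vocab)}
    return vocab, b"".join(etdc_encode(rank[w]) for w in words)


words = "to be or not to be".split()
vocab, data = compress(words)
print([vocab[r] for r in etdc_decode_stream(data)] == words)  # True
```

Because the end of every codeword is marked by a tagged byte, a compressed pattern can be searched for directly in the compressed stream with an ordinary byte-oriented string matcher, which is the direct search capability mentioned in the abstract.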

Block-oriented Dense Compressor

Authors

Procházka, P.; Holub, J.

Year

2011

Published in

Proceedings Data Compression Conference. Los Alamitos, CA: IEEE Computer Soc., 2011. p. 372. ISSN 1068-0314. ISBN 978-0-7695-4352-9.

Type

Conference paper

DOI

10.1109/DCC.2011.76

Department

Department of Theoretical Computer Science

Abstract

We address the problem of block-oriented natural language compression. Block-oriented compression is semi-adaptive with respect to a single block but adaptive with respect to the whole input. Our block-oriented compression method is based on the Dense Code idea. It achieves a very good compression ratio of around 32 % on natural language text. We show that our compression method has some interesting properties which could be applied in digital libraries and other textual databases. The compression method allows direct searching on the compressed text. Moreover, the vocabulary can be used as a block index, which makes some kinds of searching very fast. Another property is that the compressor can send single blocks with the corresponding vocabulary, which is considerate of limited bandwidth. In addition, the compressed file can be continuously extended without the need for previous decompression.

Natural Language Compression per Blocks

Authors

Procházka, P.; Holub, J.

Year

2011

Published in

First International Conference on Data Compression, Communications and Processing. Los Alamitos: IEEE Computer Society, 2011. pp. 67-75. ISBN 978-0-7695-4528-8.

Type

Conference paper

DOI

10.1109/CCP.2011.25

Department

Department of Theoretical Computer Science

Abstract

We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per blocks. This means that the input is divided into several blocks and each block is compressed separately according to its own statistical model. To avoid redundancy, the final vocabulary file is composed as the sequence of changes between the models of two consecutive blocks. STBDC belongs to the family of Dense codes and keeps all their attractive properties, including very high compression and decompression speed and an acceptable compression ratio of around 32 % on natural language text. Moreover, STBDC provides other properties applicable in digital libraries and other textual databases. The compression method allows direct searching on the compressed text, and the vocabulary can be used as a block index. STBDC is very easy on limited bandwidth in the client/server architecture.

New Word-based Adaptive Dense Compressors

Authors

Procházka, P.; Holub, J.

Year

2009

Published in

Combinatorial Algorithms. Berlin: Springer, 2009. pp. 420-431. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-642-10216-5.

Type

Conference paper

DOI

10.1007/978-3-642-10217-2_41

Department

Department of Theoretical Computer Science

Abstract

In the last two decades, natural language compression has made great progress. The main step in this evolution was the introduction of word-based compression by Moffat. Word-based statistical compression algorithms are able to achieve a 35% improvement in the compression ratio in comparison with character-based ones. We present two new word-based statistical compression algorithms based on the dense coding idea: Two Byte Dense Code (TBDC) and Self-Tuning Dense Code (STDC). TBDC uses codewords with a maximal size of 2 bytes and must be implemented with some pruning technique. STDC is able to tune its code space during the compression process and so achieve better compression. Our algorithms improve the compression ratio and are considerate of smaller files, which are very often neglected. We also present a generalized concept of dense coding called Open Dense Code (ODC), which provides a frame for the definition of these two and many other dense code schemas.
