The Prague Stringology Club (PSC)

Backward Pattern Matching on Elastic-Degenerate Strings

Authors

Procházka, P.; Cvacho, O.; Krčál, L.; Holub, J.

Year

2023

Published

SN Computer Science. 2023, 4(5), ISSN 2662-995X.

Type

Article

DOI

10.1007/s42979-023-01760-x

Departments

Department of Theoretical Computer Science

Annotation

Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms were presented so far. Some of them provide practical implementation. We propose a new on-line EDSM algorithm BNDM-EDS. Our algorithm combines two traditional algorithms BNDM and the Shift-And that were adapted to the specifics needed by Elastic Degenerate Strings. BNDM-EDS is running in O (Nmd w m e) worst-case time. This implies O (Nm) time for small patterns, where m is the length of the searched pattern, N is the size of EDS, and w is the size of the computer word. The algorithm uses O (N + n) space, where n is the length of EDS. BNDM-EDS requires a simple preprocessing step with time and space O(m). Experimental results on real genomic data show superiority of BNDM-EDS over state-of-the-art algorithms.

Hierarchical Bitmap Indexing for Range Queries on Multidimensional Arrays

Authors

Krčál, L.; Ho, S.-S.; Holub, J.

Year

2022

Published

Database Systems for Advanced Applications. Springer, Cham, 2022. p. 509-525. Lecture Notes in Computer Science. vol. 13245. ISSN 0302-9743. ISBN 978-3-031-00122-2.

Type

Proceedings paper

DOI

10.1007/978-3-031-00123-9_40

Departments

Department of Theoretical Computer Science

Annotation

Bitmap indices are widely used in commercial databases for processing complex queries, due to their efficient use of bit-wise operations. Bitmap indices apply natively to relational and linear datasets, with distinct separation of the columns or attributes, but do not perform well on multidimensional array scientific data. We propose a new method for multidimensional array indexing that considers the spatial component of multidimensional arrays. The hierarchical indexing method is based on sparse n-dimensional trees for dimension partitioning, and bitmap indexing with adaptive binning for attribute partitioning. This indexing performs well on range queries involving both dimension and attribute constraints, as it prunes the search space early. Moreover, the indexing is easily extensible to membership queries. The indexing method was implemented on top of a state of the art bitmap indexing library Fastbit, using tables partitioned along any subset of the data dimensions. We show that the hierarchical bitmap index outperforms conventional bitmap indexing, where an auxiliary attribute is required for each dimension. Furthermore, the adaptive binning significantly reduces the amount of bins and therefore memory requirements.

PFP Compressed Suffix Trees

Authors

Boucher, C.; Cvacho, O.; Gagie, T.; Holub, J.; Manzini, G.; Navarro, G.; Rossi, M.

Year

2021

Published

Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX2021). Philadelphia: SIAM, 2021. p. 60-72. ISSN 2164-0300. ISBN 978-1-61197-647-2.

Type

Proceedings paper

DOI

10.1137/1.9781611976472.5

Departments

Department of Theoretical Computer Science

Annotation

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace bounded in terms of their combined size |PFP(S)|. In practice D and P are significantly smaller than S and computing BWT(S) from them is more efficient than computing it from S directly, at least when S is the concatenation of many genomes. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to sup- port full suffix tree functionality, still built and fitting within O(|PFP(S)|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in S; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.

Data Structures to Represent a Set of k-long DNA Sequences

Authors

Chikhi, R.; Holub, J.; Medvedev, P.

Year

2021

Published

ACM Computing Surveys. 2021, 54(1), 17:1-17:22. ISSN 0360-0300.

Type

Article

DOI

10.1145/3445967

Departments

Department of Theoretical Computer Science

Annotation

The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.

Backward Pattern Matching on Elastic Degenerate Strings

Authors

Procházka, P.; Cvacho, O.; Krčál, L.; Holub, J.

Year

2021

Published

Proceedings of 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021). Lisboa: SCITEPRESS – Science and Technology Publications, Lda, 2021. p. 50-59. vol. 3. ISSN 2184-4305. ISBN 978-989-758-490-9.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms were presented so far. Some of them provide practical implementation. We propose a new on-line EDSM algorithm BNDM-EDS. Our algorithm combines two traditional algorithms BNDM and the Shift-And that were adapted to the specifics needed by Elastic Degenerate Strings. BNDM-EDS is running in O (Nmd m w e) worst-case time. This implies O (Nm) time for small patterns, where m is the length of the searched pattern, N is the size of EDS, and w is the size of the computer word. The algorithm uses O (N + n) space, where n is the length of EDS. BNDM-EDS requires a simple preprocessing step with time and space O (m). Experimental results on real genomic data show superiority of BNDM-EDS over state-of-the-art algorithms.

On-line Searching in IUPAC nucleotide sequences

Authors

Procházka, P.; Holub, J.

Year

2019

Published

Proceedings of 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019). Lisboa: SCITEPRESS – Science and Technology Publications, Lda, 2019. p. 66-77. vol. 3. ISSN 2184-4305. ISBN 978-989-758-353-7.

Type

Proceedings paper

DOI

10.5220/0007382900660077

Departments

Department of Theoretical Computer Science

Annotation

We propose a novel pattern matching algorithm for consensus nucleotide sequences over IUPAC alphabet, called BADPM (Byte-Aligned Degenerate Pattern Matching). The consensus nucleotide sequences represent a consensus obtained by sequencing a population of the same species and they are considered as so-called degenerate strings. BADPM works at the level of single bytes and it achieves sublinear search time on average. The algorithm is based on tabulating all possible factors of the searched pattern. It needs O (m + mα2 log m)-space data structure and O(mα2 ) time for preprocessing where m is a length of the pattern and α represents a maximum number of variants implied from a 4-gram over IUPAC alphabet. The worst-case locate time is bounded by O (nm2α4 ) for BADPM where n is the length of the input text. However, the experiments performed on real genomic data proved the sublinear search time. BADPM can easily cooperate with the block q-gram inverted index and so achieve still better locate time. We implemented two other pattern matching algorithms for IUPAC nucleotide sequences as a baseline: Boyer-Moore-Horspool (BMH) and Parallel Naive Search (PNS). Especially PNS proves its efficiency insensitive to the length of the searched pattern m. BADPM proved its strong superiority for searching middle and long patterns.

SOPanG: online text searching over a pan-genome

Authors

Cislak, A.; Grabowski, S.; Holub, J.

Year

2018

Published

Bioinformatics. 2018, 34(24), 4290-4292. ISSN 1367-4803.

Type

Article

DOI

10.1093/bioinformatics/bty506

Departments

Department of Theoretical Computer Science

Annotation

Motivation: The many thousands of high-quality genomes available now-a-days imply a shift from single genome to pan-genomic analyses. A basic algorithmic building brick for such a scenario is online search over a collection of similar texts, a problem with surprisingly few solutions presented so far.

Filtering Invalid Off-Targets in CRISPR/Cas9 Design Tools

Authors

Cvacho, O.; Holub, J.

Year

2018

Published

Proceedings of Data Compression Conference 2018. New York: IEEE Computer Society Press, 2018. p. 403. ISSN 2375-0359. ISBN 978-1-5386-4883-4.

Type

Proceedings paper

DOI

10.1109/DCC.2018.00056

Departments

Department of Theoretical Computer Science

Annotation

The paper deals with an approximate string matching in CRISPR/Cas9 design tools. The approximate string matching is transformed into sequence of exact string matching tasks processed by fast succinct (compressed) self-index. The number of the tasks is rapidly decreased by filtering technique based on de Bruijn graph.

Reconstructing a String from its Lyndon Arrays

Authors

Daykin, J.W.; Franek, F; Holub, J.; Sohidull Islam, A.S.M.; Smyth, W.F.

Year

2018

Published

Theoretical Computer Science. 2018, 710 44-51. ISSN 0304-3975.

Type

Article

DOI

10.1016/j.tcs.2017.04.008

Departments

Department of Theoretical Computer Science

Annotation

Given a string x = x[1.. n] on an ordered alphabet of size σ , the Lyndon array λ = λx [1..n] of x is an array of positive integers such that λ[i], 1 ≤ i ≤ n, is the length of the maximal Lyndon word over the ordering of that begins at position i in x. The Lyndon array has recently attracted considerable attention due to its pivotal role in establishing the long-standing conjecture that ρ (n ) < n, where ρ ( n) is the maximum number of maximal periodicities (runs) in any string of length n. Here we first describe two linear-time algorithms that, given a valid Lyndon array λ, compute a corresponding string — one for an alphabet of size n, the other for a smaller alphabet. We go on to describe another linear-time algorithm that determines whether or not a given integer array is a Lyndon array of some string. Finally we show how σ Lyndon arrays λ = {λ1 = λ, λ2 , . . . , λσ } corresponding to σ “rotations” of the alphabet can be used to determine uniquely the string x on such that λx = λ.

Towards Efficient Positional Inverted Index

Authors

Procházka, P.; Holub, J.

Year

2017

Published

Algorithms. 2017, 10(1), ISSN 1999-4893.

Type

Article

DOI

10.3390/a10010030

Departments

Department of Theoretical Computer Science

Annotation

We address the problem of positional indexing in the natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. Inverted lists of single terms are combined into one inverted list that represents the backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stop words, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The space complexity of the PISI inverted list is O((N−n)⌈log2bN⌉+(⌊N−nα⌋+n)×(⌈log2bn⌉+1)) where N is a number of stems, n is a number of unique stems, α is a step/period of the back pointers in the inverted list and b is the size of the word of computer memory given in bits. The space complexity of the presentation layer is O(−∑Ni=1⌈log2pn(i)i⌉−∑N′j=1⌈log2p′j⌉+N) with respect to pn(i)i as a probability of a stem variant at position i, p′j as the probability of separator or stop word at position j and N′ as the number of separators and stop words

Byte-Aligned Pattern Matching in Encoded Genomic Sequences

Authors

Procházka, P.; Holub, J.

Year

2017

Published

Proceedings of 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Saarbrücken: Dagstuhl Publishing,, 2017. p. 20:1-20:13. ISSN 1868-8969. ISBN 978-3-95977-050-7.

Type

Proceedings paper

DOI

10.4230/LIPIcs.WABI.2017.20

Departments

Department of Theoretical Computer Science

Annotation

In this article, we propose a novel pattern matching algorithm, called BAPM, that performs searching in the encoded genomic sequences. The algorithm works at the level of single bytes and it achieves sublinear performance on average. The preprocessing phase of the algorithm is linear with respect to the size of the searched pattern $m$. A simple $\mathcal{O}(m)$-space data structure is used to store all factors (with a defined length) of the searched pattern. These factors are later searched during the searching phase which ensures sublinear time on average. Our algorithm significantly overcomes the state-of-the-art pattern matching algorithms in the locate time on middle and long patterns. Furthermore, it is able to cooperate very easily with the block $q$-gram inverted index. The block $q$-gram inverted index together with our pattern matching algorithm achieve superior results in terms of locate time to the current index data structures for less frequent patterns. We present experimental results using real genomic data. These results prove efficiency of our algorithm.

Technology Beats Algorithms (in Exact String Matching)

Authors

Tarhio, J; Holub, J.; Giaquinta, E

Year

2017

Published

Software - Practice and Experience. 2017, 47(12), 1877-1885. ISSN 0038-0644.

Type

Article

DOI

10.1002/spe.2511

Departments

Department of Theoretical Computer Science

Annotation

More than 120 algorithms have been developed for exact string matching within the last 40 years. We show by experiments that the \naive{} algorithm exploiting SIMD instructions of modern CPUs (with symbols compared in a special order) is the fastest one for patterns of length up to about 50 symbols and extremely good for longer patterns and small alphabets. The algorithm compares 16 or 32 characters in parallel by applying SSE2 or AVX2 instructions, respectively. Moreover, it uses loop peeling to further speed up the searching phase. We tried several orders for comparisons of pattern symbols and the increasing order of their probabilities in the text was the best.

Computing All Approximate Enhanced Covers with the Hamming Distance

Authors

Guth, O.

Year

2016

Published

Proceedings of the Prague Stringology Conference 2016. Praha: Czech Technical University in Prague, 2016. p. 146-157. ISBN 978-80-01-05996-8.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

A border p of a string x is an enhanced cover of x if the number of positions of x that lie within some occurrence of p is the maximum among all borders of x. In this paper, more general notion based on the enhanced cover is introduced: a k-approximate enhanced cover, where fixed maximum number of errors k in the Hamming distance is considered. The k-approximate enhanced cover of x is its border and its k-approximate occurrences are also considered in the covered number of positions of x. An O(n^2)-time and a O(n)-space algorithm that computes all k-approximate enhanced covers of a string of length n is presented.

Approximate String Matching for Self-Indexes

Authors

Hrbek, L.; Holub, J.

Year

2016

Published

Proceedings of Data Compression Conference 2016. New York: IEEE Computer Society Press, 2016. pp. 604. ISSN 1068-0314. ISBN 978-1-5090-1853-6.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Self-index is a data structure that allows to find all exact occurrences of given pattern of length $m$ in $m$ steps. The paper presents an approach how to use self-index for approximate string matching with no more than $k$~differences (given by Hamming or Levenshtein distance). Our approach solution uses filtering based on the pigeonhole principle. For implementation we have selected FM-index. Our program is faster than BLAST for large texts (hundreds of MB) and more space efficient for shorter texts.

Positional Inverted Self-index

Authors

Procházka, P.; Holub, J.

Year

2016

Published

Proceedings of Data Compression Conference 2016. New York: IEEE Computer Society Press, 2016. pp. 627. ISSN 1068-0314. ISBN 978-1-5090-1853-6.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

We address the problem of positional indexing in natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. The inverted lists of single terms are combined into one inverted list that represents a backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with presentation layer that stores separators, stopwords, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The trade-off between compression effectiveness and the decompression speed can be tuned by different parameters of the index algorithm. ⌉ ×

Incremental Locality & Clustering-Based Compression

Authors

Krčál, L.; Holub, J.

Year

2015

Published

Proceedings of Data Compression Conference 2015. Los Alamitos: IEEE Computer Society Press, 2015. pp. 203-212. ISSN 1068-0314. ISBN 978-1-4799-8430-5.

Type

Proceedings paper

DOI

10.1109/DCC.2015.23

Departments

Department of Theoretical Computer Science

Annotation

Current compression solutions either use a limited size locality-based context or the entire input, to which the compressors adapt. This results in suboptimal compression effectiveness due to missing similarities further apart in the former case, or due to too generic adaptation. There are many deduplication and near deduplication systems that search for similarity across the entire input. Although most of these systems excel with their simplicity and speed, none of those goes deeper in terms of full-scale redundancy removal. We propose a novel compression and archival system called ICBCS. Our system goes beyond standard measures for similarity detection, using extended similarity hash and incremental clustering techniques to determine groups of sufficiently similar chunks designated for compression. ICBCS outperforms conventional file compression solutions on datasets consisting of at least mildly redundant files. It also shows that selective application of weak compressor results in better compression ratio and speed than conventional application of a strong compressor.

Compression of a Set of Files with Natural Language Content

Authors

Procházka, P.; Holub, J.

Year

2015

Published

Computer Journal. 2015, 58(5), 1169-1185. ISSN 0010-4620.

Type

Article

DOI

10.1093/comjnl/bxu046

Departments

Department of Theoretical Computer Science

Annotation

An algorithm for very efficient compression of a set of natural language text files is presented. Not only a very good compression ratio is reached, the used compression method allows fast pattern matching in compressed text, which is an attractive property especially for search engines. Much information is stored in the form of a large collection of text files. The web search engines can store the web pages in the raw text form to build so-called snippets or to perform so-called positional ranking functions on them. Furthermore, there exist many other similar contexts such as the storage of emails, application logs or the databases of text files (literary works or technical reports). In this paper, we address the problem of the compression of a large collection of text files distributed in cluster of computers, where the single files need to be randomly accessed in very short time. The compression algorithm is based on a word-based approach and the idea of combination of two statistical models: global model (common for all the files of the set) and local model. The latter is built as a set of changes that transform the global model to the proper model of the single compressed file.

Compressing Similar Biological Sequences using FM-index

Authors

Procházka, P.; Holub, J.

Year

2014

Published

Proceedings of Data Compression Conference 2014. Los Alamitos: IEEE Computer Society, 2014. p. 312-321. ISSN 1068-0314. ISBN 978-1-4799-3882-7.

Type

Proceedings paper

DOI

10.1109/DCC.2014.47

Departments

Department of Theoretical Computer Science

Annotation

Nowadays, decreasing cost and better accessibility of sequencing methods have enabled studies of genetic variation between individuals of the same species and also between two related species. This has led to a rapid increase in biological data consisting of sequences that are very similar to each other, these sequences usually being stored together in one database. We propose a compression method based on Wavelet Tree FM-index optimized for compression of a set of similar biological sequences. The compression method is based on tracking single changes (together with their context) between every single sequence and the chosen reference sequence. We call our compression method BIO-FMI. It gives very promising results in compression ratio and in locate time when performed on an extremely repetitive data set (less than 0.5 % mutations) and when the searched patterns are of smaller lengths (less than 20 bases). BIO-FMI is competitive in extraction speed and it seems to be superior in time needed to build the index, especially in the case when the alignments of single sequences are given in advance.

Suffix Tree of Alignment: An Efficient Index for Similar Data

Authors

Na, J. C.; Park, H.; Crochemore, M.; Holub, J.; Iliopoulos, C. S.; Mouchard, L.; Park, K.

Year

2013

Published

Proceedings of the 24th Workshop on Combinatorial Algorithms (IWOCA 2013). Berlin: Springer, 2013. pp. 337-348. ISSN 0302-9743. ISBN 978-3-642-45277-2.

Type

Proceedings paper

DOI

10.1007/978-3-642-45278-9_29

Departments

Department of Theoretical Computer Science

Annotation

We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings A and B is a compacted trie representing all suffixes in A and B. It has |A|+|B| leaves and can be constructed in O(|A|+|B|) time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of A and B. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of A and B has |A|+ld +l1 leaves where ld is the sum of the lengths of all parts of B different from A and l1 is the sum of the lengths of some common parts of A and B. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern P in O(|P | + occ) time where occ is the number of occurrences of P in A and B. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires O(|A| + ld + l1 + l2 ) time where l2 is the sum of the lengths of other common substrings of A and B. When the suffix tree of A is already given, it requires O(ld + l1 + l2 ) time.

Natural Language Compression Optimized for Large Set of Files

Authors

Procházka, P.; Holub, J.

Year

2013

Published

Proceedings of Data Compression Conference 2013. Los Alamitos, CA: IEEE Computer Soc., 2013. p. 514. ISSN 1068-0314. ISBN 978-0-7695-4965-1.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

The paper describes one of the typical scenarios of data compression for large collections of files. We address the problem of the compression of a large collection of text files distributed in cluster of computers, where the single files need to be randomly accessed in very short time.

ODC: Frame for definition of Dense codes

Authors

Procházka, P.; Holub, J.

Year

2013

Published

European Journal of Combinatorics. 2013, 2013(1), 52-68. ISSN 0195-6698.

Type

Article

DOI

10.1016/j.ejc.2012.07.014

Departments

Department of Theoretical Computer Science

Annotation

The natural language compression made a great progress in last two decades. The main step in this evolution was the introduction of word-based compression by Moffat. Another improvement came with so called Dense codes which proved to be very fast in compression and decompression while keeping good compression ratio and direct search capability. Many variants of the Dense codes were described, each of them using its own definition. In this paper we present generalized concept of dense coding called Open Dense Code (ODC) which aims to be a frame for definition of many other dense code schemas. ODC underlines common features of the dense code schemas but at the same time allows to express the divergences of each of them. Using the frame of ODC we present two new word-based statistical compression algorithms based on dense coding idea: Two Byte Dense Code (TBDC) and Self-Tuning Dense Code (STDC). Our algorithms improve the compression ratio and are considerate to smaller files which are very often omitted.

On left and right seeds of a string

Authors

Christou, M.; Crochemore, M.; Guth, O.; Iliopoulos, C.S.; Pissis, S.P.

Year

2012

Published

Journal of Discrete Algorithms. 2012, 17 31-44. ISSN 1570-8667.

Type

Article

DOI

10.1016/j.jda.2012.10.004

Departments

Department of Theoretical Computer Science

Annotation

We consider the problem of finding the repetitive structure of a given string y of length n. A factor u of y is a cover of y, if every letter of y lies within some occurrence of u in y. A string v is a seed of y, if it is a cover of a superstring of y. A left seed of y is a prefix of y, that is a cover of a superstring of y. Similarly, a right seed of y is a suffix of y, that is a cover of a superstring of y. An integer array LS is the minimal left-seed (resp. maximal left-seed) array of y, if LS[i] is the minimal (resp. maximal) length of left seeds of y[0..i]. The minimal right-seed (resp. maximal right-seed) arrayRS of y is defined in a similar fashion. In this article, we present linear-time algorithms for computing all left and right seeds of y, a linear-time algorithm for computing the minimal left-seed array of y, a linear-time solution for computing the maximal left-seed array of y, an O(n log n)-time algorithm for computing the minimal right-seed array of y, and a linear-time solution for computing the maximal right-seed array of y. All algorithms use linear auxiliary space.

The Finite Automata Approaches in Stringology

Authors

Holub, J.

Year

2012

Published

Kybernetika. 2012, 48(3), 386-401. ISSN 0023-5954.

Type

Article

Departments

Department of Theoretical Computer Science

Annotation

We present an overview of four approaches of the finite automata use in stringology: deterministic finite automaton, deterministic simulation of nondeterministic finite automaton, finite automaton as a model of computation, and compositions of finite automata solutions. We also show how the finite automata can process strings build over more complex alphabet than just single symbols (degenerate symbols, strings, variables).

Tree-Based 2D Indexing

Authors

Žďárek, J.; Melichar, B.

Year

2011

Published

International Journal of Foundations of Computer Science. 2011, 22(8), 1893-1907. ISSN 0129-0541.

Type

Article

DOI

10.1142/S0129054111009100

Departments

Department of Theoretical Computer Science

Annotation

A new approach to the 2D pattern matching and specifically to 2D text indexing is proposed. A transformation of a 2D text into the form of a tree is presented. It preserves the context of each element of the 2D text. The tree can be linearised using the prefix notation into the form of a string (a linear text) and the pattern matching is performed in this text. Pushdown automata indexing the 2D text are constructed over the tree representation. They allow to search for 2D prefixes, 2D suffixes, and 2D factors of the 2D text in time proportional to the size of the representation of a 2D pattern. This result achieves the properties analogous to the results obtained in tree pattern matching and string indexing.

Finite Automata for Generalized Approach to Backward Pattern Matching

Authors

Antoš, A.J.; Melichar, B.

Year

2011

Published

Proceedings of the 15th International Conference on Implementation and Application of Automata. Heidelberg: Springer, 2011. pp. 49-58. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-642-18097-2.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

We generalized the DAWG backward pattern matching approach to be able to solve a broad range of pattern matching problems. We use a definition of a class of problems. We describe a finite automaton for the basic pattern matching problem of finding an exact occurrence of one string in a text. We propose a mechanism to use simple operations over finite automata in a systematic approach to derive automata for solving problems from a defined class, such as approximate matching, regular expression matching, sequence matching, matching of set of patterns, etc. and their combinations. The benefit of this approach is the ability to quickly derive solutions for newly formulated pattern matching problems.

On the Right-Seed Array of a String

Authors

Christou, M.; Crochemore, M.; Guth, O.; Iliopoulos, CS; Pissis, SP

Year

2011

Published

Proceedings of the 17th Annual International Computing and Combinatorics Conference (COCOON 2011). Berlin: Springer-Verlag, 2011. p. 492-502. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-642-22684-7.

Type

Proceedings paper

DOI

10.1007/978-3-642-22685-4_43

Departments

Department of Theoretical Computer Science

Annotation

We consider the problem of finding the repetitive structure of a given fixed string y. A factor u of y is a cover of y, if every letter of y falls within some occurrence of u in y. A factor v of y is a seed of y, if it is a cover of a superstring of y. There exist linear-time algorithms for solving the minimal cover problem. The minimal seed problem is of much higher algorithmic difficulty, and no linear-time algorithm is known. In this article, we solve one of its variants - computing the minimal and maximal right-seed array of a given string. A right seed of y is the shortest suffix of y that it is a cover of a superstring of y. An integer array RS is the minimal right-seed (resp. maximal right-seed) array of y, if RS [i] is the minimal (resp. maximal) length of right seeds of y[0..i]. We present an O(n log n) time algorithm that computes the minimal right-seed array of a given string, and a linear-time solution to compute the maximal right-seed array.

Approximate string-matching with a single gap for sequence alignment

Authors

Flouri, T.

Year

2011

Published

ACM BCB 2011: ACM Conference on Bioinformatics, Computational Biology and Biomedicine. New York: ACM, 2011. pp. 490-492. ISBN 978-1-4503-0796-3.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

This paper deals with the approximate string-matching problem with Hamming distance and a single gap for sequence alignment. We consider an extension of the approximate string-matching problem with Hamming distance, by also allowing the existence of a single gap, either in the text, or in the pattern. This problem is strongly and directly motivated by the next-generation re-sequencing procedure. We present a general algorithm that requires O(nm) time, where n is the length of the text and m is the length of the pattern, but this can be reduced to O(mβ) time, if the maximum length β of the gap is given.

DynMap : mapping short reads to multiple related genomes

Authors

Flouri, T.

Year

2011

Published

ACM BCB 2011: ACM Conference on Bioinformatics, Computational Biology and Biomedicine. New York: ACM, 2011. pp. 330-334. ISBN 978-1-4503-0796-3.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

The constant advances in sequencing technology have redefined the way genome sequencing is performed. They are able to produce millions of short sequences (reads) during a single experiment, and with a much lower cost than previously possible. Due to the dramatic increase in the amount of data generated, efficient algorithms for aligning (mapping) these reads to reference genomes are in great demand, and recently, there has been ample work for publishing such algorithms. In this paper, we study a different version of this problem; mapping these reads to multiple related genomes (e.g. individuals of the same species). We present DynMap, a new practical algorithm, which employs a suitable data structure that takes into account potential inherent genomic variability (replacements, insertions, deletions) between related genomes. Therefore, if a small number of differences occurs within a reference sequence, the already mapped reads can be altered dynamically. The presented experimental results demonstrate that DynMap can match or even outperform the most popular tools in terms of sensitivity, accuracy, and speed.

Mapping Short Reads to a Genomic Sequence with Circular Structure

Authors

Flouri, T.; Iliopoulos, C.S.; Pissis, S.P.; Tischler, G.

Year

2012

Published

International Journal of Systems Biology and Biomedical Technologies. 2012, 2012(1), 26-34. ISSN 2160-9586.

Type

Article

DOI

10.4018/ijsbbt.2012010103

Departments

Department of Theoretical Computer Science

Annotation

Constant advances in DNA sequencing technologies are turning whole-genome sequencing into a routine procedure, resulting in massive amounts of data that need to be processed. Tens of gigabytes of data, in the form of short sequences (reads), need to be mapped back onto reference sequences, a few gigabases long. A first generation of short-read alignment algorithms successfully employed hash tables, and the current second generation uses the Burrows-Wheeler transform, further improving speed and memory footprint. These next-generation sequencing technologies allow researchers to characterise a bacterial genome, during a single experiment, at a moderate cost. In this article, as most of the bacterial chromosomes contain a circular DNA molecule, the authors present a new simple, yet efficient, sensitive and accurate algorithm, specifically designed for mapping millions of short reads to a genomic sequence with circular structure.

Finite Automata in Pattern Matching

Authors

Holub, J.

Year

2011

Published

Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications. New York: J. Willey, 2011. p. 51-71. ISBN 978-0-470-50519-9.

Type

Book chapter

Departments

Department of Theoretical Computer Science

Annotation

Stringology is a part of Computer Science dealing with processing strings and sequences. It finds many important applications in various fields used by Information Society. Biological sequences processing is one of the most important fields. On the other hand the finite automata theory is a well developed formal system used for a long time in the area of compiler construction. The chapter aims to show various approaches of the finite automata use in Stringology. The approaches are demonstrated on practical examples. Of course it is impossible to describe all the approaches as it would be out of scope of the chapter. However, we would like to present basic approaches that the reader can modify and combine to a given task.

Lossless Data Compression Testbed: ExCom and Prague Corpus

Authors

Holub, J.; Řezníček, Jakub

Year

2011

Published

Proceedings Data Compression Conference. Los Alamitos, CA: IEEE Computer Soc., 2011. pp. 457. ISSN 1068-0314. ISBN 978-0-7695-4352-9.

Type

Proceedings paper

DOI

10.1109/DCC.2011.61

Departments

Department of Theoretical Computer Science

Annotation

Authors developing new algorithms have to compare their implementations with the implementations of the other algorithms. They can do a testing on standard corpora and compare the achieved compression ratio with published compression ratios made on the same corpus. The corpora are static collections of files so they are getting older and do not contain new file types that appear. Moreover one can compare only the compression ratio. The times of compression and decompression are very important as well. In order to compare the times one have to run the implementation on the same platform and the same hardware. So one have to ask the authors for the source codes and incorporate them into his system. Then the tests could be run on the same machine.

Block-oriented Dense Compressor

Authors

Procházka, P.; Holub, J.

Year

2011

Published

Proceedings Data Compression Conference. Los Alamitos, CA: IEEE Computer Soc., 2011. pp. 372. ISSN 1068-0314. ISBN 978-0-7695-4352-9.

Type

Proceedings paper

DOI

10.1109/DCC.2011.76

Departments

Department of Theoretical Computer Science

Annotation

We address the problem of block-oriented natural language compression. The block-oriented compression is semi-adaptive in terms of one block but it is adaptive in terms of whole input. Our block-oriented compression method is based on Dense Code idea. It achieves very good compression ratio around 32 % on natural language text. We show that our compression method has some interesting properties which could be applied in digital libraries and other textual databases. The compression method allows direct searching on compressed text. Moreover the vocabulary can be used as a block index which makes some kinds of searching very fast. Another property is that the compressor can send single blocks with corresponding vocabulary which is considerate to limited bandwidth. In addition the compressed file can be continuously extended without need of previous decompression.

Natural Language Compression per Blocks

Authors

Procházka, P.; Holub, J.

Year

2011

Published

First International Conference on Data Compression, Communications and Processing. Los Alamitos: IEEE Computer Society, 2011. pp. 67-75. ISBN 978-0-7695-4528-8.

Type

Proceedings paper

DOI

10.1109/CCP.2011.25

Departments

Department of Theoretical Computer Science

Annotation

We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per blocks. It means that the input is divided into the several blocks and each of the blocks is compressed separately according to its own statistical model. To avoid the redundancy the final vocabulary file is composed is the sequence of the changes in the model of the two consecutive blocks. STBDC belongs to the family of Dense codes and keeps all their attractive properties including very high compression and decompression speed and acceptable compression ratio around 32 % on natural language text. Moreover STBDC provides other properties applicable in digital libraries and other textual databases. The compression method allows direct searching on the compressed text, whereas the vocabulary can be used as a block index. STBDC is very easy on limited bandwidth in the client/server architecture.

An algorithm for mapping short reads to a dynamically changing genomic sequence

Authors

Flouri, T.; Holub, J.; Iliopoulos, C.S.I.; Pissis, S.P.P.

Year

2010

Published

2010 IEEE International Conference on Bioinformatics and Biomedicine. Los Alamitos: IEEE Computer Society, 2010. p. 133-136. ISBN 978-1-4244-8305-1.

Type

Proceedings paper

DOI

10.1109/BIBM.2010.5706551

Departments

Department of Theoretical Computer Science

Annotation

The constant advances in sequencing technology have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads), during a single experiment, and with a much lower cost than previously possible. Due to this massive amount of data, efficient algorithms for mapping these reads to reference sequences are in great demand, and recently, there has been ample work for publishing such algorithms. In this paper, we study a different version of this problem: mapping these reads to a dynamically changing genomic sequence. We propose a new practical algorithm, which employs a suitable data structure that takes into account potential dynamic effects (replacements, insertions, deletions) on the genomic sequence. The presented experimental results demonstrate that the proposed approach can be applied to address the problem of mapping millions of reads to multiple genomic sequences.

Improving Practical Exact String Matching

Authors

Ďurian, B.; Holub, J.; Peltola, H.; Tarhio, J.

Year

2010

Published

Information Processing Letters. 2010, 110(4), 148-152. ISSN 0020-0190.

Type

Article

DOI

10.1016/j.ipl.2009.11.010

Departments

Department of Theoretical Computer Science

Annotation

We present improved variations of the BNDM algorithm for exact string matching. At each alignment the algorithms read a $q$-gram before testing the state variable. In addition we apply reading a 2-gram in one instruction.

Using Finite Automata Approach for Searching Approximate Seeds of Strings

Authors

Guth, O.; Melichar, B.

Year

2010

Published

Intelligent Automation and Computer Engineering. New York: Springer, 2010. p. 347-360. ISSN 1876-1100. ISBN 978-90-481-3516-5.

Type

Book chapter

DOI

10.1007/978-90-481-3517-2_27

Departments

Department of Theoretical Computer Science

Annotation

Seed is a type of a regularity of strings. A restricted approximate seed w of string T is a factor of T such that w covers a superstring of T under some distance rule. In this paper, the problem of searching of all restricted seeds with the smallest Hamming distance is studied and a polynomial time and space algorithm for solving the problem is presented. It searches for all restricted approximate seeds of a string with given limited approximation using Hamming distance and it computes the smallest distance for each found seed. The solution is based on a finite (suffix) automata approach that provides a straightforward way to design algorithms to many problems in stringology. Therefore, it is shown that the set of problems solvable using finite automata includes the one studied in this paper.

A Note on a Tree-based 2D Indexing

Authors

Žďárek, J.; Melichar, B.

Year

2010

Published

Pre-proceedings of Conference on Implementation and Application of Automata 2010. Manitoba: University of Manitoba, 2010. pp. 269-277.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

A transformation of 2D structures into the form of a tree, their linearisation and constructions of pushdown automata indexing this representation of a 2D text are presented.

A Different Proof of the Crochemore-Ilie Lemma Concerning Microruns

Authors

Franek, F.; Holub, J.

Year

2009

Published

London Algorithmics 2008: Theory and Practice. London: College Publications, 2009. pp. 1-9. ISBN 978-1-904987-97-0.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

In a resent paper, Crochemore and Ilie improved the upper bound of the maxrun function ro(n)=max{r(x) : |x| = n}, where r(x) denotes the number of runs in a string x, to 1.6n (computational results on Ilie's web page claim the improvement to 1.048n.) Their proof is essentially in two major steps: first step shows that there is at most 1center of a d-run on average in each interval of length d, in the second step they showed that ro9(n) <= n, where ro9(n) is the count limited to runs of size <= 9. The proof of ro9(n) <= n they presented relies on computation. We present an alternative simpler proof, also computational, to provide an independent verification of Crochemore-Ilie's result. The method used is completely different and has a potential to be verified without an aid of computer.

On Parallel Implementations of Deterministic Finite Automata

Authors

Holub, J.; Štekr, S.

Year

2009

Published

Implementation and Application of Automata. Heidelberg: Springer, 2009. p. 54-64. ISSN 0302-9743. ISBN 978-3-642-02978-3.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

We present implementations of parallel DFA run methods and find whether and under what conditions is worthy to use the parallel methods of simulation of run of finite automata. First, we introduce the parallel DFA run methods for general DFA, which are universal, but due to the dependency of simulation time on the number of states vertical bar Q vertical bar of automaton being run, they are suitable only for run of automata with the smaller number of states. Then we show that if we apply some restrictions to properties of automata being run, we can reach the linear speedup compared to the sequential simulation method. We designed methods benefiting from k-locality that allows optimum parallel run of exact and approximate pattern matching automata. Finally, we show the results of experiments conducted on two types of parallel computers (Cluster of workstations and Symmetric shared-memory multiprocessors).

New Word-based Adaptive Dense Compressors

Authors

Procházka, P.; Holub, J.

Year

2009

Published

Combinatorial Algorithms. Berlin: Springer, 2009. pp. 420-431. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-642-10216-5.

Type

Proceedings paper

DOI

10.1007/978-3-642-10217-2_41

Departments

Department of Theoretical Computer Science

Annotation

In the last two decades the natural language compression made a great progress. The main step in this evolution was the introduction of word-based compression by Mofiat. The word-based statistical compression algorithms are able to achieve 35% improvement in the compression ratio in comparison with character-based ones. We present two new word-based statistical compression algorithms based on dense coding idea: Two Byte Dense Code (TBDC) and Self-Tuning Dense Code (SCDC). TBDC uses the codewords with maximal size 2 bytes and must be implemented with some pruning technique. STDC is able to tune its code space during the compression process and so achieve better compression. Our algorithms improve the compression ratio and are considerate to smaller ffles which are very often omitted. We present also a generalized concept of dense coding called Open Dense Code (ODC) which provides a frame for deffnition of these two and many other dense code schemas.