prof. Ing. Jan Holub, Ph.D.

Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests

Authors

Bohuslavová, D.; Ahmed, O.; Gagie, T.; Holub, J.; Langmead, B.; Manzini, G.; Navarro, G.

Year

2024

Published

Symposium on Experimental Algorithms. Dagstuhl: Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2024. p. 10:1-10:13. vol. 301. ISSN 1868-8969.

Type

Proceedings paper

DOI

10.4230/LIPIcs.SEA.2024.10

Departments

Department of Theoretical Computer Science

Annotation

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM’s occurrences in those genomes; find the minimum and maximum values stored in that interval; take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes’ concatenation: a KATKA kernel, which discards characters that are not in the first or last occurrence of any kmax-tuple, for a parameter kmax; a minimizer digest; a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads’ longest MEMs occurred only in the sequences from which those reads were generated (“true positive” reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.

Backward Pattern Matching on Elastic-Degenerate Strings

Authors

Procházka, P.; Cvacho, O.; Krčál, L.; Holub, J.

Year

2023

Published

SN Computer Science. 2023, 4(5), ISSN 2662-995X.

Type

Article

DOI

10.1007/s42979-023-01760-x

Departments

Department of Theoretical Computer Science

Annotation

Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms were presented so far. Some of them provide practical implementation. We propose a new on-line EDSM algorithm BNDM-EDS. Our algorithm combines two traditional algorithms BNDM and the Shift-And that were adapted to the specifics needed by Elastic Degenerate Strings. BNDM-EDS is running in O (Nmd w m e) worst-case time. This implies O (Nm) time for small patterns, where m is the length of the searched pattern, N is the size of EDS, and w is the size of the computer word. The algorithm uses O (N + n) space, where n is the length of EDS. BNDM-EDS requires a simple preprocessing step with time and space O(m). Experimental results on real genomic data show superiority of BNDM-EDS over state-of-the-art algorithms.

Hierarchical Bitmap Indexing for Range Queries on Multidimensional Arrays

Authors

Krčál, L.; Ho, S.-S.; Holub, J.

Year

2022

Published

Database Systems for Advanced Applications. Springer, Cham, 2022. p. 509-525. Lecture Notes in Computer Science. vol. 13245. ISSN 0302-9743. ISBN 978-3-031-00122-2.

Type

Proceedings paper

DOI

10.1007/978-3-031-00123-9_40

Departments

Department of Theoretical Computer Science

Annotation

Bitmap indices are widely used in commercial databases for processing complex queries, due to their efficient use of bit-wise operations. Bitmap indices apply natively to relational and linear datasets, with distinct separation of the columns or attributes, but do not perform well on multidimensional array scientific data. We propose a new method for multidimensional array indexing that considers the spatial component of multidimensional arrays. The hierarchical indexing method is based on sparse n-dimensional trees for dimension partitioning, and bitmap indexing with adaptive binning for attribute partitioning. This indexing performs well on range queries involving both dimension and attribute constraints, as it prunes the search space early. Moreover, the indexing is easily extensible to membership queries. The indexing method was implemented on top of a state of the art bitmap indexing library Fastbit, using tables partitioned along any subset of the data dimensions. We show that the hierarchical bitmap index outperforms conventional bitmap indexing, where an auxiliary attribute is required for each dimension. Furthermore, the adaptive binning significantly reduces the amount of bins and therefore memory requirements.

Backward Pattern Matching on Elastic Degenerate Strings

Authors

Procházka, P.; Cvacho, O.; Krčál, L.; Holub, J.

Year

2021

Published

Proceedings of 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021). Lisboa: SCITEPRESS – Science and Technology Publications, Lda, 2021. p. 50-59. vol. 3. ISSN 2184-4305. ISBN 978-989-758-490-9.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms were presented so far. Some of them provide practical implementation. We propose a new on-line EDSM algorithm BNDM-EDS. Our algorithm combines two traditional algorithms BNDM and the Shift-And that were adapted to the specifics needed by Elastic Degenerate Strings. BNDM-EDS is running in O (Nmd m w e) worst-case time. This implies O (Nm) time for small patterns, where m is the length of the searched pattern, N is the size of EDS, and w is the size of the computer word. The algorithm uses O (N + n) space, where n is the length of EDS. BNDM-EDS requires a simple preprocessing step with time and space O (m). Experimental results on real genomic data show superiority of BNDM-EDS over state-of-the-art algorithms.

Data Structures to Represent a Set of k-long DNA Sequences

Authors

Chikhi, R.; Holub, J.; Medvedev, P.

Year

2021

Published

ACM Computing Surveys. 2021, 54(1), 17:1-17:22. ISSN 0360-0300.

Type

Article

DOI

10.1145/3445967

Departments

Department of Theoretical Computer Science

Annotation

The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.

PFP Compressed Suffix Trees

Authors

Boucher, C.; Cvacho, O.; Gagie, T.; Holub, J.; Manzini, G.; Navarro, G.; Rossi, M.

Year

2021

Published

Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX2021). Philadelphia: SIAM, 2021. p. 60-72. ISSN 2164-0300. ISBN 978-1-61197-647-2.

Type

Proceedings paper

DOI

10.1137/1.9781611976472.5

Departments

Department of Theoretical Computer Science

Annotation

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace bounded in terms of their combined size |PFP(S)|. In practice D and P are significantly smaller than S and computing BWT(S) from them is more efficient than computing it from S directly, at least when S is the concatenation of many genomes. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to sup- port full suffix tree functionality, still built and fitting within O(|PFP(S)|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in S; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.

On-line Searching in IUPAC nucleotide sequences

Authors

Procházka, P.; Holub, J.

Year

2019

Published

Proceedings of 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019). Lisboa: SCITEPRESS – Science and Technology Publications, Lda, 2019. p. 66-77. vol. 3. ISSN 2184-4305. ISBN 978-989-758-353-7.

Type

Proceedings paper

DOI

10.5220/0007382900660077

Departments

Department of Theoretical Computer Science

Annotation

We propose a novel pattern matching algorithm for consensus nucleotide sequences over IUPAC alphabet, called BADPM (Byte-Aligned Degenerate Pattern Matching). The consensus nucleotide sequences represent a consensus obtained by sequencing a population of the same species and they are considered as so-called degenerate strings. BADPM works at the level of single bytes and it achieves sublinear search time on average. The algorithm is based on tabulating all possible factors of the searched pattern. It needs O (m + mα2 log m)-space data structure and O(mα2 ) time for preprocessing where m is a length of the pattern and α represents a maximum number of variants implied from a 4-gram over IUPAC alphabet. The worst-case locate time is bounded by O (nm2α4 ) for BADPM where n is the length of the input text. However, the experiments performed on real genomic data proved the sublinear search time. BADPM can easily cooperate with the block q-gram inverted index and so achieve still better locate time. We implemented two other pattern matching algorithms for IUPAC nucleotide sequences as a baseline: Boyer-Moore-Horspool (BMH) and Parallel Naive Search (PNS). Especially PNS proves its efficiency insensitive to the length of the searched pattern m. BADPM proved its strong superiority for searching middle and long patterns.

Filtering Invalid Off-Targets in CRISPR/Cas9 Design Tools

Authors

Cvacho, O.; Holub, J.

Year

2018

Published

Proceedings of Data Compression Conference 2018. New York: IEEE Computer Society Press, 2018. p. 403. ISSN 2375-0359. ISBN 978-1-5386-4883-4.

Type

Proceedings paper

DOI

10.1109/DCC.2018.00056

Departments

Department of Theoretical Computer Science

Annotation

The paper deals with an approximate string matching in CRISPR/Cas9 design tools. The approximate string matching is transformed into sequence of exact string matching tasks processed by fast succinct (compressed) self-index. The number of the tasks is rapidly decreased by filtering technique based on de Bruijn graph.

Reconstructing a String from its Lyndon Arrays

Authors

Daykin, J.W.; Franek, F; Holub, J.; Sohidull Islam, A.S.M.; Smyth, W.F.

Year

2018

Published

Theoretical Computer Science. 2018, 710 44-51. ISSN 0304-3975.

Type

Article

DOI

10.1016/j.tcs.2017.04.008

Departments

Department of Theoretical Computer Science

Annotation

Given a string x = x[1.. n] on an ordered alphabet of size σ , the Lyndon array λ = λx [1..n] of x is an array of positive integers such that λ[i], 1 ≤ i ≤ n, is the length of the maximal Lyndon word over the ordering of that begins at position i in x. The Lyndon array has recently attracted considerable attention due to its pivotal role in establishing the long-standing conjecture that ρ (n ) < n, where ρ ( n) is the maximum number of maximal periodicities (runs) in any string of length n. Here we first describe two linear-time algorithms that, given a valid Lyndon array λ, compute a corresponding string — one for an alphabet of size n, the other for a smaller alphabet. We go on to describe another linear-time algorithm that determines whether or not a given integer array is a Lyndon array of some string. Finally we show how σ Lyndon arrays λ = {λ1 = λ, λ2 , . . . , λσ } corresponding to σ “rotations” of the alphabet can be used to determine uniquely the string x on such that λx = λ.

SOPanG: online text searching over a pan-genome

Authors

Cislak, A.; Grabowski, S.; Holub, J.

Year

2018

Published

Bioinformatics. 2018, 34(24), 4290-4292. ISSN 1367-4803.

Type

Article

DOI

10.1093/bioinformatics/bty506

Departments

Department of Theoretical Computer Science

Annotation

Motivation: The many thousands of high-quality genomes available now-a-days imply a shift from single genome to pan-genomic analyses. A basic algorithmic building brick for such a scenario is online search over a collection of similar texts, a problem with surprisingly few solutions presented so far.

Byte-Aligned Pattern Matching in Encoded Genomic Sequences

Authors

Procházka, P.; Holub, J.

Year

2017

Published

Proceedings of 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Saarbrücken: Dagstuhl Publishing,, 2017. p. 20:1-20:13. ISSN 1868-8969. ISBN 978-3-95977-050-7.

Type

Proceedings paper

DOI

10.4230/LIPIcs.WABI.2017.20

Departments

Department of Theoretical Computer Science

Annotation

In this article, we propose a novel pattern matching algorithm, called BAPM, that performs searching in the encoded genomic sequences. The algorithm works at the level of single bytes and it achieves sublinear performance on average. The preprocessing phase of the algorithm is linear with respect to the size of the searched pattern $m$. A simple $\mathcal{O}(m)$-space data structure is used to store all factors (with a defined length) of the searched pattern. These factors are later searched during the searching phase which ensures sublinear time on average. Our algorithm significantly overcomes the state-of-the-art pattern matching algorithms in the locate time on middle and long patterns. Furthermore, it is able to cooperate very easily with the block $q$-gram inverted index. The block $q$-gram inverted index together with our pattern matching algorithm achieve superior results in terms of locate time to the current index data structures for less frequent patterns. We present experimental results using real genomic data. These results prove efficiency of our algorithm.

Technology Beats Algorithms (in Exact String Matching)

Authors

Tarhio, J; Holub, J.; Giaquinta, E

Year

2017

Published

Software - Practice and Experience. 2017, 47(12), 1877-1885. ISSN 0038-0644.

Type

Article

DOI

10.1002/spe.2511

Departments

Department of Theoretical Computer Science

Annotation

More than 120 algorithms have been developed for exact string matching within the last 40 years. We show by experiments that the \naive{} algorithm exploiting SIMD instructions of modern CPUs (with symbols compared in a special order) is the fastest one for patterns of length up to about 50 symbols and extremely good for longer patterns and small alphabets. The algorithm compares 16 or 32 characters in parallel by applying SSE2 or AVX2 instructions, respectively. Moreover, it uses loop peeling to further speed up the searching phase. We tried several orders for comparisons of pattern symbols and the increasing order of their probabilities in the text was the best.

Towards Efficient Positional Inverted Index

Authors

Procházka, P.; Holub, J.

Year

2017

Published

Algorithms. 2017, 10(1), ISSN 1999-4893.

Type

Article

DOI

10.3390/a10010030

Departments

Department of Theoretical Computer Science

Annotation

We address the problem of positional indexing in the natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. Inverted lists of single terms are combined into one inverted list that represents the backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stop words, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The space complexity of the PISI inverted list is O((N−n)⌈log2bN⌉+(⌊N−nα⌋+n)×(⌈log2bn⌉+1)) where N is a number of stems, n is a number of unique stems, α is a step/period of the back pointers in the inverted list and b is the size of the word of computer memory given in bits. The space complexity of the presentation layer is O(−∑Ni=1⌈log2pn(i)i⌉−∑N′j=1⌈log2p′j⌉+N) with respect to pn(i)i as a probability of a stem variant at position i, p′j as the probability of separator or stop word at position j and N′ as the number of separators and stop words

Approximate String Matching for Self-Indexes

Authors

Hrbek, L.; Holub, J.

Year

2016

Published

Proceedings of Data Compression Conference 2016. New York: IEEE Computer Society Press, 2016. pp. 604. ISSN 1068-0314. ISBN 978-1-5090-1853-6.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Self-index is a data structure that allows to find all exact occurrences of given pattern of length $m$ in $m$ steps. The paper presents an approach how to use self-index for approximate string matching with no more than $k$~differences (given by Hamming or Levenshtein distance). Our approach solution uses filtering based on the pigeonhole principle. For implementation we have selected FM-index. Our program is faster than BLAST for large texts (hundreds of MB) and more space efficient for shorter texts.

Positional Inverted Self-index

Authors

Procházka, P.; Holub, J.

Year

2016

Published

Proceedings of Data Compression Conference 2016. New York: IEEE Computer Society Press, 2016. pp. 627. ISSN 1068-0314. ISBN 978-1-5090-1853-6.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

We address the problem of positional indexing in natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. The inverted lists of single terms are combined into one inverted list that represents a backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with presentation layer that stores separators, stopwords, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The trade-off between compression effectiveness and the decompression speed can be tuned by different parameters of the index algorithm. ⌉ ×

Towards Efficient Positional Inverted Index

Authors

Procházka, P.; Holub, J.

Year

2016

Published

Proceedings of International Conference on Data Compression, Communication, Processing and Security 2016. Salerno: Universita di Salerno, 2016.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Abstract—We address the problem of positional indexing in the natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. Inverted lists of single terms are combined into one inverted list that represents the backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stop words, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The trade- off between compression effectiveness and decompression speed can be tuned by different parameters of the index algorithm. The space complexity of the PISI inverted list is O((N − n)⌈log 2b N ⌉+(⌊ N α −n ⌋+n)×(⌈log2b n⌉+1)) where N is a number of stems, n is a number of unique stems, α is a step/period of the back pointers in the inverted list and b is the size of the word of computer memory given ∑ in N bits. The n(i) space complexity N′′ of the presentation layer is O(− i=1 ⌈log2 pi ⌉− j=1 ⌈log2 pj ⌉+N ) n(i) with respect to pi as a probability of a stem variant at position i, p′ j as the probability of separator or stop word at ′position j and N as the number of separators and stop words. Our experiments prove that PISI is close to standard positional inverted index in terms of search speed and, at the same time, it is more efficient in memory consumption. PISI also proved that it is significantly faster than its close competitor FWCSA [17] in terms of search speed at the same level of memory consumption.

Compression of a Set of Files with Natural Language Content

Authors

Procházka, P.; Holub, J.

Year

2015

Published

Computer Journal. 2015, 58(5), 1169-1185. ISSN 0010-4620.

Type

Article

DOI

10.1093/comjnl/bxu046

Departments

Department of Theoretical Computer Science

Annotation

An algorithm for very efficient compression of a set of natural language text files is presented. Not only a very good compression ratio is reached, the used compression method allows fast pattern matching in compressed text, which is an attractive property especially for search engines. Much information is stored in the form of a large collection of text files. The web search engines can store the web pages in the raw text form to build so-called snippets or to perform so-called positional ranking functions on them. Furthermore, there exist many other similar contexts such as the storage of emails, application logs or the databases of text files (literary works or technical reports). In this paper, we address the problem of the compression of a large collection of text files distributed in cluster of computers, where the single files need to be randomly accessed in very short time. The compression algorithm is based on a word-based approach and the idea of combination of two statistical models: global model (common for all the files of the set) and local model. The latter is built as a set of changes that transform the global model to the proper model of the single compressed file.

Incremental Locality & Clustering-Based Compression

Authors

Krčál, L.; Holub, J.

Year

2015

Published

Proceedings of Data Compression Conference 2015. Los Alamitos: IEEE Computer Society Press, 2015. pp. 203-212. ISSN 1068-0314. ISBN 978-1-4799-8430-5.

Type

Proceedings paper

DOI

10.1109/DCC.2015.23

Departments

Department of Theoretical Computer Science

Annotation

Current compression solutions either use a limited size locality-based context or the entire input, to which the compressors adapt. This results in suboptimal compression effectiveness due to missing similarities further apart in the former case, or due to too generic adaptation. There are many deduplication and near deduplication systems that search for similarity across the entire input. Although most of these systems excel with their simplicity and speed, none of those goes deeper in terms of full-scale redundancy removal. We propose a novel compression and archival system called ICBCS. Our system goes beyond standard measures for similarity detection, using extended similarity hash and incremental clustering techniques to determine groups of sufficiently similar chunks designated for compression. ICBCS outperforms conventional file compression solutions on datasets consisting of at least mildly redundant files. It also shows that selective application of weak compressor results in better compression ratio and speed than conventional application of a strong compressor.

Compressing Similar Biological Sequences using FM-index

Authors

Procházka, P.; Holub, J.

Year

2014

Published

Proceedings of Data Compression Conference 2014. Los Alamitos: IEEE Computer Society, 2014. p. 312-321. ISSN 1068-0314. ISBN 978-1-4799-3882-7.

Type

Proceedings paper

DOI

10.1109/DCC.2014.47

Departments

Department of Theoretical Computer Science

Annotation

Nowadays, decreasing cost and better accessibility of sequencing methods have enabled studies of genetic variation between individuals of the same species and also between two related species. This has led to a rapid increase in biological data consisting of sequences that are very similar to each other, these sequences usually being stored together in one database. We propose a compression method based on Wavelet Tree FM-index optimized for compression of a set of similar biological sequences. The compression method is based on tracking single changes (together with their context) between every single sequence and the chosen reference sequence. We call our compression method BIO-FMI. It gives very promising results in compression ratio and in locate time when performed on an extremely repetitive data set (less than 0.5 % mutations) and when the searched patterns are of smaller lengths (less than 20 bases). BIO-FMI is competitive in extraction speed and it seems to be superior in time needed to build the index, especially in the case when the alignments of single sequences are given in advance.

ACM Student Project of the Year 2013 Competition

Authors

Holub, J.

Year

2013

Published

Information Sciences and Technologies, Bulletin of the ACM Slovakia. 2013, 5(4), 55-56. ISSN 1338-1237.

Type

Article

Departments

Department of Theoretical Computer Science

Annotation

The paper reports on finals of ACM Student Project of the Year 2013 Competition.

Natural Language Compression Optimized for Large Set of Files

Authors

Procházka, P.; Holub, J.

Year

2013

Published

Proceedings of Data Compression Conference 2013. Los Alamitos, CA: IEEE Computer Soc., 2013. p. 514. ISSN 1068-0314. ISBN 978-0-7695-4965-1.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

The paper describes one of the typical scenarios of data compression for large collections of files. We address the problem of the compression of a large collection of text files distributed in cluster of computers, where the single files need to be randomly accessed in very short time.

ODC: Frame for definition of Dense codes

Authors

Procházka, P.; Holub, J.

Year

2013

Published

European Journal of Combinatorics. 2013, 2013(1), 52-68. ISSN 0195-6698.

Type

Article

DOI

10.1016/j.ejc.2012.07.014

Departments

Department of Theoretical Computer Science

Annotation

The natural language compression made a great progress in last two decades. The main step in this evolution was the introduction of word-based compression by Moffat. Another improvement came with so called Dense codes which proved to be very fast in compression and decompression while keeping good compression ratio and direct search capability. Many variants of the Dense codes were described, each of them using its own definition. In this paper we present generalized concept of dense coding called Open Dense Code (ODC) which aims to be a frame for definition of many other dense code schemas. ODC underlines common features of the dense code schemas but at the same time allows to express the divergences of each of them. Using the frame of ODC we present two new word-based statistical compression algorithms based on dense coding idea: Two Byte Dense Code (TBDC) and Self-Tuning Dense Code (STDC). Our algorithms improve the compression ratio and are considerate to smaller files which are very often omitted.

Suffix Tree of Alignment: An Efficient Index for Similar Data

Authors

Na, J. C.; Park, H.; Crochemore, M.; Holub, J.; Iliopoulos, C. S.; Mouchard, L.; Park, K.

Year

2013

Published

Proceedings of the 24th Workshop on Combinatorial Algorithms (IWOCA 2013). Berlin: Springer, 2013. pp. 337-348. ISSN 0302-9743. ISBN 978-3-642-45277-2.

Type

Proceedings paper

DOI

10.1007/978-3-642-45278-9_29

Departments

Department of Theoretical Computer Science

Annotation

We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings A and B is a compacted trie representing all suffixes in A and B. It has |A|+|B| leaves and can be constructed in O(|A|+|B|) time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of A and B. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of A and B has |A|+ld +l1 leaves where ld is the sum of the lengths of all parts of B different from A and l1 is the sum of the lengths of some common parts of A and B. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern P in O(|P | + occ) time where occ is the number of occurrences of P in A and B. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires O(|A| + ld + l1 + l2 ) time where l2 is the sum of the lengths of other common substrings of A and B. When the suffix tree of A is already given, it requires O(ld + l1 + l2 ) time.

The Finite Automata Approaches for Bioinformatics

Authors

Holub, J.

Year

2012

Published

Proceedings of 22nd Theorietag Automata and Formal Languages. Praha: Matfyzpress, 2012. pp. 1-2. ISBN 978-80-7378-221-4.

Type

Invited/Awarded proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Stringology is a computer science on string and sequence processing. Finite automata can solve efficiently many tasks in stringology. An overview of four approaches of the finite automata use in stringology is presented: deterministic finite automaton, deterministic simulation of nondeterministic finite automaton, finite automaton as a model of computation, and compositions of finite automata solutions. The way the finite automata can process strings build over more complex alphabet than just single symbols (degenerate symbols, strings, variables) is also shown. Those approaches and complex alphabet use are demonstrated in various tasks of bioinformatics. The tasks work with sequences on various alphabets: DNA, RNA, amino acids. The searched patterns may be explicitly defined or may be defined by their properties. Some patterns may allow errors some must match exactly. Some pattern symbols may match exactly one symbol, some may match a subset of alphabet as they were not read exactly or their practical behaviour is the same.

The Finite Automata Approaches in Stringology

Authors

Holub, J.

Year

2012

Published

Kybernetika. 2012, 48(3), 386-401. ISSN 0023-5954.

Type

Article

Departments

Department of Theoretical Computer Science

Annotation

We present an overview of four approaches of the finite automata use in stringology: deterministic finite automaton, deterministic simulation of nondeterministic finite automaton, finite automaton as a model of computation, and compositions of finite automata solutions. We also show how the finite automata can process strings build over more complex alphabet than just single symbols (degenerate symbols, strings, variables).

Block-oriented Dense Compressor

Authors

Procházka, P.; Holub, J.

Year

2011

Published

Proceedings Data Compression Conference. Los Alamitos, CA: IEEE Computer Soc., 2011. pp. 372. ISSN 1068-0314. ISBN 978-0-7695-4352-9.

Type

Proceedings paper

DOI

10.1109/DCC.2011.76

Departments

Department of Theoretical Computer Science

Annotation

We address the problem of block-oriented natural language compression. The block-oriented compression is semi-adaptive in terms of one block but it is adaptive in terms of whole input. Our block-oriented compression method is based on Dense Code idea. It achieves very good compression ratio around 32 % on natural language text. We show that our compression method has some interesting properties which could be applied in digital libraries and other textual databases. The compression method allows direct searching on compressed text. Moreover the vocabulary can be used as a block index which makes some kinds of searching very fast. Another property is that the compressor can send single blocks with corresponding vocabulary which is considerate to limited bandwidth. In addition the compressed file can be continuously extended without need of previous decompression.

Finite Automata in Pattern Matching

Authors

Holub, J.

Year

2011

Published

Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications. New York: J. Willey, 2011. p. 51-71. ISBN 978-0-470-50519-9.

Type

Book chapter

Departments

Department of Theoretical Computer Science

Annotation

Stringology is a part of Computer Science dealing with processing strings and sequences. It finds many important applications in various fields used by Information Society. Biological sequences processing is one of the most important fields. On the other hand the finite automata theory is a well developed formal system used for a long time in the area of compiler construction. The chapter aims to show various approaches of the finite automata use in Stringology. The approaches are demonstrated on practical examples. Of course it is impossible to describe all the approaches as it would be out of scope of the chapter. However, we would like to present basic approaches that the reader can modify and combine to a given task.

Lossless Data Compression Testbed: ExCom and Prague Corpus

Authors

Holub, J.; Řezníček, Jakub

Year

2011

Published

Proceedings Data Compression Conference. Los Alamitos, CA: IEEE Computer Soc., 2011. pp. 457. ISSN 1068-0314. ISBN 978-0-7695-4352-9.

Type

Proceedings paper

DOI

10.1109/DCC.2011.61

Departments

Department of Theoretical Computer Science

Annotation

Authors developing new algorithms have to compare their implementations with the implementations of the other algorithms. They can do a testing on standard corpora and compare the achieved compression ratio with published compression ratios made on the same corpus. The corpora are static collections of files so they are getting older and do not contain new file types that appear. Moreover one can compare only the compression ratio. The times of compression and decompression are very important as well. In order to compare the times one have to run the implementation on the same platform and the same hardware. So one have to ask the authors for the source codes and incorporate them into his system. Then the tests could be run on the same machine.

Natural Language Compression per Blocks

Authors

Procházka, P.; Holub, J.

Year

2011

Published

First International Conference on Data Compression, Communications and Processing. Los Alamitos: IEEE Computer Society, 2011. pp. 67-75. ISBN 978-0-7695-4528-8.

Type

Proceedings paper

DOI

10.1109/CCP.2011.25

Departments

Department of Theoretical Computer Science

Annotation

We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per blocks. It means that the input is divided into the several blocks and each of the blocks is compressed separately according to its own statistical model. To avoid the redundancy the final vocabulary file is composed is the sequence of the changes in the model of the two consecutive blocks. STBDC belongs to the family of Dense codes and keeps all their attractive properties including very high compression and decompression speed and acceptable compression ratio around 32 % on natural language text. Moreover STBDC provides other properties applicable in digital libraries and other textual databases. The compression method allows direct searching on the compressed text, whereas the vocabulary can be used as a block index. STBDC is very easy on limited bandwidth in the client/server architecture.

An algorithm for mapping short reads to a dynamically changing genomic sequence

Authors

Flouri, T.; Holub, J.; Iliopoulos, C.S.I.; Pissis, S.P.P.

Year

2010

Published

2010 IEEE International Conference on Bioinformatics and Biomedicine. Los Alamitos: IEEE Computer Society, 2010. p. 133-136. ISBN 978-1-4244-8305-1.

Type

Proceedings paper

DOI

10.1109/BIBM.2010.5706551

Departments

Department of Theoretical Computer Science

Annotation

The constant advances in sequencing technology have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads), during a single experiment, and with a much lower cost than previously possible. Due to this massive amount of data, efficient algorithms for mapping these reads to reference sequences are in great demand, and recently, there has been ample work for publishing such algorithms. In this paper, we study a different version of this problem: mapping these reads to a dynamically changing genomic sequence. We propose a new practical algorithm, which employs a suitable data structure that takes into account potential dynamic effects (replacements, insertions, deletions) on the genomic sequence. The presented experimental results demonstrate that the proposed approach can be applied to address the problem of mapping millions of reads to multiple genomic sequences.

Data compression of Natural languages

Authors

Jaroš, J.; Holub, J.

Year

2010

Published

Workshop 2010. Praha: České vysoké učení technické v Praze, 2010. pp. 100-101. CTU Reports. ISBN 978-80-01-04513-8.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

In this contribution we present the new data compression schema for natural languages compression. We use two-part compression model in this schema: the static part of the model and the adaptive part of the model. The adaptive part of model is similar to the common adaptive model. The best-suited compression algorithms for the adaptive part with respect to the static part of the model are very efficient PPM algorithms. The static part of the model can be represented by many possibilities.

Improving Practical Exact String Matching

Authors

Ďurian, B.; Holub, J.; Peltola, H.; Tarhio, J.

Year

2010

Published

Information Processing Letters. 2010, 110(4), 148-152. ISSN 0020-0190.

Type

Article

DOI

10.1016/j.ipl.2009.11.010

Departments

Department of Theoretical Computer Science

Annotation

We present improved variations of the BNDM algorithm for exact string matching. At each alignment the algorithms read a $q$-gram before testing the state variable. In addition we apply reading a 2-gram in one instruction.

Semi-static Word-based Natural Language Compression

Authors

Jaroš, J.; Holub, J.

Year

2010

Published

Proceedings of the 11th International PhD Workshop on Systems and Control A Young Generation Viewpoint. Veszprém: University of Pannonia, 2010. pp. 57-62. ISBN 978-615-5044-00-7.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

The main technique of reducing loads of the space needed to store the data or reducing the time needed to transmit the data is the data compression. The novel semi-static word-based data compression schema for natural languages compression is presented in this paper. The two-part compression model (the static part of the model and the adaptive part of the model) is used in this schema. The compression ratio it achieves is the same as the compression ratio of used specific adaptive data compression algorithms in the worst case.

A Different Proof of the Crochemore-Ilie Lemma Concerning Microruns

Authors

Franek, F.; Holub, J.

Year

2009

Published

London Algorithmics 2008: Theory and Practice. London: College Publications, 2009. pp. 1-9. ISBN 978-1-904987-97-0.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

In a resent paper, Crochemore and Ilie improved the upper bound of the maxrun function ro(n)=max{r(x) : |x| = n}, where r(x) denotes the number of runs in a string x, to 1.6n (computational results on Ilie's web page claim the improvement to 1.048n.) Their proof is essentially in two major steps: first step shows that there is at most 1center of a d-run on average in each interval of length d, in the second step they showed that ro9(n) <= n, where ro9(n) is the count limited to runs of size <= 9. The proof of ro9(n) <= n they presented relies on computation. We present an alternative simpler proof, also computational, to provide an independent verification of Crochemore-Ilie's result. The method used is completely different and has a potential to be verified without an aid of computer.

New Word-based Adaptive Dense Compressors

Authors

Procházka, P.; Holub, J.

Year

2009

Published

Combinatorial Algorithms. Berlin: Springer, 2009. pp. 420-431. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-642-10216-5.

Type

Proceedings paper

DOI

10.1007/978-3-642-10217-2_41

Departments

Department of Theoretical Computer Science

Annotation

In the last two decades the natural language compression made a great progress. The main step in this evolution was the introduction of word-based compression by Mofiat. The word-based statistical compression algorithms are able to achieve 35% improvement in the compression ratio in comparison with character-based ones. We present two new word-based statistical compression algorithms based on dense coding idea: Two Byte Dense Code (TBDC) and Self-Tuning Dense Code (SCDC). TBDC uses the codewords with maximal size 2 bytes and must be implemented with some pruning technique. STDC is able to tune its code space during the compression process and so achieve better compression. Our algorithms improve the compression ratio and are considerate to smaller ffles which are very often omitted. We present also a generalized concept of dense coding called Open Dense Code (ODC) which provides a frame for deffnition of these two and many other dense code schemas.

NFA Simulation Using Deterministic State Cache

Authors

Holub, J.; Kadlec, T.

Year

2009

Published

London Algorithmics 2008: Theory and Practice. London: College Publications, 2009. pp. 152-166. ISBN 978-1-904987-97-0.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science
Department of Software Engineering

Annotation

The nondeterministic finite automaton (NFA) cannot be directly used because of its nondeterminism. There are two ways to use it: Firstly, transform it to the equivalent deterministic finite automaton, or secondly, simulate the run of NFA in a deterministic way. We present a deterministic state cache method that combines these two approaches. It uses the fast memory of already used deterministic states, which can under certain circumstances dramatically accelerate the basic simulation method, while we can control the amount of memory used. We present an implementation based on a hash array.

On Parallel Implementations of Deterministic Finite Automata

Authors

Holub, J.; Štekr, S.

Year

2009

Published

Implementation and Application of Automata. Heidelberg: Springer, 2009. p. 54-64. ISSN 0302-9743. ISBN 978-3-642-02978-3.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

We present implementations of parallel DFA run methods and find whether and under what conditions is worthy to use the parallel methods of simulation of run of finite automata. First, we introduce the parallel DFA run methods for general DFA, which are universal, but due to the dependency of simulation time on the number of states vertical bar Q vertical bar of automaton being run, they are suitable only for run of automata with the smaller number of states. Then we show that if we apply some restrictions to properties of automata being run, we can reach the linear speedup compared to the sequential simulation method. We designed methods benefiting from k-locality that allows optimum parallel run of exact and approximate pattern matching automata. Finally, we show the results of experiments conducted on two types of parallel computers (Cluster of workstations and Symmetric shared-memory multiprocessors).

Algorithms on Indeterminate Strings

Authors

Holub, J.; Smyth, W.F.

Year

2003

Published

Proceedings of the 14th Australian Workshop on Combinatorial Algorithms AWOCA'03. Seoul: Seoul National University, 2003, pp. 36-45. ISBN 89-7282-326-0.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

A recent paper describes an average-case linear-time algorithm for the calculation of all the borders of every prefix of a string x, where x can contain ''don't-care'' letters that match with any letter of the alphabet. In this paper we show that in fact two distinct cases arise for the border calculation - we call them ''quantum'' and ''deterministic'', and both of them can arise in practice. The existing algorithm applies to the quantum case, but we show how to modify it to deal with the deterministic case. We then extend the border algorithms to apply also to what we call ''indeterminate strings''. We also discuss other algorithms on these indeterminate strings, some of them left as open problems.