Ing. Ondřej Guth, Ph.D.

Inexact tree pattern matching with 1-degree edit distance using finite automata

Authors

Šestáková, E.; Guth, O.; Janoušek, J.

Year

2023

Published

Discrete Applied Mathematics. 2023, 330 78-97. ISSN 0166-218X.

Type

Article

DOI

10.1016/j.dam.2023.01.003

Departments

Department of Theoretical Computer Science

Annotation

Given an input tree and a tree pattern, the inexact tree pattern matching problem is about finding all subtrees in the input tree that match the tree pattern with up to errors. To measure the number of errors between two labeled ordered trees, we use the 1-degree edit distance that uses operations node relabel, leaf insert, and leaf delete. The proposed solution is based on a finite automaton that reads a given input tree represented in linear, prefix bar, notation. First, we show its construction for a constrained variant of 1-degree edit distance where leaf insert/delete operations cannot be applied to the tree pattern recursively. Then, we extend this approach to both unit cost and non-unit cost 1-degree edit distance. The deterministic version of the proposed finite automaton finds all inexact occurrences of the tree pattern in time linear to the size of the input tree. However, since the size of such automaton can be exponential in the number of nodes of the tree pattern, it is practical only for small patterns. Therefore, we also consider the dynamic programming approach as a way of simulating the nondeterministic finite automaton. This approach comes with the space complexity and time complexity where is the number of nodes of the tree pattern, is the number of nodes of the input tree, and is the maximum number of errors allowed.

On expressive power of regular expressions with subroutine calls and lookaround assertions

Authors

Guth, O.

Year

2023

Published

Proceedings of the Prague Stringology Conference 2023. Praha: CESKE VYSOKE UCENI TECHNICKE V PRAZE, 2023. p. 68-82. ISBN 978-80-01-07206-6.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Many regular expression engines employ syntactical extensions to provide simple, expressive support for real-world needs. These features are subroutine calls, zero-width lookaround assertions, DEFINE rules, and named parenthesised expressions. A subroutine call executes a specified subpattern where the call is placed, possibly recursively. Lookaround assertions are either lookahead or lookbehind: a lookahead is a conditional within a subpattern: when it fails, the match at the current position of the whole subpattern fails, while a lookahead itself does not consume any input; a lookbehind works as a lookahead except it checks the input prior to the current position. A DEFINE rule introduces a subpattern for use by a subroutine call, while not involved in matching where the rule is placed. A named parenthesised expression can be executed by its name in addition to the parenthesis number. This paper presents a formalisation of subroutine calls, DEFINE rules, and named parenthesised expressions using the matching relation while attempting to mimic the behaviour of real-world regular expression engines. Also, we give an alternative constructive proof of equivalence of expressive power of regular expressions extended with subroutine calls and the class of context-free languages: a conversion between such expressions and context-free grammars. Finally, the question of whether regular expressions with operations lookaround assertion combined with subroutine call have greater expressive power than expressions with only subroutine call is answered positively.

Automata Approach to Inexact Tree Pattern Matching Using 1-degree Edit Distance

Authors

Šestáková, E.; Guth, O.; Janoušek, J.

Year

2021

Published

Proceedings of the Prague Stringology Conference 2021. Praha: CESKE VYSOKE UCENI TECHNICKE V PRAZE, 2021. p. 1-15. ISBN 978-80-01-06869-4.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

We compare labeled ordered trees based on unit cost 1-degree edit distance that uses operations vertex relabeling, leaf insertion, and leaf deletion. Given an input tree T and a tree pattern P, we find all subtrees in T that match P with up to k errors. We show that this problem can be solved by finite automaton when T and P are represented in linear, prefix bar, notation. First, we solve this problem by a pushdown automaton. Then, we show that it can be transformed into a nondeterministic finite automaton due to its restricted use of the pushdown store. We also show a simulation of the nondeterministic finite automaton by dynamic programming.

Co-Teaching Computer Science Across Borders: Human-Centric Learning at Scale

Authors

Piech, C.; Yan, L.; Einstein, L.; Saavedra, A.; Bozkurt, B.; Šestáková, E.; Guth, O.; McKeown, N.

Year

2020

Published

L@S '20: Proceedings of the Seventh ACM Conference on Learning @ Scale. New York: Association for Computing Machinery, 2020. p. 103-113. ISBN 978-1-4503-7951-9.

Type

Proceedings paper

DOI

10.1145/3386527.3405915

Departments

Department of Theoretical Computer Science

Annotation

Programming is fast becoming a required skill set for students in every country. We present CS Bridge, a model for cross-border co-teaching of CS1, along with a corresponding open-source course-in-a-box curriculum made for easy localization. In the CS Bridge model, instructors and student-teachers from different countries come together to teach a short, stand-alone CS1 course to hundreds of local high school students. The corresponding open-source curriculum has been specifically designed to be easily adapted to a wide variety of local teaching practices, languages, and cultures. Over the past six years, the curriculum has been used to teach CS1 material to over 1,000 high school students in Colombia, the Czech Republic, Turkey, and Guinea. A large majority of our students continue on to study CS or CS-related fields in university. More importantly, many of our undergraduate student-teachers stay involved with teaching beyond the program. Joint teaching creates a positive, high-quality learning experience for students around the world and a powerful, high-impact professional development experience for the teaching team---instructors and student-teachers alike.

On approximate enhanced covers under Hamming distance

Authors

Guth, O.

Year

2020

Published

Discrete Applied Mathematics. 2020, 274 67-80. ISSN 0166-218X.

Type

Article

DOI

10.1016/j.dam.2019.01.015

Departments

Department of Theoretical Computer Science

Annotation

A border p of a string x is an enhanced cover of x if the number of positions of x that lie within some occurrence of p is the maximum among all borders of x. (String p is a border of x if p is both a proper prefix and a suffix of x.) In this paper, two more general notions based on enhanced covers are introduced: a k-approximate enhanced cover and a relaxed k-approximate enhanced cover, where a fixed maximum number of errors k under the Hamming distance is considered. The k-approximate enhanced cover of x is its border, and its k-approximate occurrences are also considered in the covered number of positions of x. The relaxed k-approximate enhanced cover of x is a factor of x and a k-approximate border of x. Algorithms that compute all the variations of k-approximate enhanced covers mentioned above are presented in this paper.

Computing All Approximate Enhanced Covers with the Hamming Distance

Authors

Guth, O.

Year

2016

Published

Proceedings of the Prague Stringology Conference 2016. Praha: Czech Technical University in Prague, 2016. p. 146-157. ISBN 978-80-01-05996-8.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

A border p of a string x is an enhanced cover of x if the number of positions of x that lie within some occurrence of p is the maximum among all borders of x. In this paper, more general notion based on the enhanced cover is introduced: a k-approximate enhanced cover, where fixed maximum number of errors k in the Hamming distance is considered. The k-approximate enhanced cover of x is its border and its k-approximate occurrences are also considered in the covered number of positions of x. An O(n^2)-time and a O(n)-space algorithm that computes all k-approximate enhanced covers of a string of length n is presented.

On left and right seeds of a string

Authors

Christou, M.; Crochemore, M.; Guth, O.; Iliopoulos, C.S.; Pissis, S.P.

Year

2012

Published

Journal of Discrete Algorithms. 2012, 17 31-44. ISSN 1570-8667.

Type

Article

DOI

10.1016/j.jda.2012.10.004

Departments

Department of Theoretical Computer Science

Annotation

We consider the problem of finding the repetitive structure of a given string y of length n. A factor u of y is a cover of y, if every letter of y lies within some occurrence of u in y. A string v is a seed of y, if it is a cover of a superstring of y. A left seed of y is a prefix of y, that is a cover of a superstring of y. Similarly, a right seed of y is a suffix of y, that is a cover of a superstring of y. An integer array LS is the minimal left-seed (resp. maximal left-seed) array of y, if LS[i] is the minimal (resp. maximal) length of left seeds of y[0..i]. The minimal right-seed (resp. maximal right-seed) arrayRS of y is defined in a similar fashion. In this article, we present linear-time algorithms for computing all left and right seeds of y, a linear-time algorithm for computing the minimal left-seed array of y, a linear-time solution for computing the maximal left-seed array of y, an O(n log n)-time algorithm for computing the minimal right-seed array of y, and a linear-time solution for computing the maximal right-seed array of y. All algorithms use linear auxiliary space.

On the Right-Seed Array of a String

Authors

Christou, M.; Crochemore, M.; Guth, O.; Iliopoulos, CS; Pissis, SP

Year

2011

Published

Proceedings of the 17th Annual International Computing and Combinatorics Conference (COCOON 2011). Berlin: Springer-Verlag, 2011. p. 492-502. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-642-22684-7.

Type

Proceedings paper

DOI

10.1007/978-3-642-22685-4_43

Departments

Department of Theoretical Computer Science

Annotation

We consider the problem of finding the repetitive structure of a given fixed string y. A factor u of y is a cover of y, if every letter of y falls within some occurrence of u in y. A factor v of y is a seed of y, if it is a cover of a superstring of y. There exist linear-time algorithms for solving the minimal cover problem. The minimal seed problem is of much higher algorithmic difficulty, and no linear-time algorithm is known. In this article, we solve one of its variants - computing the minimal and maximal right-seed array of a given string. A right seed of y is the shortest suffix of y that it is a cover of a superstring of y. An integer array RS is the minimal right-seed (resp. maximal right-seed) array of y, if RS [i] is the minimal (resp. maximal) length of right seeds of y[0..i]. We present an O(n log n) time algorithm that computes the minimal right-seed array of a given string, and a linear-time solution to compute the maximal right-seed array.

Using Finite Automata Approach for Searching Approximate Seeds of Strings

Authors

Guth, O.; Melichar, B.

Year

2010

Published

Intelligent Automation and Computer Engineering. New York: Springer, 2010. p. 347-360. ISSN 1876-1100. ISBN 978-90-481-3516-5.

Type

Book chapter

DOI

10.1007/978-90-481-3517-2_27

Departments

Department of Theoretical Computer Science

Annotation

Seed is a type of a regularity of strings. A restricted approximate seed w of string T is a factor of T such that w covers a superstring of T under some distance rule. In this paper, the problem of searching of all restricted seeds with the smallest Hamming distance is studied and a polynomial time and space algorithm for solving the problem is presented. It searches for all restricted approximate seeds of a string with given limited approximation using Hamming distance and it computes the smallest distance for each found seed. The solution is based on a finite (suffix) automata approach that provides a straightforward way to design algorithms to many problems in stringology. Therefore, it is shown that the set of problems solvable using finite automata includes the one studied in this paper.

Ing. Ondřej Guth, Ph.D.

Publications

Inexact tree pattern matching with 1-degree edit distance using finite automata

On expressive power of regular expressions with subroutine calls and lookaround assertions

Automata Approach to Inexact Tree Pattern Matching Using 1-degree Edit Distance

Co-Teaching Computer Science Across Borders: Human-Centric Learning at Scale

On approximate enhanced covers under Hamming distance

Computing All Approximate Enhanced Covers with the Hamming Distance

On left and right seeds of a string

On the Right-Seed Array of a String

Using Finite Automata Approach for Searching Approximate Seeds of Strings