
Ing. Jan Motl

Publications

Learning on a Stream of Features with Random Forest

Year
2019
Published in
Proceedings of the 19th Conference Information Technologies - Applications and Theory (ITAT 2019). Aachen: CEUR Workshop Proceedings, 2019. p. 79-83. ISSN 1613-0073.
Type
Conference proceedings paper
Abstract
We study an interesting and challenging problem, supervised learning on a stream of features, in which the size of the feature set is unknown and not all features are available for learning, while the number of observations stays constant. In this problem, the features arrive one at a time, and the learner's task is to train a model equivalent to a model trained from "scratch". When a new feature is inserted into the training set, a new set of trees is trained and added to the current forest. However, it is desirable to correct the selection bias: older features have more opportunities to get selected into trees than new features. We combat the selection bias by adjusting the feature selection distribution. While this correction works well, it may require training many new trees. To keep the number of new trees small, we furthermore put more weight on recent trees than on old trees.
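The two corrections described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the actual formulas, so the deficit-based sampling weights, the geometric `decay`, and the names `feature_sampling_weights`, `tree_batch_weights`, and `times_offered` are all hypothetical.

```python
import random


def feature_sampling_weights(times_offered):
    """Give features that were offered to fewer trees a higher sampling weight.

    times_offered maps a feature name to the number of existing trees that
    could have selected it. This deficit rule is an assumption, not the
    formula from the paper. The "+ 1" keeps every weight strictly positive.
    """
    ceiling = max(times_offered.values()) + 1
    deficits = {name: ceiling - count for name, count in times_offered.items()}
    total = sum(deficits.values())
    return {name: deficit / total for name, deficit in deficits.items()}


def tree_batch_weights(n_batches, decay=0.5):
    """Weight batches of trees so that recent batches count more.

    The geometric decay is a hypothetical choice; the newest batch is last.
    """
    raw = [decay ** (n_batches - 1 - i) for i in range(n_batches)]
    total = sum(raw)
    return [w / total for w in raw]


# Features f1..f3 arrived earlier; f4 has just arrived and was never offered.
times_offered = {"f1": 30, "f2": 20, "f3": 10, "f4": 0}
weights = feature_sampling_weights(times_offered)

# Draw candidate features for a new tree (with replacement, for brevity).
rng = random.Random(0)
candidates = rng.choices(list(weights), weights=list(weights.values()), k=2)

# Voting weights for three batches of trees, oldest first.
batch_weights = tree_batch_weights(3)
```

With these toy counts, the newest feature `f4` receives the largest sampling weight, and the most recent batch of trees receives the largest voting weight.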

An atlas of chromatin accessibility in the adult human brain

Authors
Motl, J.; Fullard, J.F.; Hauberg, M.E.; Bendl, J.; Egervari, G.; Cirnaru, M.D.; Reach, S.M.; Ehrlich, M.E.; Hurd, Y.L.; Roussos, P.
Year
2018
Published in
Genome Research. 2018, 28(8), 1243-1252. ISSN 1088-9051.
Type
Article
Abstract
Most common genetic risk variants associated with neuropsychiatric disease are noncoding and are thought to exert their effects by disrupting the function of cis regulatory elements (CREs), including promoters and enhancers. Within each cell, chromatin is arranged in specific patterns to expose the repertoire of CREs required for optimal spatiotemporal regulation of gene expression. To further understand the complex mechanisms that modulate transcription in the brain, we used frozen postmortem samples to generate the largest human brain and cell-type-specific open chromatin data set to date. Using the Assay for Transposase Accessible Chromatin followed by sequencing (ATAC-seq), we created maps of chromatin accessibility in two cell types (neurons and non-neurons) across 14 distinct brain regions of five individuals. Chromatin structure varies markedly by cell type, with neuronal chromatin displaying higher regional variability than that of non-neurons. Among our findings is an open chromatin region (OCR) specific to neurons of the striatum. When placed in the mouse, a human sequence derived from this OCR recapitulates the cell type and regional expression pattern predicted by our ATAC-seq experiments. Furthermore, differentially accessible chromatin overlaps with the genetic architecture of neuropsychiatric traits and identifies differences in molecular pathways and biological functions. By leveraging transcription factor binding analysis, we identify protein-coding and long noncoding RNAs (lncRNAs) with cell-type and brain region specificity. Our data provide a valuable resource to the research community and we provide this human brain chromatin accessibility atlas as an online database "Brain Open Chromatin Atlas (BOCA)" to facilitate interpretation.

Do We Need to Observe Features to Perform Feature Selection?

Year
2018
Published in
Proceedings of the 18th Conference Information Technologies - Applications and Theory (ITAT 2018). Aachen: CEUR Workshop Proceedings, 2018. p. 44-51. vol. 2203. ISSN 1613-0073. ISBN 9781727267198.
Type
Conference proceedings paper
Abstract
Many feature selection methods have been developed in the past, but at their core they all work the same way: you pass a set of features to the algorithm and get a reduced set of features back. But can we perform a non-trivial feature selection without first observing the features? This is an important question because if we were able to predict feature importance before observing the features, we would reduce the computational requirements of all stages of the machine learning process, beginning with feature engineering. In this article, we argue that it is possible to predict feature importance before observing the feature vector. The trick is that we use meta-features about the features to perform the feature selection. We evaluate the concept on 15 relational databases. On average, it was enough to generate the top decile of all features to get the same model accuracy as if we had generated all features and passed them to the model.

Violation of Independence of Irrelevant Alternatives in Friedman’s test

Year
2018
Published in
Proceedings of the 18th Conference Information Technologies - Applications and Theory (ITAT 2018). Aachen: CEUR Workshop Proceedings, 2018. p. 59-63. vol. 2203. ISSN 1613-0073. ISBN 9781727267198.
Type
Conference proceedings paper
Abstract
One of the most common methods for classifier comparison is Friedman's test. However, Friedman's test has a known flaw: the ranking of classifiers A and B depends not only on the properties of classifiers A and B, but also on the properties of all other evaluated classifiers. We illustrate the issue with the question: "What is better, bagging or boosting?". With Friedman's test, the answer depends on the presence or absence of irrelevant classifiers in the experiment. Based on the application of Friedman's test to an experiment with 179 classifiers and 121 datasets, we conclude that it is very easy to game the ranking of two insignificantly different classifiers. But once the difference becomes significant, it is unlikely that removing irrelevant classifiers will yield significant results with reversed ranking.
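The dependence on irrelevant classifiers can be illustrated with the average ranks on which Friedman's test is built. The sketch below uses made-up accuracies (not data from the paper) and a hypothetical helper `average_ranks`; ties are deliberately avoided so that no tie-handling rule is needed.

```python
def average_ranks(scores):
    """Return the mean rank of each classifier across datasets (1 = best).

    scores maps a classifier name to its accuracies, one per dataset.
    Ties are not handled; the toy data below contains none.
    """
    names = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {name: 0.0 for name in names}
    for d in range(n_datasets):
        # Higher accuracy on dataset d gets a lower (better) rank.
        ordered = sorted(names, key=lambda name: scores[name][d], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return {name: totals[name] / n_datasets for name in names}


# Toy accuracies on three datasets: A narrowly wins twice, B clearly wins once.
pair = {
    "A": [0.90, 0.90, 0.80],
    "B": [0.89, 0.89, 0.90],
}

# Two "irrelevant" classifiers that fall between B and A on the third dataset.
with_others = dict(pair, C=[0.50, 0.50, 0.85], D=[0.49, 0.49, 0.84])

ranks_pair = average_ranks(pair)
ranks_all = average_ranks(with_others)
```

With only A and B in the experiment, A has the better (lower) average rank. After adding C and D, B has the better average rank, although no accuracy of A or B has changed.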

Foreign Key Constraint Identification in Relational Databases

Year
2017
Published in
ITAT 2017: Information Technologies – Applications and Theory. Aachen: CEUR Workshop Proceedings, 2017. p. 106-111. vol. 1885. ISSN 1613-0073.
Type
Conference proceedings paper
Abstract
For relational learning, it is important to know the relationships between the tables. In relational databases, the relationships can be described with foreign key constraints. However, the foreign keys may not be explicitly specified. In this article, we present how to automatically and quickly identify primary and foreign key constraints from metadata about the data. Our method was evaluated on 72 databases and achieves an F-measure of 0.87 for foreign key constraint identification. The proposed method is significantly faster than related methods reported in the literature and is database vendor agnostic.

What Makes a Fairy Tale

Authors
Motl, J.; Nie, W.
Year
2017
Published in
International Journal of Computational Linguistics and Applications. 2017, 8(1), ISSN 0976-0962.
Type
Article
Abstract
Traditionally, fairy tales were analyzed by their plots; however, this approach was criticized because it omits tone, mood, character, and other attributes that further differentiate one fairy tale from another. To find the characteristic style of fairy tales written in English, factor analysis was applied to the extracted adjectives. The analysis gave rise to five unique factors describing the characteristic style of fairy tales.

Benchmarking Classifier Performance with Sparse Measurements

Authors
Year
2015
Published in
ITAT 2015 conference proceedings. Aachen: CEUR Workshop Proceedings, 2015. p. 172-178. ISSN 1613-0073. ISBN 978-1-5151-2065-0.
Type
Conference proceedings paper
Abstract
The presented paper describes a methodology for benchmarking when classifier performance measurements are sparse. The methodology is based on missing value imputation and was demonstrated to work even when 80% of measurements are missing, for example because of unavailable algorithm implementations or unavailable datasets. The methodology was then applied to 29 relational classifiers and propositional tools and 15 datasets, making it the biggest meta-analysis in relational classification to date.
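The idea of benchmarking on imputed measurements can be sketched as follows. The abstract does not say which imputation method the paper uses, so the additive rule below (row mean + column mean - global mean) is a stand-in chosen for brevity, and the accuracies and the name `impute_additive` are hypothetical.

```python
def impute_additive(matrix):
    """Fill missing cells (None) with row_mean + col_mean - global_mean.

    Rows are classifiers, columns are datasets. Means are computed from
    observed cells only. This additive rule is an illustrative assumption,
    not the imputation method from the paper.
    """
    observed = [v for row in matrix for v in row if v is not None]
    global_mean = sum(observed) / len(observed)

    def mean_or_global(values):
        present = [v for v in values if v is not None]
        return sum(present) / len(present) if present else global_mean

    row_means = [mean_or_global(row) for row in matrix]
    col_means = [mean_or_global(col) for col in zip(*matrix)]
    return [
        [
            v if v is not None else row_means[i] + col_means[j] - global_mean
            for j, v in enumerate(row)
        ]
        for i, row in enumerate(matrix)
    ]


# Toy accuracies: 3 classifiers x 3 datasets, with three missing measurements.
sparse = [
    [0.9, 0.8, None],
    [0.7, None, 0.5],
    [None, 0.5, 0.3],
]
filled = impute_additive(sparse)

# Compare classifiers on the completed matrix, here by mean accuracy.
mean_accuracy = [sum(row) / len(row) for row in filled]
```

Once the matrix is complete, every classifier can be compared on every dataset, which is what makes a benchmark over sparse measurements possible.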