Data Science Laboratory

The laboratory covers data science in its full breadth, from data processing and visualization to applications of machine learning and artificial intelligence. We have results in basic research as well as in applied research commissioned by companies.

More about us

What does the laboratory do?

Publications

Deep Bayesian Semi-Supervised Active Learning for Sequence Labelling

Authors
Šabata, T.; Holeňa, M.; Páll, J.E.
Year
2019
Published in
Proceedings of the Workshop on Interactive Adaptive Learning co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2019). CEUR Workshop Proceedings, 2019. p. 80-95. vol. 2444. ISSN 1613-0073.
Type
Proceedings paper
Abstract
In recent years, deep learning has shown supreme results in many sequence labelling tasks, especially in natural language processing. However, it typically requires a large training data set compared with statistical approaches. In areas where collecting of unlabelled data is cheap but labelling expensive, active learning can bring considerable improvement. Sequence learning algorithms require a series of token-level labels for a whole sequence to be available during the training process. Annotators of sequences typically label easily predictable parts of the sequence although such parts could be labelled automatically instead. In this paper, we introduce a combination of active and semi-supervised learning for sequence labelling. Our approach utilizes an approximation of Bayesian inference for neural nets using Monte Carlo dropout. The approximation yields a measure of uncertainty that is needed in many active learning query strategies. We propose Monte Carlo token entropy and Monte Carlo N-best sequence entropy strategies. Furthermore, we use semi-supervised pseudo-labelling to reduce labelling effort. The approach was experimentally evaluated on multiple sequence labelling tasks. The proposed query strategies outperform other existing techniques for deep neural nets. Moreover, the semi-supervised learning reduced the labelling effort by almost 80% without any incorrectly labelled samples being inserted into the training data set.
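As a rough illustration of the uncertainty measure mentioned in the abstract above, the sketch below computes a per-token entropy from several stochastic forward passes (as Monte Carlo dropout would produce). It assumes that the token entropy is taken over the class distribution averaged across passes; the paper's exact definition may differ, and the names `mc_token_entropy` and `sequence_uncertainty` are hypothetical.

```python
import math


def mc_token_entropy(mc_probs):
    """Entropy of the pass-averaged class distribution for every token.

    mc_probs[t][i][c] is the predicted probability of class c for token i
    in stochastic forward pass t (e.g. one pass with dropout left active).
    Assumption: entropy is computed on the mean distribution over passes.
    """
    n_passes = len(mc_probs)
    n_tokens = len(mc_probs[0])
    n_classes = len(mc_probs[0][0])
    entropies = []
    for i in range(n_tokens):
        # Average the class probabilities of token i over all passes.
        mean = [
            sum(mc_probs[t][i][c] for t in range(n_passes)) / n_passes
            for c in range(n_classes)
        ]
        # Shannon entropy in nats; zero-probability classes contribute nothing.
        entropies.append(-sum(p * math.log(p) for p in mean if p > 0.0))
    return entropies


def sequence_uncertainty(mc_probs):
    """Hypothetical query score: the sum of token entropies of one sequence."""
    return sum(mc_token_entropy(mc_probs))


# Two passes, two tokens, two classes: the passes agree on token 0
# and disagree completely on token 1.
example = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
token_entropies = mc_token_entropy(example)
```

In an active learning loop, sequences with the highest score would be sent to the annotator first, while confidently predicted tokens are candidates for pseudo-labelling.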

Constructing a Data Visualization Recommender System

Authors
Friedjungová, M.; Kubernátová, P.; van Duijn, M.
Year
2019
Published in
Data Management Technologies and Applications. Cham: Springer International Publishing, 2019. p. 1-25. vol. 862. ISSN 1865-0937. ISBN 978-3-030-26636-3.
Type
Proceedings paper
Abstract
Choosing a suitable visualization for data is a difficult task. Current data visualization recommender systems exist to aid in choosing a visualization, yet suffer from issues such as low accessibility and indecisiveness. In this study, we first define a step-by-step guide on how to build a data visualization recommender system. We then use this guide to create a model for a data visualization recommender system for non-experts that aims to resolve the issues of current solutions. The result is a question-based model that uses a decision tree and a data visualization classification hierarchy in order to recommend a visualization. Furthermore, it incorporates both task-driven and data characteristics-driven perspectives, whereas existing solutions seem to either convolute these or focus on one of the two exclusively. Based on testing against existing solutions, it is shown that the new model reaches similar results while being simpler, clearer, more versatile, extendable and transparent. The presented guide can be used as a manual for anyone building a data visualization recommender system. The resulting model can be applied in the development of new data visualization software or as part of a learning tool.

All publications

Online Abstraction with MDP Homomorphisms for Deep Learning

Authors
Bíža, O.; Platt, R.
Year
2019
Published in
Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. New York: ACM, 2019. p. 1125-1133. ISSN 2523-5699. ISBN 978-1-4503-6309-9.
Type
Proceedings paper
Abstract
Abstraction of Markov Decision Processes is a useful tool for solving complex problems, as it can ignore unimportant aspects of an environment, simplifying the process of learning an optimal policy. In this paper, we propose a new algorithm for finding abstract MDPs in environments with continuous state spaces. It is based on MDP homomorphisms, a structure-preserving mapping between MDPs. We demonstrate our algorithm’s ability to learn abstractions from collected experience and show how to reuse the abstractions to guide exploration in new tasks the agent encounters. Our novel task transfer method outperforms baselines based on a deep Q-network in the majority of our experiments. The source code is at https://github.com/ondrejba/aamas_19.

Missing Features Reconstruction and Its Impact on Classification Accuracy

Year
2019
Published in
Computational Science – ICCS 2019. Springer, Cham, 2019. p. 207-220. vol. 11538. ISBN 978-3-030-22744-9.
Type
Proceedings paper
Abstract
In real-world applications, we can encounter situations when a well-trained model has to be used to predict from a damaged dataset. The damage caused by missing or corrupted values can be either on the level of individual instances or on the level of entire features. Both situations have a negative impact on the usability of the model on such a dataset. This paper focuses on the scenario where entire features are missing which can be understood as a specific case of transfer learning. Our aim is to experimentally research the influence of various imputation methods on the performance of several classification models. The imputation impact is researched on a combination of traditional methods such as k-NN, linear regression, and MICE compared to modern imputation methods such as multi-layer perceptron (MLP) and gradient boosted trees (XGBT). For linear regression, MLP, and XGBT we also propose two approaches to using them for multiple features imputation. The experiments were performed on both real world and artificial datasets with continuous features where different numbers of features, varying from one feature to 50%, were missing. The results show that MICE and linear regression are generally good imputers regardless of the conditions. On the other hand, the performance of MLP and XGBT is strongly dataset dependent. Their performance is the best in some cases, but more often they perform worse than MICE or linear regression.
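The scenario in the abstract above, where an entire feature is missing at prediction time, can be sketched with the simplest of the listed imputers, linear regression. This is a minimal single-predictor sketch, not the paper's experimental setup: the helper names `fit_simple_linear` and `impute_missing_feature` are hypothetical, and using exactly one known feature as the predictor is an assumption made to keep the example short.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_missing_feature(train_rows, test_rows, known_idx, missing_idx):
    """Fill the column missing_idx of test_rows from the column known_idx.

    The regression is fitted on complete training rows and then applied to
    test rows, in which the whole feature missing_idx is absent (None).
    """
    slope, intercept = fit_simple_linear(
        [row[known_idx] for row in train_rows],
        [row[missing_idx] for row in train_rows],
    )
    imputed = []
    for row in test_rows:
        filled = list(row)  # copy, so the caller's data is not modified
        filled[missing_idx] = slope * row[known_idx] + intercept
        imputed.append(filled)
    return imputed


# Training data follows feature_1 = 2 * feature_0 + 1 exactly.
train = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]
test = [[4.0, None]]
completed = impute_missing_feature(train, test, known_idx=0, missing_idx=1)
```

The completed rows can then be passed to the already trained classifier, which is the step whose accuracy the paper measures.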

Chameleon 2: An Improved Graph-Based Clustering Algorithm

Authors
Kordík, P.; Bartoň, T.; Brůna, T.
Year
2019
Published in
ACM Transactions on Knowledge Discovery from Data. 2019, 13(1), ISSN 1556-4681.
Type
Journal article
Abstract
Traditional clustering algorithms fail to produce human-like results when confronted with data of variable density, complex distributions, or in the presence of noise. We propose an improved graph-based clustering algorithm called Chameleon 2, which overcomes several drawbacks of state-of-the-art clustering approaches. We modified the internal cluster quality measure and added an extra step to ensure algorithm robustness. Our results reveal a significant positive impact on the clustering quality measured by Normalized Mutual Information on 32 artificial datasets used in the clustering literature. This significant improvement is also confirmed on real-world datasets. The performance of clustering algorithms such as DBSCAN is extremely parameter sensitive, and exhaustive manual parameter tuning is necessary to obtain a meaningful result. All hierarchical clustering methods are very sensitive to cutoff selection, and a human expert is often required to find the true cutoff for each clustering result. We present an automated cutoff selection method that enables the Chameleon 2 algorithm to generate high-quality clustering in autonomous mode.

Comparing Offline and Online Evaluation Results of Recommender Systems

Authors
Kordík, P.; Podsztavek, O.; Řehořek, T.; Bartyzal, R.; Bíža, O.; Povalyev, I.P.
Year
2018
Published in
REVEAL RecSys 2018 workshop proceedings. New York: ACM, 2018.
Type
Proceedings paper
Abstract
Recommender systems are usually trained and evaluated on historical data. Offline evaluation is, however, tricky and offline performance can be an inaccurate predictor of the online performance measured in production due to several reasons. In this paper, we experiment with two offline evaluation strategies and show that even a reasonable and popular strategy can produce results that are not just biased, but also in direct conflict with the true performance obtained in the online evaluation. We investigate offline policy evaluation techniques adapted from reinforcement learning and explain why such techniques fail to produce an unbiased estimate of the online performance in the “watch next” scenario of a large-scale movie recommender system. Finally, we introduce a new evaluation technique based on Jaccard Index and show that it correlates with the online performance.
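For readers unfamiliar with the Jaccard index named in the abstract above, the sketch below computes it for two sets of item identifiers. How the paper turns this similarity into an evaluation technique is not described here, so the example only shows the index itself; the function name `jaccard_index` and the choice to return 0.0 for two empty sets are assumptions.

```python
def jaccard_index(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are treated as having zero overlap.
        return 0.0
    return len(set_a & set_b) / len(union)


# Two of the four distinct items appear in both lists.
recommended = [1, 2, 3]
watched = [2, 3, 4]
overlap = jaccard_index(recommended, watched)
```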

Discovering predictive ensembles for transfer learning and meta-learning

Authors
Frýda, T.; Kordík, P.; Černý, J.
Year
2018
Published in
Machine Learning. 2018, 107(1), 177-207. ISSN 0885-6125.
Type
Journal article
Abstract
Recent meta-learning approaches are oriented towards algorithm selection, optimization or recommendation of existing algorithms. In this article we show how data-tailored algorithms can be constructed from building blocks on small data sub-samples. Building blocks, typically weak learners, are optimized and evolved into data-tailored hierarchical ensembles. Good-performing algorithms discovered by evolutionary algorithm can be reused on data sets of comparable complexity. Furthermore, these algorithms can be scaled up to model large data sets. We demonstrate how one particular template (simple ensemble of fast sigmoidal regression models) outperforms state-of-the-art approaches on the Airline data set. Evolved hierarchical ensembles can therefore be beneficial as algorithmic building blocks in meta-learning, including meta-learning at scale.

Where to find us?

Data Science Laboratory
Department of Applied Mathematics
Faculty of Information Technology
Czech Technical University in Prague

Room 1347 (Building A, 13th floor)
Thákurova 7
Praha 6 – Dejvice
160 00

Contact person

Ing. Karel Klouda, Ph.D.

Head of the Department of Applied Mathematics

Responsible for the content of this page: doc. Ing. Štěpán Starosta, Ph.D.