Missing Features Reconstruction Using a Wasserstein Generative Adversarial Imputation Network

Year
2020
Published in
Computational Science – ICCS 2020. Cham: Springer, 2020. p. 225-239. vol. 12140. ISSN 1611-3349. ISBN 978-3-030-50423-6.
Type
Conference paper
Abstract
Missing data is one of the most common preprocessing problems. In this paper, we experimentally investigate the use of generative and non-generative models for feature reconstruction. Variational Autoencoder with Arbitrary Conditioning (VAEAC) and Generative Adversarial Imputation Network (GAIN) were studied as representatives of generative models, while the denoising autoencoder (DAE) represented non-generative models. The performance of the models is compared to the traditional methods k-nearest neighbors (k-NN) and Multiple Imputation by Chained Equations (MICE). Moreover, we introduce WGAIN, a Wasserstein modification of GAIN, which turns out to be the best imputation model when the degree of missingness is less than or equal to 30%. Experiments were performed on real-world and artificial datasets with continuous features where different percentages of features, varying from 10% to 50%, were missing. The algorithms were evaluated by measuring the accuracy of a classification model previously trained on the uncorrupted dataset. The results show that GAIN, and especially WGAIN, are the best imputers regardless of the conditions. In general, they outperform or are comparable to MICE, k-NN, DAE, and VAEAC.
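
As an illustration of the evaluation protocol described above, here is a minimal sketch assuming scikit-learn and an illustrative dataset; a k-NN imputer stands in for WGAIN/GAIN, which are not reproduced here.

```python
# Hedged sketch: train a classifier on uncorrupted data, corrupt the test
# features at a given missingness rate, impute, and measure accuracy.
# KNNImputer is only a stand-in for WGAIN/GAIN; dataset and rates are
# illustrative assumptions, not those used in the paper.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)  # trained on clean data
imputer = KNNImputer(n_neighbors=5).fit(X_train)

rng = np.random.default_rng(0)
for rate in (0.1, 0.3, 0.5):                      # degrees of missingness
    X_missing = X_test.copy()
    X_missing[rng.random(X_missing.shape) < rate] = np.nan   # MCAR corruption
    X_imputed = imputer.transform(X_missing)
    print(rate, accuracy_score(y_test, clf.predict(X_imputed)))
```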

Deep Bayesian Semi-Supervised Active Learning for Sequence Labelling

Authors
Šabata, T.; Páll, J.E.; Holeňa, M.
Year
2019
Published in
Proceedings of the Workshop on Interactive Adaptive Learning co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2019). CEUR Workshop Proceedings, 2019. p. 80-95. vol. 2444. ISSN 1613-0073.
Type
Conference paper
Abstract
In recent years, deep learning has shown excellent results in many sequence labelling tasks, especially in natural language processing. However, it typically requires a large training data set compared with statistical approaches. In areas where collecting unlabelled data is cheap but labelling it is expensive, active learning can bring considerable improvement. Sequence learning algorithms require a series of token-level labels for the whole sequence to be available during the training process. Annotators of sequences typically label easily predictable parts of the sequence even though such parts could be labelled automatically instead. In this paper, we introduce a combination of active and semi-supervised learning for sequence labelling. Our approach utilizes an approximation of Bayesian inference for neural nets using Monte Carlo dropout. The approximation yields a measure of uncertainty that is needed in many active learning query strategies. We propose Monte Carlo token entropy and Monte Carlo N-best sequence entropy strategies. Furthermore, we use semi-supervised pseudo-labelling to reduce the labelling effort. The approach was experimentally evaluated on multiple sequence labelling tasks. The proposed query strategies outperform other existing techniques for deep neural nets. Moreover, the semi-supervised learning reduced the labelling effort by almost 80% without any incorrectly labelled samples being inserted into the training data set.
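
A minimal sketch of the Monte Carlo token entropy strategy named above, assuming a hypothetical forward-pass function with dropout kept active at inference time (e.g. model.train() in PyTorch); the signature is an assumption, not the paper's API.

```python
# Hedged sketch of Monte Carlo token entropy: run the network T times with
# dropout active, average per-token class probabilities, score the sequence
# by its mean token entropy.  `predict_proba_with_dropout` is hypothetical.
import numpy as np

def mc_token_entropy(sequence, predict_proba_with_dropout, n_samples=20):
    # Each call is assumed to return an array of shape (seq_len, n_classes).
    probs = np.stack([predict_proba_with_dropout(sequence) for _ in range(n_samples)])
    mean_probs = probs.mean(axis=0)                          # MC average over samples
    token_entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)
    return token_entropy.mean()                              # higher = more uncertain

# An active learning loop would query the unlabelled sequences with the
# highest scores and pseudo-label the most confident ones.
```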

Constructing a Data Visualization Recommender System

Authors
Friedjungová, M.; Kubernátová, P.; van Duijn, M.
Year
2019
Published in
Data Management Technologies and Applications. Cham: Springer International Publishing, 2019. p. 1-25. vol. 862. ISSN 1865-0937. ISBN 978-3-030-26636-3.
Type
Conference paper
Abstract
Choosing a suitable visualization for data is a difficult task. Current data visualization recommender systems exist to aid in choosing a visualization, yet suffer from issues such as low accessibility and indecisiveness. In this study, we first define a step-by-step guide on how to build a data visualization recommender system. We then use this guide to create a model for a data visualization recommender system for non-experts that aims to resolve the issues of current solutions. The result is a question-based model that uses a decision tree and a data visualization classification hierarchy in order to recommend a visualization. Furthermore, it incorporates both task-driven and data characteristics-driven perspectives, whereas existing solutions seem to either convolute these or focus on one of the two exclusively. Based on testing against existing solutions, it is shown that the new model reaches similar results while being simpler, clearer, more versatile, extendable and transparent. The presented guide can be used as a manual for anyone building a data visualization recommender system. The resulting model can be applied in the development of new data visualization software or as part of a learning tool.
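
To make the question-based, decision-tree idea concrete, here is a toy sketch; the questions, answers, and chart types are illustrative assumptions, and the actual model relies on a data visualization classification hierarchy and a larger decision tree.

```python
# Hedged sketch of a question-driven recommendation flow in the spirit of the
# model above.  All rules are illustrative, not the paper's decision tree.
def recommend_visualization(task, n_variables, variable_type):
    if task == "composition":
        return "pie chart" if n_variables == 1 else "stacked bar chart"
    if task == "relationship":
        return "scatter plot" if n_variables == 2 else "bubble chart"
    if task == "comparison":
        return "bar chart" if variable_type == "categorical" else "line chart"
    if task == "distribution":
        return "histogram" if n_variables == 1 else "box plot"
    return "table"  # fallback when no rule applies

print(recommend_visualization("relationship", 2, "numerical"))  # -> scatter plot
```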

Missing Features Reconstruction and Its Impact on Classification Accuracy

Year
2019
Published in
Computational Science – ICCS 2019. Cham: Springer, 2019. p. 207-220. vol. 11538. ISBN 978-3-030-22744-9.
Type
Conference paper
Abstract
In real-world applications, we can encounter situations where a well-trained model has to be used to predict from a damaged dataset. The damage caused by missing or corrupted values can occur either on the level of individual instances or on the level of entire features. Both situations have a negative impact on the usability of the model on such a dataset. This paper focuses on the scenario where entire features are missing, which can be understood as a specific case of transfer learning. Our aim is to experimentally investigate the influence of various imputation methods on the performance of several classification models. The imputation impact is studied for traditional methods such as k-NN, linear regression, and MICE, compared to modern imputation methods such as the multi-layer perceptron (MLP) and gradient boosted trees (XGBT). For linear regression, MLP, and XGBT we also propose two approaches to using them for multiple-feature imputation. The experiments were performed on both real-world and artificial datasets with continuous features where different numbers of features, varying from one feature to 50%, were missing. The results show that MICE and linear regression are generally good imputers regardless of the conditions. On the other hand, the performance of MLP and XGBT is strongly dataset dependent. Their performance is the best in some cases, but more often they perform worse than MICE or linear regression.
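
The following sketch illustrates the whole-feature scenario studied above under stated assumptions: scikit-learn's IterativeImputer with a linear-regression estimator stands in for MICE-style imputation, and the missing columns and dataset are illustrative; the paper's MLP/XGBT imputers and its two multiple-feature strategies are not reproduced.

```python
# Hedged sketch: entire feature columns are missing at prediction time and are
# reconstructed from the remaining features before the trained classifier is used.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

missing_cols = [0, 5]                                   # pretend these features were lost
X_damaged = X_test.copy()
X_damaged[:, missing_cols] = np.nan

imputer = IterativeImputer(estimator=LinearRegression(), random_state=0).fit(X_train)
X_reconstructed = imputer.transform(X_damaged)
print(accuracy_score(y_test, clf.predict(X_reconstructed)))
```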

Chameleon 2: An Improved Graph-Based Clustering Algorithm

Authors
Kordík, P.; Bartoň, T.; Brůna, T.
Year
2019
Published in
ACM Transactions on Knowledge Discovery from Data. 2019, 13(1), ISSN 1556-4681.
Type
Journal article
Abstract
Traditional clustering algorithms fail to produce human-like results when confronted with data of variable density, complex distributions, or in the presence of noise. We propose an improved graph-based clustering algorithm called Chameleon 2, which overcomes several drawbacks of state-of-the-art clustering approaches. We modified the internal cluster quality measure and added an extra step to ensure algorithm robustness. Our results reveal a significant positive impact on the clustering quality measured by Normalized Mutual Information on 32 artificial datasets used in the clustering literature. This significant improvement is also confirmed on real-world datasets. The performance of clustering algorithms such as DBSCAN is extremely parameter sensitive, and exhaustive manual parameter tuning is necessary to obtain a meaningful result. All hierarchical clustering methods are very sensitive to cutoff selection, and a human expert is often required to find the true cutoff for each clustering result. We present an automated cutoff selection method that enables the Chameleon 2 algorithm to generate high-quality clustering in autonomous mode.
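
A minimal sketch of the evaluation measure used above, Normalized Mutual Information between a clustering and the ground truth. Chameleon 2 itself is not packaged in scikit-learn, so DBSCAN stands in as the clustering algorithm; the dataset and parameters are illustrative assumptions.

```python
# Hedged sketch: score a clustering against ground-truth labels with NMI.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import normalized_mutual_info_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # parameter-sensitive, as noted above
print(normalized_mutual_info_score(y_true, labels))
```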

Comparing Offline and Online Evaluation Results of Recommender Systems

Authors
Kordík, P.; Řehořek, T.; Podsztavek, O.; Bíža, O.; Bartyzal, R.; Povalyev, I.P.
Year
2018
Published in
REVEAL RecSys 2018 workshop proceedings. New York: ACM, 2018.
Type
Conference paper
Abstract
Recommender systems are usually trained and evaluated on historical data. Offline evaluation is, however, tricky, and offline performance can be an inaccurate predictor of the online performance measured in production for several reasons. In this paper, we experiment with two offline evaluation strategies and show that even a reasonable and popular strategy can produce results that are not just biased, but also in direct conflict with the true performance obtained in the online evaluation. We investigate offline policy evaluation techniques adapted from reinforcement learning and explain why such techniques fail to produce an unbiased estimate of the online performance in the “watch next” scenario of a large-scale movie recommender system. Finally, we introduce a new evaluation technique based on the Jaccard Index and show that it correlates with the online performance.
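
The abstract does not spell out the exact Jaccard-based metric, so the sketch below shows only the generic set-overlap form applied to two top-k recommendation lists for the same user (for example, the evaluated model's list versus the list observed in production).

```python
# Hedged sketch: Jaccard index between two recommendation sets.
def jaccard(recommended_a, recommended_b):
    a, b = set(recommended_a), set(recommended_b)
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard(["movie_1", "movie_2", "movie_3"],
              ["movie_2", "movie_3", "movie_9"]))   # -> 0.5
```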

An Overview of Transfer Learning Focused on Asymmetric Heterogeneous Approaches

Year
2018
Published in
Data Management Technologies and Applications. Cham: Springer International Publishing, 2018. p. 3-26. vol. 814. ISSN 1865-0929. ISBN 978-3-319-94809-6.
Type
Conference paper
Abstract
In practice we often encounter classification tasks. In order to solve these tasks, we need a sufficient amount of quality data for the construction of an accurate classification model. However, in some cases, collecting quality data poses a demanding challenge in terms of time and finances. For example, in the medical area we often face a lack of data about patients. Transfer learning introduces the idea that a possible solution is to combine data from different domains, represented by different feature spaces, that relate to the same task. We can also transfer knowledge from a different but related task that has already been learned. This overview focuses on the current progress in the novel area of asymmetric heterogeneous transfer learning. We discuss approaches and methods for solving these types of transfer learning tasks. Furthermore, we mention the most commonly used metrics and the possibility of using metric or similarity learning.

Discovering predictive ensembles for transfer learning and meta-learning

Authors
Kordík, P.; Frýda, T.; Černý, J.
Year
2018
Published in
Machine Learning. 2018, 107(1), 177-207. ISSN 0885-6125.
Type
Journal article
Abstract
Recent meta-learning approaches are oriented towards algorithm selection, optimization, or recommendation of existing algorithms. In this article we show how data-tailored algorithms can be constructed from building blocks on small data sub-samples. Building blocks, typically weak learners, are optimized and evolved into data-tailored hierarchical ensembles. Well-performing algorithms discovered by the evolutionary algorithm can be reused on data sets of comparable complexity. Furthermore, these algorithms can be scaled up to model large data sets. We demonstrate how one particular template (a simple ensemble of fast sigmoidal regression models) outperforms state-of-the-art approaches on the Airline data set. Evolved hierarchical ensembles can therefore be beneficial as algorithmic building blocks in meta-learning, including meta-learning at scale.
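
To illustrate the building-block idea only, here is a crude sketch of a flat ensemble of small sigmoidal regressors fitted on bootstrap sub-samples and averaged; the evolved hierarchical ensembles in the paper are discovered by an evolutionary algorithm and are considerably richer, so this is an assumption-laden stand-in, not the paper's template.

```python
# Hedged sketch: bagged ensemble of tiny logistic-activation regressors.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_sigmoidal_ensemble(X, y, n_members=10, sample_frac=0.3, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=True)
        model = MLPRegressor(hidden_layer_sizes=(1,), activation="logistic",
                             max_iter=2000, random_state=0)
        members.append(model.fit(X[idx], y[idx]))
    return members

def predict_ensemble(members, X):
    return np.mean([m.predict(X) for m in members], axis=0)
```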

Asymmetric Heterogeneous Transfer Learning: A Survey

Year
2017
Published in
Proceedings of the 6th International Conference on Data Science, Technology and Applications. Porto: SciTePress - Science and Technology Publications, 2017. p. 17-27. vol. 1. ISBN 978-989-758-255-4.
Type
Conference paper
Abstract
One of the main prerequisites in most machine learning and data mining tasks is that all available data originates from the same domain. In practice, we often cannot meet this requirement due to poor data quality, unavailable data, or missing data attributes (a new task, e.g. the cold-start problem). A possible solution is to combine data from different domains, represented by different feature spaces, that relate to the same task. We can also transfer knowledge from a different but related task that has already been learned. Such a solution is called transfer learning, and it is very helpful in cases where collecting data is expensive, difficult, or impossible. This overview focuses on the current progress in a new and unique area of transfer learning: asymmetric heterogeneous transfer learning. This type of transfer learning considers the same task solved using data from different feature spaces. Through suitable mappings between these different feature spaces we can obtain more data for solving data mining tasks. We discuss approaches and methods for solving this type of transfer learning task. Furthermore, we mention the most commonly used metrics and the possibility of using metric or similarity learning.

Neural Turing Machine for Sequential Learning of Human Mobility Patterns

Authors
Kordík, P.; Tkačík, J.
Year
2016
Published in
2016 International Joint Conference on Neural Networks (IJCNN). San Francisco: American Institute of Physics and Magnetic Society of the IEEE, 2016. p. 2790-2797. ISSN 2161-4407. ISBN 978-1-5090-0620-5.
Type
Conference paper
Abstract
The capacity of recurrent neural networks to learn complex sequential patterns is improving. Recent developments such as the Clockwork RNN, Stack RNN, Memory Networks, and the Neural Turing Machine all aim to increase the long-term memory capacity of recurrent neural networks. In this study, we investigate the properties of the Neural Turing Machine, compare it with ensembles of Stack RNNs on artificial benchmarks, and apply it to learning human mobility patterns. We show that the Neural Turing Machine based predictor outperformed not only n-gram based prediction but also a neighborhood-based predictor that was designed to solve this particular problem. Our models will be deployed in an anti-drug police department to predict the mobility of suspects.
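
For reference, a minimal sketch of the n-gram baseline mentioned above: a first-order (bigram) Markov predictor of the next visited location. The Neural Turing Machine model itself is not reproduced, and the location identifiers are illustrative.

```python
# Hedged sketch: bigram next-location predictor over visit sequences.
from collections import Counter, defaultdict

def fit_bigram(trajectories):
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for current, nxt in zip(trajectory, trajectory[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(counts, current):
    return counts[current].most_common(1)[0][0] if counts[current] else None

model = fit_bigram([["home", "work", "home"], ["gym", "work", "home"]])
print(predict_next(model, "work"))  # -> "home"
```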

Supervised two-step feature extraction for structured representation of text data

Authors
Kordík, P.; Skrbek, M.; Háva, O.
Year
2013
Published in
Simulation Modelling Practice and Theory. 2013, 33(33), 132-143. ISSN 1569-190X.
Type
Journal article
Abstract
The training data matrix used for classifying text documents into multiple categories is characterized by a large number of dimensions, while the number of manually classified training documents is relatively small. Suitable dimensionality reduction techniques are therefore required to be able to develop the classifier. The article describes a two-step supervised feature extraction method that takes advantage of projections of terms into the document and category spaces. We propose several enhancements that make the method more efficient and faster than the version presented in our previous paper. We also introduce an adjustment score that enables correcting defective targets and helps identify improper training examples that bias the extracted features.
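
As a rough illustration of projecting terms into a category space and using the result as low-dimensional document features, consider the sketch below; it omits the document-space projection and the adjustment score, so it is only a crude stand-in for the two-step method, and the toy corpus is an assumption.

```python
# Hedged sketch: supervised term-to-category projection as document features.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap loans apply now", "meeting agenda attached",
        "win money now", "project status report"]
labels = np.array([1, 0, 1, 0])                     # 1 = spam, 0 = ham (illustrative)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()        # documents x terms

n_categories = 2
term_category = np.zeros((X.shape[1], n_categories))  # terms x categories
for c in range(n_categories):
    term_category[:, c] = X[labels == c].sum(axis=0)

X_reduced = X @ term_category                        # documents x categories
clf = LogisticRegression().fit(X_reduced, labels)
print(clf.predict(vectorizer.transform(["win cheap money"]).toarray() @ term_category))
```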

Contextual latent semantic networks used for document classification

Authors
Skrbek, M.; Kordík, P.; Háva, O.
Year
2012
Published in
Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. Porto: SciTePress - Science and Technology Publications, 2012. p. 425-430. ISBN 978-989-8565-29-7.
Type
Conference paper
Abstract
Widely used document classifiers are developed over a bag-of-words representation of documents. Latent semantic analysis based on singular value decomposition is often employed to reduce the dimensionality of such a representation. This approach overlooks word order in the text, which could improve the quality of the classifier. We propose a language-independent method that records the context of each word in a context network, utilizing the products of latent semantic analysis. The word-context networks are combined into one network that represents the document. A new document is classified based on the similarity between its network and the networks of the training documents. The experiments show that the proposed classifier achieves better performance than common classifiers, especially when the preceding dimensionality reduction is substantial.
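
The sketch below conveys the word-context idea only: each document is represented by a sliding-window co-occurrence matrix and a new document is classified by nearest-neighbour similarity to the training documents' matrices. The LSA step the method above applies before building the context networks is omitted, and the toy corpus and window size are assumptions.

```python
# Hedged sketch: context-window co-occurrence matrices as document representations.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def context_matrix(tokens, vocab, window=2):
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j and word in index and tokens[j] in index:
                matrix[index[word], index[tokens[j]]] += 1
    return matrix.ravel()

train_docs = [("cheap loans win money now", "spam"),
              ("meeting agenda project report", "ham")]
vocab = sorted({w for text, _ in train_docs for w in text.split()})
train_vecs = np.array([context_matrix(t.split(), vocab) for t, _ in train_docs])

new_vec = context_matrix("win cheap money now".split(), vocab).reshape(1, -1)
similarities = cosine_similarity(new_vec, train_vecs)[0]
print(train_docs[int(similarities.argmax())][1])     # -> "spam"
```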

Document Classification with Supervised Latent Feature Selection

Authors
Skrbek, M.; Kordík, P.; Háva, O.
Year
2012
Published in
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics. New York: ACM, 2012. p. 70-74. ISBN 978-1-4503-0915-8.
Type
Conference paper
Abstract
The classification of text documents generally deals with high-dimensional data. To favor the generality of the classification, a researcher has to apply a dimensionality reduction technique before building a classifier. We propose a classification and reduction algorithm that makes use of latent uncorrelated topics extracted from the training documents and their known categories. We suggest several latent feature selection options and evaluate them experimentally.
