doc. Ing. Pavel Kordík, Ph.D.

beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems

Authors

Vančura, V.; Kordík, P.; Straka, M.

Year

2024

Published

RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems. New York: ACM, 2024. p. 1102-1107. ISBN 979-8-4007-0505-2.

Type

Proceedings paper

DOI

10.1145/3640457.3691707

Departments

Department of Applied Mathematics

Annotation

Recommender systems often use text-side information to improve their predictions, especially in cold-start or zero-shot recommendation scenarios, where traditional collaborative filtering approaches cannot be used. Many approaches to text-mining side information for recommender systems have been proposed over recent years, with sentence Transformers being the most prominent one. However, these models are trained to predict semantic similarity without utilizing interaction data with hidden patterns specific to recommender systems. In this paper, we propose beeFormer, a framework for training sentence Transformer models with interaction data. We demonstrate that our models trained with beeFormer can transfer knowledge between datasets while outperforming not only semantic similarity sentence Transformers but also traditional collaborative filtering methods. We also show that training on multiple datasets from different domains accumulates knowledge in a single model, unlocking the possibility of training universal, domain-agnostic sentence Transformer models to mine text representations for recommender systems. We release the source code, trained models, and additional details allowing replication of our experiments at https://github.com/recombee/beeformer.

Bridging Offline-Online Evaluation with a Time-dependent and Popularity Bias-free Offline Metric for Recommenders

Authors

Kasalický, P.; da Silva Alves, R.; Kordík, P.

Year

2023

Published

Proceedings of EvalRS: A Rounded Evaluation Of Recommender Systems 2023. CEUR-WS.org, 2023. ISSN 1613-0073.

Type

Proceedings paper

Departments

Department of Applied Mathematics

Annotation

The evaluation of recommendation systems is a complex task. The offline and online evaluation metrics for recommender systems are ambiguous in their true objectives. The majority of recently published papers benchmark their methods using an ill-posed offline evaluation methodology that often fails to predict true online performance. Because of this, the impact that academic research has on the industry is reduced. The aim of our research is to investigate and compare the online performance of offline evaluation metrics. We show that penalizing popular items and considering the time of transactions during the evaluation significantly improves our ability to choose the best recommendation model for a live recommender system. Our results, averaged over five large real-world live datasets procured from recommenders, aim to help the academic community better understand offline evaluation and optimization criteria that are more relevant for real applications of recommender systems.

Improving recommendation diversity and serendipity with an ontology-based algorithm for cold start environments

Authors

Kuznetsov, S.; Kordík, P.

Year

2023

Published

International Journal of Data Science and Analytics. 2023, 2023(1), ISSN 2364-415X.

Type

Article

DOI

10.1007/s41060-023-00418-4

Departments

Department of Software Engineering
Department of Applied Mathematics

Annotation

Every real-life environment where users interact with items (products, films, research expert profiles) has several development phases. In the cold-start phase, there are almost no interactions between users and items, so content-based recommendation systems (RS) can only recommend based on matching the attributes of the items. In the transition state, items start to collect user interactions, but a significant number of items still have too few interactions, and the RS does not allow users to discover cold items. In a regular state, where most of the items in the system have enough interactions, the recommendations often suffer from low diversity of the items within a single recommendation. This article proposes a general recommendation algorithm based on ontological similarity, which is designed to address all the above problems. Our experiments show that recommendations generated by our approach are consistently better in all environment development phases and increase the success rate of recommendations by almost 50% measured using ontology-aware recall, which is also introduced in this article.

Overcoming the Cold-Start Problem in Recommendation Systems with Ontologies and Knowledge Graphs

Authors

Kuznetsov, S.; Kordík, P.

Year

2023

Published

New Trends in Database and Information Systems. Cham: Springer, 2023. p. 591-603. Communications in Computer and Information Science. vol. 1850. ISSN 1865-0929. ISBN 978-3-031-42940-8.

Type

Proceedings paper

DOI

10.1007/978-3-031-42941-5_52

Departments

Department of Software Engineering
Department of Applied Mathematics

Annotation

Many recommendation systems struggle with the cold-start problem, especially in the early stages of a new application, when there are few active users and limited interactions. Traditional approaches like Collaborative Filtering cannot provide enough recommendations, and text-based methods, while helpful, do not offer sufficient context. This paper argues against the idea that the cold-start phase will eventually disappear and proposes a solution to enhance recommendation performance from the start. We propose using Ontologies and Knowledge Graphs to add a semantic layer to text-based methods and improve the recommendation performance in cold-start scenarios. Our approach uses ontologies to generate a knowledge graph that captures item text attributes’ implicit and explicit characteristics, extending the item profile with similar semantic keywords. We evaluate our method against state-of-the-art text feature extraction techniques and present the results of our experiments.

Personalised Recommendations and Profile Based Re-ranking Improve Distribution of Student Opportunities

Authors

Žid, Č.; Kordík, P.; Kuznetsov, S.

Year

2023

Published

International Joint Conference 16th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2023) 14th International Conference on EUropean Transnational Education (ICEUTE 2023). Cham: Springer, 2023. p. 217-227. Lecture Notes in Networks and Systems. vol. 748 LNNS. ISSN 2367-3370. ISBN 978-3-031-42518-9.

Type

Proceedings paper

DOI

10.1007/978-3-031-42519-6_21

Departments

Department of Software Engineering
Department of Applied Mathematics

Annotation

Modern technical universities help students get practical experience. They educate thousands of students and it is hard for them to connect individual students with relevant industry experts and opportunities. This article aims to solve this problem by designing a matchmaking procedure powered by a recommendation system, an ontology, and knowledge graphs. We suggest improving recommendations and reducing the cold-start problem with a re-ranking module based on student educational profiles for students who opt-in. Each student profile is represented as a knowledge graph derived from the successfully completed courses of the individual. The system was tested in an online experiment and demonstrated that recommendations based on student educational profiles and their interaction history significantly improve conversion rates over non-personalised offers.

Adapting the Size of Artificial Neural Networks Using Dynamic Auto-Sizing

Authors

Cahlík, V.; Kordík, P.; Čepek, M.

Year

2022

Published

IEEE 17th International Conference on Computer Science and Information Technologies. Dortmund: IEEE, 2022. p. 592-596. ISBN 979-8-3503-3431-9.

Type

Proceedings paper

DOI

10.1109/CSIT56902.2022.10000471

Departments

Department of Applied Mathematics

Annotation

We introduce dynamic auto-sizing, a novel approach to training artificial neural networks which allows the models to automatically adapt their size to the problem domain. The size of the models can be further controlled during the learning process by modifying the applied strength of regularization. The ability of dynamic auto-sizing models to expand or shrink their hidden layers is achieved by periodically growing and pruning entire units such as neurons or filters. For this purpose, we introduce weighted L1 regularization, a novel regularization method for inducing structured sparsity. Besides analyzing the behavior of dynamic auto-sizing, we evaluate predictive performance of models trained using the method and show that such models can provide a predictive advantage over traditional approaches.

Discriminant Analysis on a Stream of Features

Authors

Motl, J.; Kordík, P.

Year

2022

Published

Engineering Applications of Neural Networks. Springer, Cham, 2022. p. 223-234. ISSN 1865-0929. ISBN 978-3-031-08222-1.

Type

Proceedings paper

Departments

Department of Applied Mathematics

Annotation

Online learning is a well-established problem in machine learning. But while online learning is commonly concerned with learning on a stream of samples, this article is concerned with learning on a stream of features. An online quadratic discriminant analysis (QDA) is proposed because it is fast, capable of modeling feature interactions, and can still return an exact solution. When a new feature is inserted into a training set, the proposed implementation of QDA showed a 1000-fold speed-up over scikit-learn QDA. Fast learning on a stream of features provides a data scientist with timely feedback about the importance of new features during the feature engineering phase. In the production phase, it reduces the cost of updating a model when a new source of potentially useful features appears.

Learning to Optimize with Dynamic Mode Decomposition

Authors

Šimánek, P.; Vašata, D.; Kordík, P.

Year

2022

Published

2022 International Joint Conference on Neural Networks (IJCNN). Vienna: IEEE Industrial Electronic Society, 2022. p. 1-8. ISSN 2161-4407. ISBN 978-1-7281-8671-9.

Type

Proceedings paper

DOI

10.1109/IJCNN55064.2022.9892364

Departments

Department of Applied Mathematics

Annotation

Designing faster optimization algorithms is of ever-growing interest. In recent years, learning-to-learn methods that learn how to optimize have demonstrated very encouraging results. Current approaches usually do not effectively include the dynamics of the optimization process during training. They either omit it entirely or only implicitly assume the dynamics of an isolated parameter. In this paper, we show how to utilize the dynamic mode decomposition method for extracting informative features about optimization dynamics. By employing those features, we show that our learned optimizer generalizes much better to unseen optimization problems in short. The improved generalization is illustrated on multiple tasks where training the optimizer on one neural network generalizes to different architectures and distinct datasets.

Offline evaluation of the serendipity in recommendation systems

Authors

Pastukhov, D.; Kuznetsov, S.; Vančura, V.; Kordík, P.

Year

2022

Published

IEEE 17th International Conference on Computer Science and Information Technologies. Dortmund: IEEE, 2022. p. 597-601. ISBN 979-8-3503-3431-9.

Type

Proceedings paper

DOI

10.1109/CSIT56902.2022.10000782

Departments

Department of Software Engineering
Department of Applied Mathematics
Faculty of Information Technology

Annotation

Offline optimization of recommender systems is a difficult task. Popular optimization criteria such as RMSE, Recall, and NDCG do not correlate much with online performance, especially when the recommendation algorithm is largely different from the one used to generate the offline data. An exciting direction of research to mitigate this problem is to use more robust optimization criteria. Serendipity is reported to be a promising proxy. However, more variants exist, and it is unclear whether they can be used as a single criterion to optimize. This paper analyzes how serendipity relates to other optimization criteria for three different recommendation algorithms. Based on our findings, we propose to modify the way serendipity is computed. We conduct experiments using three collaborative filtering algorithms: K-Nearest Neighbors, Matrix Factorization, and Embarrassingly Shallow Autoencoder (EASE). We also employ and evaluate the ensemble learning approach and analyze the importance of the individual components of serendipity.

RepSys: Framework for Interactive Evaluation of Recommender Systems

Authors

Šafařík, J.; Vančura, V.; Kordík, P.

Year

2022

Published

RecSys '22: Proceedings of the 16th ACM Conference on Recommender Systems. New York: Association for Computing Machinery, 2022. p. 636-639. ISBN 978-1-4503-9278-5.

Type

Proceedings paper

DOI

10.1145/3523227.3551469

Departments

Department of Applied Mathematics

Annotation

Making recommender systems more transparent and auditable is crucial for the future adoption of these systems. Available tools typically present mostly errors of models aggregated over all test users, which is often insufficient to uncover hidden biases and problems. Moreover, the emphasis is primarily on the accuracy of recommendations but less on other important metrics, such as the diversity of recommended items, the extent of catalog coverage, or the opportunity to discover novel items at bestsellers’ expense. In this work, we propose RepSys, a framework for evaluating recommender systems. Our work offers a set of highly interactive approaches for investigating various scenario recommendations, analyzing a dataset, and evaluating distributions of various metrics that combine visualization techniques with existing offline evaluation methods. RepSys framework is available under an open-source license to other researchers.

Scalable Linear Shallow Autoencoder for Collaborative Filtering

Authors

Vančura, V.; da Silva Alves, R.; Kasalický, P.; Kordík, P.

Year

2022

Published

RecSys '22: Proceedings of the 16th ACM Conference on Recommender Systems. New York: Association for Computing Machinery, 2022. p. 604-609. ISBN 978-1-4503-9278-5.

Type

Proceedings paper

DOI

10.1145/3523227.3551482

Departments

Department of Applied Mathematics

Annotation

Recently, the RS research community has witnessed a surge in popularity for shallow autoencoder-based CF methods. Due to its straightforward implementation and high accuracy on item retrieval metrics, EASE is potentially the most prominent of these models. Despite its accuracy and simplicity, EASE cannot be employed in some real-world recommender system applications due to its inability to scale to huge interaction matrices. In this paper, we propose ELSA, a scalable shallow autoencoder method for implicit feedback recommenders. ELSA is a scalable autoencoder in which the hidden layer is factorizable into a low-rank plus sparse structure, thereby drastically lowering memory consumption and computation time. We conduct comprehensive offline experiments that combine synthetic and several real-world datasets. We also validate our strategy in an online setting by comparing ELSA to baselines in a live recommender system using an A/B test. Experiments demonstrate that ELSA is scalable and has competitive performance. Finally, we demonstrate the explainability of ELSA by illustrating the recovered latent space.

Trend and Seasonality Elimination from Relational Data

Authors

Motl, J.; Kordík, P.

Year

2022

Published

Engineering Applications of Neural Networks. Springer, Cham, 2022. p. 260-268. ISSN 1865-0929. ISBN 978-3-031-08222-1.

Type

Proceedings paper

Departments

Department of Applied Mathematics

Annotation

Detrending and deseasoning is a common preprocessing step in time-series analysis. We argue that the same preprocessing step should be considered on relational data whenever the observations are time-dependent. We applied Hierarchical Generalized Additive Models (HGAMs) to detrend and deseason (D&D) 18 real-world relational datasets. The observed positive effect of D&D on the predictive accuracy is statistically significant. The proposed method of D&D might be used to improve the predictive accuracy of churn, default, or propensity models, among others.

Aggregate Function Generalization to Temporal Data

Authors

Motl, J.; Kordík, P.

Year

2021

Published

2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI). Los Alamitos: IEEE Computer Society, 2021. p. 614-618. ISSN 2375-0197. ISBN 978-1-6654-0898-1.

Type

Proceedings paper

DOI

10.1109/ICTAI52525.2021.00098

Departments

Department of Applied Mathematics

Annotation

In this article, we define an approximate generalization of aggregate functions for relational data with temporal attributes. This generalization is parametrized to allow simulation of a range of common aggregate functions and optionally take into account time. The parameters are not optimized, but we rather rely on repeated stochastic sampling of the parameters. We then apply a common regularized linear model to train a model on this high-dimensional space. Experimental results on 11 datasets suggest that there are datasets where incorporating time dimension into the model leads to an improvement in the predictive accuracy of the trained models.

Deep Variational Autoencoder with Shallow Parallel Path for Top-N Recommendation (VASP)

Authors

Vančura, V.; Kordík, P.

Year

2021

Published

Artificial Neural Networks and Machine Learning – ICANN 2021. Cham: Springer, 2021. p. 138-149. V. vol. 12895. ISSN 1611-3349. ISBN 978-3-030-86383-8.

Type

Proceedings paper

DOI

10.1007/978-3-030-86383-8_11

Departments

Department of Applied Mathematics

Annotation

The recently introduced Embarrassingly Shallow Autoencoder (EASE) algorithm presents a simple and elegant way to solve the top-N recommendation task. In this paper, we introduce Neural EASE to further improve the performance of this algorithm by incorporating techniques for training modern neural networks. Also, there is a growing interest in the recsys community to utilize variational autoencoders (VAE) for this task. We introduce Focal Loss Variational AutoEncoder (FLVAE), benefiting from multiple non-linear layers without an information bottleneck while not overfitting towards the identity. We show how to learn FLVAE in parallel with Neural EASE and achieve state-of-the-art performance on the MovieLens 20M dataset and competitive results on the Netflix Prize dataset.

Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks

Authors

Kovalenko, A.; Kordík, P.; Friedjungová, M.

Year

2021

Published

Artificial Neural Networks and Machine Learning – ICANN 2021. Cham: Springer, 2021. p. 235-247. 1. vol. 12892. ISSN 1611-3349. ISBN 978-3-030-86339-5.

Type

Proceedings paper

DOI

10.1007/978-3-030-86340-1_19

Departments

Department of Applied Mathematics

Annotation

Small neural networks with a constrained number of trainable parameters can be suitable resource-efficient candidates for many simple tasks where excessively large models are now used. However, such models face several problems during the learning process, mainly due to the redundancy of the individual neurons, which results in sub-optimal accuracy or the need for additional training steps. Here, we explore the diversity of the neurons within the hidden layer during the learning process and analyze how the diversity of the neurons affects the predictions of the model. Subsequently, we introduce several techniques to dynamically reinforce diversity between neurons during training. These decorrelation techniques improve learning at early stages and occasionally help to overcome local minima faster. Additionally, we describe a novel weight initialization method to obtain decorrelated, yet stochastic weight initialization for fast and efficient neural network training. Decorrelated weight initialization in our case shows about 40% relative increase in test accuracy during the first 5 epochs.

Novel data mining-based age-at-death estimation model using adult pubic symphysis 3D scans

Authors

Buk, Z.; Štepanovský, M.; Kotěrová, A.; Velemínská, J.; Brůžek, J.; Kordík, P.

Year

2021

Published

Proceedings of the 21st Conference Information Technologies – Applications and Theory (ITAT 2021). Aachen: CEUR Workshop Proceedings, 2021. p. 46-52. vol. 2962. ISSN 1613-0073.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science
Department of Computer Systems
Department of Applied Mathematics

Annotation

The paper introduces a novel age-at-death estimation model based on a Convolutional Neural Network (CNN). The model uses a 3D scan of the human pubic symphysis as an input and estimates the age-at-death of the individual as an output. The Mean Absolute Error (MAE) of this model is about 10.6 years for individuals between 18 and 92 years of age-at-death. Moreover, the results of the study indicate that the pubic symphysis can be used to estimate the age of individuals across the entire age range. The study involved a sample of 483 bone scans collected from 374 individuals (of whom 109 provided both left and right pubic symphyses).

Stratified Cross-Validation on Multiple Columns

Authors

Motl, J.; Kordík, P.

Year

2021

Published

2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI). Los Alamitos: IEEE Computer Society, 2021. p. 26-31. ISSN 1082-3409. ISBN 978-1-6654-0899-8.

Type

Proceedings paper

DOI

10.1109/ICTAI52525.2021.00012

Departments

Department of Applied Mathematics

Annotation

Stratified cross-validation is one of the standard methods for evaluating a classifier’s generalization accuracy. However, conventional implementations of cross-validation can stratify only by a single column. In this paper, we propose to utilize Integer Linear Programming in order to enable stratification by multiple columns. Our experiments using an extensive set of multi-label data sets show that the proposed method significantly outperforms non-stratified cross-validation.

Transfer learning based few-shot classification using optimal transport mapping from preprocessed latent space of backbone neural network

Authors

Chobola, T.; Vašata, D.; Kordík, P.

Year

2021

Published

AAAI Workshop on Meta-Learning and MetaDL Challenge. Proceedings of Machine Learning Research, 2021. p. 29-37. vol. 140. ISSN 2640-3498.

Type

Proceedings paper

Departments

Department of Applied Mathematics

Annotation

The MetaDL Challenge 2020 focused on image classification tasks in few-shot settings. This paper describes the second-best submission in the competition. Our meta-learning approach modifies the distribution of classes in a latent space produced by a backbone network for each class in order to better follow the Gaussian distribution. After this operation, which we call the Latent Space Transform algorithm, centers of classes are further aligned in an iterative fashion of the Expectation Maximisation algorithm to utilize information in unlabeled data that are often provided on top of a few labelled instances. For this task, we utilize optimal transport mapping using the Sinkhorn algorithm. Our experiments show that this approach outperforms previous works as well as other variants of the algorithm, using the K-Nearest Neighbour algorithm, Gaussian Mixture Models, etc.

Chameleon 2: An Improved Graph-Based Clustering Algorithm

Authors

Bartoň, T.; Brůna, T.; Kordík, P.

Year

2019

Published

ACM Transactions on Knowledge Discovery from Data. 2019, 13(1), ISSN 1556-4681.

Type

Article

DOI

10.1145/3299876

Departments

Department of Applied Mathematics
Faculty of Information Technology

Annotation

Traditional clustering algorithms fail to produce human-like results when confronted with data of variable density, complex distributions, or in the presence of noise. We propose an improved graph-based clustering algorithm called Chameleon 2, which overcomes several drawbacks of state-of-the-art clustering approaches. We modified the internal cluster quality measure and added an extra step to ensure algorithm robustness. Our results reveal a significant positive impact on the clustering quality measured by Normalized Mutual Information on 32 artificial datasets used in the clustering literature. This significant improvement is also confirmed on real-world datasets. The performance of clustering algorithms such as DBSCAN is extremely parameter sensitive, and exhaustive manual parameter tuning is necessary to obtain a meaningful result. All hierarchical clustering methods are very sensitive to cutoff selection, and a human expert is often required to find the true cutoff for each clustering result. We present an automated cutoff selection method that enables the Chameleon 2 algorithm to generate high-quality clustering in autonomous mode.

Learning on a Stream of Features with Random Forest

Authors

Motl, J.; Kordík, P.

Year

2019

Published

Proceedings of the 19th Conference Information Technologies - Applications and Theory (ITAT 2019). Aachen: CEUR Workshop Proceedings, 2019. p. 79-83. vol. 2473. ISSN 1613-0073.

Type

Proceedings paper

Departments

Department of Applied Mathematics

Annotation

We study an interesting and challenging problem, supervised learning on a stream of features, in which the size of the feature set is unknown, and not all features are available for learning while leaving the number of observations constant. In this problem, the features arrive one at a time, and the learner’s task is to train a model equivalent to a model trained from "scratch". When a new feature is inserted into the training set, a new set of trees is trained and added into the current forest. However, it is desirable to correct the selection bias: older features have more opportunities to get selected into trees than the new features. We combat the selection bias by adjusting the feature selection distribution. However, while this correction works well, it may require training of many new trees. In order to keep the count of the new trees small, we furthermore put more weight on more recent trees than on the old trees.

Comparing Offline and Online Evaluation Results of Recommender Systems

Authors

Kordík, P.; Řehořek, T.; Bíža, O.; Bartyzal, R.; Podsztavek, O.; Povalyev, I.P.

Year

2018

Published

REVEAL RecSys 2018 workshop proceedings. New York: ACM, 2018.

Type

Proceedings paper

Departments

Department of Applied Mathematics
Faculty of Information Technology

Annotation

Recommender systems are usually trained and evaluated on historical data. Offline evaluation is, however, tricky and offline performance can be an inaccurate predictor of the online performance measured in production due to several reasons. In this paper, we experiment with two offline evaluation strategies and show that even a reasonable and popular strategy can produce results that are not just biased, but also in direct conflict with the true performance obtained in the online evaluation. We investigate offline policy evaluation techniques adapted from reinforcement learning and explain why such techniques fail to produce an unbiased estimate of the online performance in the “watch next” scenario of a large-scale movie recommender system. Finally, we introduce a new evaluation technique based on Jaccard Index and show that it correlates with the online performance.

Discovering predictive ensembles for transfer learning and meta-learning

Authors

Kordík, P.; Frýda, T.; Černý, J.

Year

2018

Published

Machine Learning. 2018, 107(1), 177-207. ISSN 0885-6125.

Type

Article

DOI

10.1007/s10994-017-5682-0

Departments

Department of Theoretical Computer Science
Department of Applied Mathematics

Annotation

Recent meta-learning approaches are oriented towards algorithm selection, optimization or recommendation of existing algorithms. In this article we show how data-tailored algorithms can be constructed from building blocks on small data sub-samples. Building blocks, typically weak learners, are optimized and evolved into data-tailored hierarchical ensembles. Good-performing algorithms discovered by an evolutionary algorithm can be reused on data sets of comparable complexity. Furthermore, these algorithms can be scaled up to model large data sets. We demonstrate how one particular template (simple ensemble of fast sigmoidal regression models) outperforms state-of-the-art approaches on the Airline data set. Evolved hierarchical ensembles can therefore be beneficial as algorithmic building blocks in meta-learning, including meta-learning at scale.

Do We Need to Observe Features to Perform Feature Selection?

Authors

Motl, J.; Kordík, P.

Year

2018

Published

Proceedings of the 18th Conference Information Technologies - Applications and Theory (ITAT 2018). Aachen: CEUR Workshop Proceedings, 2018. p. 44-51. vol. 2203. ISSN 1613-0073. ISBN 9781727267198.

Type

Proceedings paper

Departments

Department of Applied Mathematics

Annotation

Many feature selection methods have been developed in the past, but at their core, they all work the same way: you pass a set of features to the algorithm and get a reduced set of the features. But can we perform a non-trivial feature selection without first observing the features? This is an important question because if we were actually able to predict feature importance before observing the features, we would reduce the computation requirements of all stages of the machine learning process, beginning with feature engineering. In this article, we argue that it is possible to predict feature importance before feature vector observation. The trick is that we use meta-features about the features to perform the feature selection. We evaluate the concept on 15 relational databases. On average, it was enough to generate the top decile of all features to get the same model accuracy as if we generated all features and passed them to the model.

On Scalability of Predictive Ensembles and Tradeoff Between Their Training Time and Accuracy

Authors

Kordík, P.; Frýda, T.

Year

2018

Published

Advances in Intelligent Systems and Computing II. Springer, Cham, 2018. p. 257-269. Advances in Intelligent Systems and Computing. vol. 689. ISSN 2194-5357. ISBN 978-3-319-70580-4.

Type

Proceedings paper

DOI

10.1007/978-3-319-70581-1_18

Departments

Department of Theoretical Computer Science

Annotation

Scalability of predictive models is often realized by data subsampling. The generalization performance of models is not the only criterion one should take into account in the algorithm selection stage. For many real-world applications, predictive models have to be scalable and their training time should be in balance with their performance. For many tasks it is reasonable to save computational resources and select an algorithm with slightly lower performance and significantly lower training time. In this contribution we made extensive benchmarks of the scalability of predictive algorithms and examined how well they can trade accuracy for lower training time. We demonstrate how one particular template (simple ensemble of fast sigmoidal regression models) outperforms state-of-the-art approaches on the Airline data set.

Violation of Independence of Irrelevant Alternatives in Friedman’s test

Authors

Motl, J.; Kordík, P.

Year

2018

Published

Proceedings of the 18th Conference Information Technologies - Applications and Theory (ITAT 2018). Aachen: CEUR Workshop Proceedings, 2018. p. 59-63. vol. 2203. ISSN 1613-0073. ISBN 9781727267198.

Type

Proceedings paper

Departments

Department of Applied Mathematics

Annotation

One of the most common methods for classifier comparison is Friedman’s test. However, Friedman’s test has a known flaw — ranking of classifiers A and B does not depend only on the properties of classifiers A and B, but also on the properties of all other evaluated classifiers. We illustrate the issue on a question: “What is better, bagging or boosting?”. With Friedman’s test, the answer depends on the presence/absence of irrelevant classifiers in the experiment. Based on the application of Friedman’s test on an experiment with 179 classifiers and 121 datasets we conclude that it is very easy to game the ranking of two insignificantly different classifiers. But once the difference becomes significant, it is unlikely that by removing irrelevant classifiers we obtain significant results with reversed ranking.
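The dependence on irrelevant classifiers can be seen directly in the mean ranks that Friedman's test is built on. The following is a minimal sketch, not the experiment from the paper: the accuracies and the classifier names A, B, C, and D are invented for illustration, and only the mean ranks are computed (not the test statistic itself).

```python
def mean_ranks(scores):
    """Return the mean rank of each classifier (1 = best, higher accuracy wins).

    scores maps a classifier name to its list of accuracies, one per dataset.
    Ties share the average of the ranks they occupy.
    """
    names = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {name: 0.0 for name in names}
    for d in range(n_datasets):
        for name in names:
            others = [o for o in names if o != name]
            better = sum(scores[o][d] > scores[name][d] for o in others)
            ties = sum(scores[o][d] == scores[name][d] for o in others)
            totals[name] += 1 + better + 0.5 * ties
    return {name: totals[name] / n_datasets for name in names}


# Hypothetical accuracies on three datasets (illustrative numbers only).
pair = {"A": [0.9, 0.9, 0.6], "B": [0.8, 0.8, 0.9]}
# C and D never beat both A and B, yet they sit between them on dataset 3.
extended = dict(pair, C=[0.5, 0.5, 0.8], D=[0.4, 0.4, 0.7])

ranks_pair = mean_ranks(pair)
ranks_extended = mean_ranks(extended)
print(ranks_pair)
print(ranks_extended)
```

With only A and B in the experiment, A has the better (lower) mean rank. Once C and D are added, A drops to rank 4 on the third dataset and B obtains the better mean rank, although the scores of A and B did not change.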

Foreign Key Constraint Identification in Relational Databases

Authors

Motl, J.; Kordík, P.

Year

2017

Published

ITAT 2017: Information Technologies – Applications and Theory. Aachen: CEUR Workshop Proceedings, 2017. p. 106-111. vol. 1885. ISSN 1613-0073.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

For relational learning, it is important to know the relationships between the tables. In relational databases, the relationships can be described with foreign key constraints. However, the foreign keys may not be explicitly specified. In this article, we show how to identify primary and foreign key constraints automatically and quickly from metadata about the data. Our method was evaluated on 72 databases and achieves an F-measure of 0.87 for foreign key constraint identification. The proposed method runs significantly faster than related methods reported in the literature and is database-vendor agnostic.

Scalability of predictive ensembles

Authors

Kordík, P.; Frýda, T.; Šnorek, M.; Cepek, M.

Year

2017

Published

2017 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT). IEEE (Institute of Electrical and Electronics Engineers), 2017. p. 555-560. ISBN 978-1-5386-1639-0.

Type

Proceedings paper

DOI

10.1109/STC-CSIT.2017.8098848

Departments

Department of Software Engineering
Department of Applied Mathematics

Annotation

Recent meta-learning approaches are oriented towards algorithm selection, optimization or recommendation of existing algorithms. In this paper we show how inductive algorithms constructed from building blocks on a small data subsample can be scaled up to model large data sets. We demonstrate how one particular template (a simple ensemble of fast sigmoidal regression models) outperforms state-of-the-art approaches on the Airline data set.

MoCham: Robust Hierarchical Clustering Based on Multi-objective Optimization

Authors

Bartoň, T.; Brůna, T.; Kordík, P.

Year

2016

Published

2016 IEEE 16th International Conference on Data Mining Workshops. USA: IEEE Computer Society, 2016. p. 831-838. 16. vol. 1. ISSN 2375-9259. ISBN 978-1-5090-5910-2.

Type

Proceedings paper

DOI

10.1109/ICDMW.2016.0123

Departments

Department of Theoretical Computer Science

Annotation

Many clustering evaluation methods are computed as a ratio between two objectives; typically, these objectives express the compactness of all clusters and the separation between individual clusters. However, the clustering process itself is typically implemented as a single-objective problem: optimizing a linear combination of between-points closeness. We propose MoCham, a hierarchical clustering algorithm that uses multi-objective optimization to find the best data points to merge. Our results suggest that a careful candidate selection of Pareto dominating pairs leads to more robust clustering results.
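The idea of selecting merge candidates by Pareto dominance can be sketched as follows. This is an illustration only, not the MoCham implementation: the two objective values per candidate pair are hypothetical, and both are assumed to be minimized.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b in every objective
    and strictly better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates.

    candidates maps a candidate merge (a pair of cluster ids) to its
    tuple of objective values.
    """
    return {
        pair: objectives
        for pair, objectives in candidates.items()
        if not any(
            dominates(other, objectives)
            for other_pair, other in candidates.items()
            if other_pair != pair
        )
    }


# Hypothetical candidate merges with two invented objective values each.
candidates = {
    ("c1", "c2"): (0.2, 0.9),
    ("c1", "c3"): (0.5, 0.4),
    ("c2", "c3"): (0.6, 0.95),  # worse than ("c1", "c2") in both objectives
    ("c3", "c4"): (0.9, 0.1),
}
front = pareto_front(candidates)
print(sorted(front))
```

Only the pair ("c2", "c3") is dropped here, because another candidate is better in both objectives; the remaining pairs represent different trade-offs that a single weighted criterion would collapse into one number.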

Neural Turing Machine for Sequential Learning of Human Mobility Patterns

Authors

Kordík, P.; Tkačík, J.

Year

2016

Published

2016 International Joint Conference on Neural Networks (IJCNN). San Francisco: American Institute of Physics and Magnetic Society of the IEEE, 2016. p. 2790-2797. ISSN 2161-4407. ISBN 978-1-5090-0620-5.

Type

Proceedings paper

DOI

10.1109/IJCNN.2016.7727551

Departments

Department of Theoretical Computer Science

Annotation

The capacity of recurrent neural networks to learn complex sequential patterns is improving. Recent developments such as Clockwork RNN, Stack RNN, Memory networks and Neural Turing Machine all aim to increase the long-term memory capacity of recurrent neural networks. In this study, we investigate properties of the Neural Turing Machine, compare it with ensembles of Stack RNN on artificial benchmarks, and apply it to learn human mobility patterns. We show that a predictor based on the Neural Turing Machine outperformed not only n-gram based prediction, but also a neighborhood-based predictor that was designed to solve this particular problem. Our models will be deployed in an anti-drug police department to predict the mobility of suspects.

Reducing Cold Start Problems in Educational Recommender Systems

Authors

Kuznetsov, S.; Kordík, P.; Řehořek, T.; Kroha, P.; Dvořák, J.

Year

2016

Published

2016 International Joint Conference on Neural Networks (IJCNN). San Francisco: American Institute of Physics and Magnetic Society of the IEEE, 2016. p. 3143-3149. ISSN 2161-4407. ISBN 978-1-5090-0620-5.

Type

Proceedings paper

DOI

10.1109/IJCNN.2016.7727600

Departments

Department of Theoretical Computer Science
Department of Software Engineering

Annotation

Educational data can help us to personalise university information systems. In this paper, we show how educational data can be used to improve the performance of interaction-based recommender systems. Educational data is transformed to student profiles helping to prevent cold start problems when recommending projects to students with few user interactions. Our results show that our hybrid interaction based recommender boosted by educational profiles significantly outperforms bestseller recommendation, which is a mainstream recommendation method for cold start users.

Evaluation of relative indexes for multi-objective clustering

Authors

Bartoň, T.; Kordík, P.

Year

2015

Published

Hybrid Artificial Intelligent Systems. Cham: Springer International Publishing, 2015. p. 465-476. Lecture Notes in Artificial Intelligence. ISSN 0302-9743. ISBN 978-3-319-19643-5.

Type

Proceedings paper

DOI

10.1007/978-3-319-19644-2_39

Departments

Department of Theoretical Computer Science

Annotation

One of the biggest challenges in clustering is finding a robust and versatile criterion to evaluate the quality of clustering results. In this paper, we investigate the extent to which unsupervised criteria can be used to obtain clusters highly correlated to external labels. We show that the usefulness of these criteria is data-dependent and for most data sets multiple criteria are required in order to identify the best performing clustering algorithm. We present a multi-objective evolutionary clustering algorithm capable of finding a set of high-quality solutions. For the real world data sets examined the Pareto front can offer better clusterings than simply optimizing a single unsupervised criterion.

Mining skills from educational data for project recommendations

Authors

Kuznetsov, S.; Kordík, P.

Year

2015

Published

Proceedings of the International Joint Conference CISIS’15 and ICEUTE’15. Berlin: Springer-Verlag, 2015, pp. 617-627. Advances in Intelligent Systems and Computing. ISSN 2194-5357. ISBN 978-3-319-19713-5.

Type

Proceedings paper

DOI

10.1007/978-3-319-19713-5_54

Departments

Department of Theoretical Computer Science

Annotation

We focus on how to recognize the skills of students based on their educational results. Existing approaches do not offer suitable solutions. This paper introduces algorithms that make it possible to aggregate educational results using an ontology. We map the aggregated results, using various methods, to skills that are understandable for external partners and usable to recommend students for projects and projects for students. We compare the results of individual algorithms with subjective assessments of students, and we apply a recommendation algorithm that closely models these skills.

Using multi-objective optimization for the selection of ensemble members

Authors

Bartoň, T.; Kordík, P.

Year

2015

Published

ITAT 2015 conference proceedings. Aachen: CEUR Workshop Proceedings, 2015. pp. 108-114. ISSN 1613-0073. ISBN 978-1-5151-2065-0.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

In this paper we propose a clustering process which uses a multi-objective evolution to select a set of diverse clusterings. The selected clusterings are then combined using a consensus method. This approach is compared to a clustering process where no selection is applied. We show that careful selection of input ensemble members can improve the overall quality of the final clustering. Our algorithm provides more stable clustering results and in many cases overcomes the limitations of base algorithms.

Building Predictive Models in Two Stages with Meta-Learning Templates optimized by Genetic Programming

Authors

Kordík, P.; Černý, J.

Year

2014

Published

2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL) Proceedings. Piscataway: IEEE, 2014. pp. 27-34. ISBN 978-1-4799-4513-9.

Type

Proceedings paper

DOI

10.1109/CIEL.2014.7015740

Departments

Department of Theoretical Computer Science

Annotation

The model selection stage is one of the most difficult in predictive modeling. Selecting a model with the highest generalization performance involves benchmarking a huge number of candidate models or algorithms. Often, a final model is selected without considering potentially high-quality candidates just because there are too many possibilities. Improper benchmarking methodology often leads to biased estimates of model generalization performance. Automation of the model selection stage is possible; however, the computational complexity is huge, especially when ensembles of models and optimization of input features should also be considered. In this paper we show how to automate the model selection process in a way that allows searching for complex hierarchies of ensemble models while maintaining computational tractability. We introduce two-stage learning and meta-learning templates optimized by evolutionary programming with anytime properties to be able to deliver and maintain data-tailored algorithms and models in a reasonable time without human interaction. Co-evolution of inputs together with optimization of templates made it possible to solve the algorithm selection problem efficiently for a variety of datasets.

Utilizing educational data in collaboration with industry

Authors

Kordík, P.; Kuznetsov, S.

Year

2014

Published

Proceedings of the 13th Annual Conference Znalosti 2014. Praha: VŠE, 2014, pp. 38-47. ISBN 978-80-245-2054-4. Available from: http://znalosti.eu/images/accepted_papers/znalosti2014_paper23.pdf

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Universities seldom use their data efficiently. In this case study, we show how educational data can be used to recommend suitable students for projects, get feedback from industrial partners, and help students focus on skills that are demanded by companies. We have developed a portal for student collaboration with industrial partners and run it in a pilot for almost a year. Based on our observations described in this contribution, we are adjusting the portal to enhance its functionality and streamline the processes it supports.

A Soft Computing Approach to Knowledge Flow Synthesis and Optimization

Authors

Řehořek, T.; Kordík, P.

Year

2013

Published

Soft Computing Models in Industrial and Environmental Applications. Heidelberg: Springer, 2013. pp. 23-32. ISSN 2194-5357. ISBN 978-3-642-32921-0.

Type

Proceedings paper

DOI

10.1007/978-3-642-32922-7_3

Departments

Department of Theoretical Computer Science

Annotation

In the areas of Data Mining (DM) and Knowledge Discovery (KD), a large variety of algorithms has been developed in the past decades, and the research is still ongoing. Data mining expertise is usually needed to deploy the algorithms available. Specifically, a process of interconnected actions referred to as knowledge flow (KF) needs to be assembled when the algorithms are to be applied to given data. In this paper, we propose an innovative evolutionary approach to automated KF synthesis and optimization. We demonstrate the evolutionary KF synthesis on the problem of classifier construction. Both preprocessing and machine learning actions are selected and configured by means of evolution to produce a model that fits a given dataset very well.

Encoding time series data for better clustering results

Authors

Bartoň, T.; Kordík, P.

Year

2013

Published

INTERNATIONAL JOINT CONFERENCE CISIS'12 - ICEUTE'12 - SOCO'12 SPECIAL SESSIONS. Berlin: Springer, 2013. pp. 467-475. Advances in Intelligent Systems and Computing. ISSN 2194-5357. ISBN 978-3-642-33017-9.

Type

Proceedings paper

DOI

10.1007/978-3-642-33018-6_48

Departments

Department of Theoretical Computer Science

Annotation

Clustering algorithms belong to a category of unsupervised learning methods which aim to discover underlying structure in a dataset without given labels. We research methods for the analysis of biological time series signals, putting stress on global patterns found in samples. When clustering raw time series data, high dimensionality of input vectors, correlation of inputs, and shift or scaling sensitivity often deteriorate the result. In this paper, we propose to represent time series signals by various parametric models. Significant parameters are determined by means of heuristic methods, and the selected parameters are used for clustering. We applied this method to cell impedance profile data. The clustering results are more stable, more accurate and computationally less expensive than those obtained by processing raw time series data.

Interpretation of Geophysical Data at EL Fayoum-Dahshour Area, Egypt Using Three Dimensional Models

Authors

Kader, AKAK; Kordík, P.; Khalil, AK; Mekkawi, MM; El-Bohoty, MEB; Rabeh, TR; Refai, MKR; El-Mahdy, AEM

Year

2013

Published

Arabian Journal for Science and Engineering. 2013, 38(7), 1769-1784. ISSN 1319-8025.

Type

Article

DOI

10.1007/s13369-012-0385-0

Departments

Department of Theoretical Computer Science

Annotation

EL Fayoum-Cairo district lies in the north western part of Cairo city. It is affected by several earthquakes. According to the Egyptian Network Seismology of National Research Institute of Astronomy and Geophysics (NRIAG), the last one occurred in July 2005 with a magnitude of 4.2 on the Richter scale. The Bouguer and the aeromagnetic anomaly data as well as the detailed land-magnetic survey have been subjected to different techniques of processing and interpretation to better understand the tectonic setting of the study area. For example, different kinds of filters such as 2D spectral analysis, 3D analytical signal, and Euler deconvolution techniques have been applied. Finally, 2D modeling has been used to simulate the subsurface configuration along some selected profiles in the area. The obtained results indicate that the seismic events (Dahshour earthquakes) are closely related to a major NNW normal fault which has a deep extension and a slight lateral displacement, in addition to its NE-conjugate faults. The recent seismic activities in the study area are directly related to and/or associated with the rejuvenation of the lateral movements. Three-dimensional interpretations of the magnetic anomaly and Bouguer anomaly maps of Fayoum, Cairo province area, Northern Western Desert of Egypt have been presented. In addition, a high-resolution 3D magnetic and gravity model constrained with the seismic results reveals a possible crustal thickness and density distribution of the northern part of Egypt between the sedimentary cover and the mantle. The results reveal that the basement has an irregular surface with depths ranging from 3,600 to 5,500 m. ...

Supervised two-step feature extraction for structured representation of text data

Authors

Háva, O.; Kordík, P.; Skrbek, M.

Year

2013

Published

Simulation Modelling Practice and Theory. 2013, 33(33), 132-143. ISSN 1569-190X.

Type

Article

DOI

10.1016/j.simpat.2012.11.003

Departments

Department of Theoretical Computer Science

Annotation

The training data matrix used for classification of text documents into multiple categories is characterized by a large number of dimensions, while the number of manually classified training documents is relatively small. Suitable dimensionality reduction techniques are therefore required to develop the classifier. The article describes a two-step supervised feature extraction method that takes advantage of projections of terms into document and category spaces. We propose several enhancements that make the method more efficient and faster than the version presented in our former paper. We also introduce an adjustment score that makes it possible to correct defective targets and helps to identify improper training examples that bias the extracted features.

Vector representation of context networks of latent topics

Authors

Háva, O.; Skrbek, M.; Kordík, P.

Year

2013

Published

Proceedings of the World Congress on Engineering 2013. Hong Kong: Newswood Limited - International Association of Engineers, 2013. pp. 286-290. ISSN 2078-0958. ISBN 978-988-19251-0-7.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science
Department of Computer Systems

Annotation

Transforming text documents to real vectors is an essential step for text mining tasks such as classification, clustering and information retrieval. The extracted vectors serve as inputs for data mining models. Large vocabularies of natural languages imply a high dimensionality of input vectors; hence a substantial dimensionality reduction has to be made. We propose a new approach to a vector representation of text documents. Our representation takes into account the order of latent topics that generate observed words; an extracted document vector includes information about the adjacency of words in a document. We show experimentally that the proposed representation makes it possible to build document classifiers of higher accuracy using shorter document vectors. Short but informative document vectors save memory for storing data, allow simpler models that learn faster, and significantly reduce overfitting.

Contextual latent semantic networks used for document classification

Authors

Háva, O.; Skrbek, M.; Kordík, P.

Year

2012

Published

Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. Porto: SciTePress - Science and Technology Publications, 2012. pp. 425-430. ISBN 978-989-8565-29-7.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science
Department of Computer Systems

Annotation

Widely used document classifiers are developed over a bag-of-words representation of documents. Latent semantic analysis based on singular value decomposition is often employed to reduce the dimensionality of such a representation. This approach overlooks word order in a text, which can improve the quality of a classifier. We propose a language-independent method that records the context of a particular word in a context network utilizing products of latent semantic analysis. The context networks of individual words are combined into one network that represents a document. A new document is classified based on the similarity between its network and the networks of training documents. The experiments show that the proposed classifier achieves better performance than common classifiers, especially when the preceding reduction of dimensionality is significant.

Document Classification with Supervised Latent Feature Selection

Authors

Háva, O.; Skrbek, M.; Kordík, P.

Year

2012

Published

Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics. New York: ACM, 2012. pp. 70-74. ISBN 978-1-4503-0915-8.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science
Department of Computer Systems

Annotation

The classification of text documents generally deals with high-dimensional data. To favor generality of classification, a researcher has to apply a dimensionality reduction technique before building a classifier. We propose a classification and reduction algorithm that makes use of latent uncorrelated topics extracted from training documents and their known categories. We suggest several latent feature selection options and test them.

Max-min ant system with linear memory complexity

Authors

Kovářík, O.; Kordík, P.

Year

2012

Published

Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2012, Brisbane, Australia, June 10-15, 2012. New York: IEEE, 2012. pp. 1-5. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6241678. ISBN 978-1-4673-1509-8.

Type

Proceedings paper

DOI

10.1109/CEC.2012.6256528

Departments

Department of Theoretical Computer Science

Annotation

We improved the memory complexity of the Max-Min Ant System algorithm, which is one of the best performing Ant Colony Optimization algorithms. The standard implementation uses an n × n matrix of artificial pheromone to solve the Traveling Salesman Problem, where n is the number of cities. We show that only a small fraction of values in the pheromone matrix differs from a global value, so the matrix can be replaced by an array of hash tables in order to reduce the memory complexity. This improvement is very useful for parallelization on multicore architectures, where frequent transfers of random parts of matrices cause a radical slowdown. It also enables the algorithm to be competitive with others when solving large instances.
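The storage scheme described above can be sketched as one hash table per city that holds only the entries differing from the shared global value. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `SparsePheromone`, its methods, and the numeric values are hypothetical.

```python
class SparsePheromone:
    """Pheromone storage as an array of hash tables (one dict per city).

    Only entries that differ from the shared global value are stored,
    so memory grows with the number of modified edges, not with n * n.
    """

    def __init__(self, n, global_value):
        self.global_value = global_value
        self.rows = [dict() for _ in range(n)]

    def get(self, i, j):
        # Fall back to the global value for edges that were never modified.
        return self.rows[i].get(j, self.global_value)

    def set(self, i, j, value):
        if value == self.global_value:
            # Storing the global value explicitly would waste memory.
            self.rows[i].pop(j, None)
        else:
            self.rows[i][j] = value

    def stored_entries(self):
        return sum(len(row) for row in self.rows)


# Hypothetical usage: 1000 cities, one reinforced edge.
pheromone = SparsePheromone(1000, global_value=0.01)
pheromone.set(3, 7, 0.5)
print(pheromone.get(3, 7), pheromone.get(7, 3), pheromone.stored_entries())
```

A dense matrix for the same instance would hold 1,000,000 values, whereas this sketch stores a single entry. A symmetric problem would additionally need both directions of an edge kept in sync, which is omitted here.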

On performance of Meta-learning Templates on Different Datasets

Authors

Kordík, P.; Černý, J.

Year

2012

Published

The 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, June 10-15, 2012. New York: IEEE, 2012. pp. 1-7. ISSN 1098-7576. ISBN 978-1-4673-1490-9.

Type

Proceedings paper

DOI

10.1109/IJCNN.2012.6252379

Departments

Department of Theoretical Computer Science

Annotation

Meta-learning templates are data-tailored algorithms that produce supervised models. When a template is evolved on a particular dataset, it is supposed to generate good models not only on this data set but also on similar data. In this paper, we will investigate one possible way of measuring the similarity of datasets and whether it can be used to estimate if meta-learning templates produce good models. We performed experiments on several well known data sets from the UCI machine learning repository and analyzed both the similarity of datasets and templates in the space of performance meta-features (landmarking). Our results show that the most universal algorithms (in terms of average performance) for supervised learning are the complex hierarchical templates evolved by our SpecGen approach.

Selecting Representative Data Sets

Authors

Borovička, T.; Jiřina, M.; Kordík, P.; Jiřina, M.

Year

2012

Published

Advances in Data Mining Knowledge Discovery and Applications. Rijeka: InTech - Open Access Company (InTech Europe), 2012. p. 43-66. ISBN 978-953-51-0748-4.

Type

Book chapter

DOI

10.5772/50787

Departments

Department of Theoretical Computer Science

Annotation

Many Data Mining methods use data sets, particularly training and testing sets, for setting their parameters. Setting the parameters corresponds to the learning (training) of the methods. This is the case, for example, for artificial neural networks and other adaptive (iterative) methods. Some of these methods use a so-called validation set as well. A question that can arise is how to correctly divide or otherwise preprocess a given data set into these sets, i.e. how to select data samples from the original set and place them into the training and testing sets. The chapter gives an overview of existing methods for data selection and sampling. A general approach to the problem of selecting data for training, testing and possibly validation sets is discussed. To be able to compare individual approaches, model evaluation techniques are discussed as well. Data splitting is one of the approaches used to construct training, testing and possibly validation sets, but there are many other approaches to proper data preparation. Instance selection methods and class balancing methods belong to these approaches and are discussed in the chapter. An overview of the methods together with their features, utilization, advantages and disadvantages is given as well. Selected main principles are illustrated in figures.

Self-organization of Supervised Models

Authors

Kordík, P.; Černý, J.

Year

2011

Published

Meta-Learning in Computational Intelligence. London: Springer, 2011. p. 179-223. Studies in Computational Intelligence. vol. 358. ISSN 1860-949X. ISBN 978-3-642-20979-6.

Type

Book chapter

DOI

10.1007/978-3-642-20980-2_6

Departments

Department of Theoretical Computer Science

Annotation

The cornerstone of successful data mining is to choose a suitable modelling algorithm for given data. Recent results show that the best performance can be achieved by an efficient combination of models or classifiers. The increasing popularity of combination (ensembling, blending) of diverse models has been significantly influenced by its success in various data mining competitions.

Using Interactive Evolution for Exploratory Data Analysis

Authors

Řehořek, T.; Kordík, P.

Year

2011

Published

Proceedings of the 6th International Scientific and Technical Conference (CSIT'2011). Lviv: Lviv Polytechnic National University, 2011. pp. 131-135. ISBN 978-966-2191-04-2.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Multivariate data are difficult to explore. The most popular linear projection techniques mapping data to 2-dimensional space often fail to reveal the patterns of interest. Non-linear mapping techniques are both slow and inefficient. In this paper, we propose a heuristic that allows users to adjust the parameters of mapping techniques just by stating their preferences iteratively. The preliminary results on a real-world dataset demonstrate the power of our approach.

Advanced ensemble strategies for polynomial models

Authors

Černý, J.; Kordík, P.

Year

2010

Published

Proceeding of the third international conference on inductive modelling. Kyjev: National Academy of Sciences of Ukraine, 2010. pp. 77-82.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Recently, ensemble methods proved to improve accuracy of weak learners and reduce overfitting of models with high plasticity. In this paper, we experiment with various state of the art ensemble strategies applied to polynomial models. We also explore the efficiency of ensembling, when applied to polynomial models with increasing plasticity.

Application of Feature Selection Methods in Age Prediction

Authors

Pilný, A.; Kordík, P.; Šnorek, M.; Kubelková, Radka

Year

2010

Published

Proceeding of the third international conference on inductive modelling. Kyjev: National Academy of Sciences of Ukraine, 2010. pp. 167-171.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

In age prediction, it is difficult to get good results for adults. Almost all people over twenty years of age have completed the teeth mineralization phase, so age prediction for adults needs special methods. Children, on the other hand, have each tooth in one of several mineralization phases, and the prediction is more precise. In this paper we present a successful method for children's age prediction on the basis of teeth mineralization (MFH method). For the age prediction we used the data mining software FAKE-GAME, which uses a special hybrid artificial neural network called GAME. This method provides two important approaches for age prediction. It incorporates a feature selection mechanism which selects only the most important teeth. It also derives a feature ranking (a ranking of the selected teeth) by using the FeRaNGA method during the building phase of the GAME network, the data mining model.

Continuous Optimization Algorithms: Performance on Benchmarking Functions and Model Parameters

Authors

Kordík, P.; Bičík, V.

Year

2010

Published

Proceedings of the 7th EUROSIM Congress on Modelling and Simulation, Vol. 2: Full Papers. Prague: Department of Computer Science and Engineering, FEE, CTU in Prague, 2010. pp. 245-252. ISBN 978-80-01-04589-3.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Estimation of continuous parameters is a frequent task in modelling and simulation. There are several general-purpose algorithms available for this task. We benchmarked these algorithms in order to recommend an appropriate algorithm for our model identification problem. We present results of optimization algorithms for standard benchmarking functions and show the importance of proper parameter setting. When these algorithms are applied to the estimation of model parameters, the results are quite different. For this task, the gradient (quasi-Newton) and the nature-inspired method (CMAES) can be efficiently combined, achieving the best optimization performance.

GAME MODEL UTILIZATION FOR FEATURE RANKING AND FEATURE SELECTION

Authors

Pilný, A.; Kordík, P.; Šnorek, M.

Year

2010

Published

Proceedings of the Seventh International Conference on Engineering Computational Technology. Stirling: Civil-Comp Press Ltd, 2010. pp. 143-148. ISSN 1759-3433. ISBN 978-1-905088-39-3.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Almost all Feature Ranking and Feature Selection methods are based on only one specific approach (e.g. a statistical measure of the data, the wrapper approach, etc.). Another problem is the applicability of these methods. Most Feature Ranking and Feature Selection methods are designed for categorical data only. Some of them also have a special version for regression problems. In this paper we present a modified approach to Feature Ranking and Feature Selection based on a combination of a statistical measure and the utilization of a data mining model. The main feature of this approach is that the data mining algorithm (GAME) is able to deal with numerical data and can be applied to categorical data as well. It incorporates a feature selection mechanism, and the methods described in this paper derive a feature ranking from the final data mining model.

Meta-learning approach to neural network optimization

Authors

Kordík, P.; Koutník, J.; Drchal, J.; Kovářík, O.; Čepek, M.; Šnorek, M.

Year

2010

Published

Neural Networks. 2010, 23(4), 568-582. ISSN 0893-6080.

Type

Article

DOI

10.1016/j.neunet.2010.02.003

Departments

Department of Theoretical Computer Science

Annotation

Optimization of neural network topology, weights and neuron transfer functions for a given data set and problem is not an easy task. In this article, we focus primarily on building an optimal feed-forward neural network classifier for i.i.d. data sets. We apply meta-learning principles to the optimization of neural network structure and function. We show that diversity promotion, ensembling, self-organization and induction are beneficial for the problem. We combine several different neuron types trained by various optimization algorithms to build a supervised feed-forward neural network called Group of Adaptive Models Evolution (GAME). The approach was tested on a large number of benchmark data sets. The experiments show that the combination of different optimization algorithms in the network is the best choice when the performance is averaged over several real-world problems.

New Methods for Feature Ranking

Authors

Pilný, A.; Kordík, P.; Šnorek, M.

Year

2010

Published

Workshop 2010. Praha: České vysoké učení technické v Praze, 2010. pp. 102-103. CTU Reports. ISBN 978-80-01-04513-8.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Most Feature Ranking and Feature Selection approaches can be used for categorical data only. Some of them rely on statistical measures of the data, some are tailored to a specific data mining algorithm (wrapper approach). In this paper we present new methods for feature ranking and selection obtained as a combination of the above-mentioned approaches. The data mining algorithm (GAME) is designed for numerical data, but it can be applied to categorical data as well. It incorporates feature selection mechanisms, and the new methods proposed in this paper derive the feature ranking from the final data mining model. The rank of each feature selected by the model is computed by processing correlations of outputs between the model's neighboring neurons in different ways. We used four different methods based on fuzzy logic, certainty factors and simple calculus. The performance of these four feature ranking methods was tested on artificial data sets, on the well-known Ionosphere data set and on the well-known Housing

The Effect of Modelling Method to the Inductive Preprocessing Algorithm

Authors

Čepek, M.; Kordík, P.; Šnorek, M.

Year

2010

Published

Proceedings of 3rd International Conference on Inductive Modelling 2010. Kiev: Ukr. INTEI, 2010. pp. 131-138.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Data preprocessing is a very important part of the knowledge discovery process. Data mining systems contain tens of preprocessing methods (for example methods for missing data imputation, data reduction, discretization, data enrichment, etc.) and usually it is not clear which methods to use. Selecting the preprocessing methods appropriate for a particular dataset requires considerable experience and a lot of experimenting. In this paper we test the influence of the modelling method, which is the cornerstone of the Inductive Preprocessing Algorithm. The modelling method is used to evaluate the evolved sequence of preprocessing methods. We compare four modelling methods with respect to the final achieved accuracy. The tested modelling methods are the Polynomial model, Decision Tree, SVM and Logistic Function Classifier. To test our automatic preprocessing, we utilize several real-world datasets available from the UCI Machine Learning Repository.

Building Automation Simulator and Control Strategy for Intelligent and Energy Efficient Home

Authors

Kordík, P.; Šnorek, M.; Hasaj, M.; Tvrdý, M.

Year

2009

Published

Proceedings of the European Modelling and Simulation Symposium EMS 2009. Los Alamitos: IEEE Computer Society, 2009. pp. 123-126. ISBN 978-1-4244-5345-0.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

An intelligent building automation system can manage many devices in order to balance energy savings and the comfort of the inhabitants. The strategy controlling the devices has to be adaptive and learn to match users' needs. Based on the Amigo framework, we developed a simulator of a household. We added lights, devices, pheromone-controlled inhabitants, a physical model of heat loss, etc. The system runs in real time, and a simple strategy was used to control heat, lights and other devices. In this simulator, we plan to evaluate several different control strategies in terms of energy efficiency and user comfort. We also proposed an adaptive control strategy based on neural networks induced on data from sensors and user interaction signals. We built an experimental KNXbus platform demonstrating the feasibility of our concept.

Comparison of Several Classifiers to Evaluate Endocardial Electrograms Fractionation in Human

Authors

Křemen, V.; Kordík, P.; Lhotská, L.

Year

2009

Published

Proceedings of the 31st Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Piscataway: IEEE, 2009. p. 2502-2505. ISSN 1557-170X. ISBN 978-1-4244-3295-0.

Type

Proceedings paper

DOI

10.1109/IEMBS.2009.5335161

Departments

Department of Theoretical Computer Science

Annotation

Complex fractionated atrial electrograms (CFAEs) may represent the electrophysiological substrate for atrial fibrillation (AF). Progress in signal processing algorithms to identify CFAE sites is crucial for the development of AF ablation strategies. A novel algorithm for automated description of atrial electrogram (A-EGM) fractionation based on the wavelet transform and several statistical pattern recognition methods was proposed, and a new methodology of A-EGM processing was designed and tested. The algorithms for A-EGM classification were developed using normal density based classifiers, linear and high degree polynomial classifiers, nearest mean scaled classifiers, nonlinear classifiers, neural networks and J48. All classifiers were compared and tested using a representative set of 1.5 s A-EGMs (n = 68) ranked by 3 independent experts with 100% agreement into 4 classes of fractionation: 1 - organized atrial activity; 2 - mild; 3 - intermediate; 4 - high degree of fractionation. The feature extraction and the well-performing classification algorithms tested here showed a maximal error of 15%, a mean classification error across all implemented classifiers of 9%, a best mean classification error of 5.9% (nearest mean classifier), and a classification error for highly fractionated A-EGMs of approximately 9%.

Correlation-based Feature Ranking in Combination with Embedded Feature Selection

Authors

Pilný, A.; Oertel, W.; Kordík, P.; Šnorek, M.

Year

2009

Published

Proceedings of the 3rd International Workshop on Inductive Modelling 2009. Kiev: Ukr. INTEI, 2009. pp. 28-35.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Most Feature Ranking and Feature Selection approaches can be used for categorical data only. In this paper we present new methods for feature ranking and selection obtained as a combination of the above-mentioned approaches. The data mining algorithm (GAME) is designed for numerical data, but it can be applied to categorical data as well. It incorporates feature selection mechanisms, and the new methods proposed in this paper derive the feature ranking from the final data mining model. The rank of each feature selected by the model is computed by processing correlations of outputs between the model's neighboring neurons in different ways. We used four different methods based on fuzzy logic, certainty factors and simple calculus. The performance of these four feature ranking methods was tested on artificial data sets and on well-known real-world data sets. These methods produce rankings consistent with recently published studies.

GAME - Hybrid Self-Organizing Modeling System Based on GMDH

Authors

Kordík, P.

Year

2009

Published

Hybrid Self-Organizing Modeling Systems. Berlin: Springer, 2009. p. 233-280. Studies in Computational Intelligence. vol. 211. ISSN 1860-949X. ISBN 978-3-642-01529-8.

Type

Book chapter

Departments

Department of Theoretical Computer Science

Annotation

In this chapter, an algorithm to construct a hybrid self-organizing neural network is proposed. It combines niching evolutionary strategies with nature-inspired and gradient-based optimization algorithms to evolve a neural network with an optimal topology adapted to a data set. The GAME algorithm lies somewhere between the GMDH algorithm and the NEAT algorithm. It is capable of handling irrelevant inputs and short and noisy data samples, but also complex data such as the "two intertwined spirals" problem. The self-organization of the topology allows it to produce accurate models for various tasks. Benchmarking against machine learning algorithms implemented in the Weka software showed that the accuracy of GAME models was superior for both regression and classification problems. The most successful configuration of the GAME algorithm does not change with the character of the problem; natural evolution selects all important parameters of the algorithm. This is a significant step towards the automated data mini

Meta-optimization survey and possible applications of Inductive Approach in this field

Authors

Kordík, P.; Drchal, J.

Year

2009

Published

Proceedings of the International Conference ISDMCI 2009. Kiev: National Academy of Sciences of Ukraine, 2009. pp. 123-132.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

In this paper we present the state of the art in the field of meta-optimization. We look for possible applications of the Inductive Approach in this field. We propose an inductive-based optimization strategy combining several optimization algorithms to solve increasingly complex problems.

Methods of true data mining model selection - with experimental results

Authors

Bulgakova, O.; Kordík, P.

Year

2009

Published

Proceedings of the 3rd International Workshop on Inductive Modelling 2009. Kiev: Ukr. INTEI, 2009. pp. 23-27.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

This work presents the modeling results for different real noisy data (nalada: humanities, spirals_1 and spirals_2: complex data, motol_brain: Motol hospital neurosurgery, boshouse: house prices, and also artificial data), using soft computing methods: the combinatorial group method of data handling (combi GMDH) and the Group of Adaptive Models Evolution (GAME) method. The goal of our work is to get the best possible result on such noisy data and to compare the results of the particular methods. We also combine these two methods (Game_Combi_GMDH) to get better predictions.

New Robust Method of Atrial Electrograms Classification to Identify Complex Fractionated Atrial Electrograms

Authors

Křemen, V.; Lhotská, L.; Kordík, P.

Year

2009

Published

Proceedings of the 31st Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Piscataway: IEEE, 2009, pp. 2502-2505. ISSN 1557-170X. ISBN 978-1-4244-3295-0.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Purpose: Complex fractionated atrial electrograms (CFAEs) may represent the electrophysiologic substrate for atrial fibrillation (AF). Progress in signal processing algorithms and pattern recognition systems to identify CFAE sites is crucial for the development of AF ablation strategies. Methods: A head-to-head comparison of two methods for quantification of atrial electrograms (A-EGMs) with different degrees of fractionation was performed using a representative set of 1.5 s A-EGMs (n = 113) that were blindly ranked by 3 independent experts into 4 categories: 1 - organized atrial activity; 2 - mild; 3 - intermediate; 4 - high degree of fractionation. The first method (M1) assessed the average interval between discrete A-EGM spikes using an algorithm previously implemented in a commercially available electroanatomic mapping system.

Testing of Inductive Preprocessing Algorithm

Authors

Čepek, M.; Kordík, P.; Šnorek, M.

Year

2009

Published

Proceedings of the 3rd International Workshop on Inductive Modelling 2009. Kiev: Ukr. INTEI, 2009. pp. 13-18.

Type

Proceedings paper

Departments

Department of Theoretical Computer Science

Annotation

Data preprocessing is a very important part of the knowledge discovery process. Data mining systems contain tens of preprocessing methods (for example methods for missing data imputation, data reduction, discretization, data enrichment, etc.) and usually it is not clear which methods to use. Selecting the preprocessing methods appropriate for a particular dataset requires considerable experience and a lot of experimenting. In this paper we test our extension of the inductive approach to data preprocessing. We developed an inductive preprocessing method which uses a genetic algorithm to compose from scratch a sequence of preprocessing methods that fits the data and allows a successful model to be created. To test our automatic preprocessing, we utilize several real-world datasets available from the UCI Machine Learning Repository.