Ing. Daniel Vašata, Ph.D.

Theses

Bachelor theses

Product review sentiment analysis in the Czech language

Author
Lukáš Langr
Year
2019
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Mgr. Petr Novák, Ph.D.
Summary
This thesis provides a closer look at state-of-the-art methods of representing documents for sentiment analysis tasks. As many recent articles focus only on either English or Chinese, this thesis provides a unique evaluation of those methods from the perspective of the Czech language. We apply various representations to reviews written in Czech and perform multiclass sentiment classification with machine learning models. The accuracy we achieved exceeds both our expectations and the results of similar research articles that use the same Czech dataset. We believe this thesis will serve as a base for more extensive research into the possibilities of these representations.

Normalization and smoothing of RSSI values of Bluetooth connection

Author
Filip Špaček
Year
2023
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Magda Friedjungová, Ph.D.
Summary
This thesis explores the effects of RSSI time series normalization and smoothing. It implements several methods, such as exponential smoothing, moving average smoothing, and Savitzky-Golay smoothing. It also proposes a normalization technique that compensates for differences between the RSSI values of distinct packet types. The proposed methods were tested on an existing approach detection model, and the results were compared.
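As an illustration of two of the smoothing methods named in the summary, the following is a minimal sketch in plain Python. It is not the thesis implementation: the function names, the smoothing factor alpha, the window length, and the sample RSSI readings are all assumptions chosen for this example.

```python
from typing import List, Sequence


def exponential_smoothing(values: Sequence[float], alpha: float) -> List[float]:
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for value in values:
        if not smoothed:
            # The first smoothed value is the first observation itself.
            smoothed.append(float(value))
        else:
            smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early points average over the samples seen so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    result: List[float] = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the sample that just left the window.
            running_sum -= values[i - window]
        result.append(running_sum / min(i + 1, window))
    return result


# Hypothetical RSSI readings in dBm, made up for illustration only.
rssi = [-70.0, -72.0, -60.0, -71.0, -69.0]
print(exponential_smoothing(rssi, alpha=0.5))
print(moving_average(rssi, window=3))
```

A smaller alpha or a longer window smooths more strongly but reacts more slowly to real changes in signal strength.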

Automated exploratory data analysis for binary classification using pandas profiling library

Author
Jan Čáp
Year
2023
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Magda Friedjungová, Ph.D.
Summary
This work deals with automatic data exploration for binary classification. A survey of existing solutions for automatic data exploration is performed. Furthermore, statistical tests and methods suitable for testing the dependence of two variables are investigated. Suitable options for visualizing data distributions are also explored. Next, an extension to the Pandas Profiling library, which was selected in the survey, is proposed. The extension specializes in binary classification. It includes graphs and statistics representing the dependency of columns on the target variable, visualization of the dependency of missing values on the target variable, proposed column transformations, and training of a default model for target variable classification. Based on this design, the extension to the Pandas Profiling library was implemented to speed up data exploration for binary classification.

Discussion comments analysis on Czech news websites

Author
Martin Vastl
Year
2019
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Karel Klouda, Ph.D.
Summary
This thesis focuses on the possibilities of using natural language processing methods to analyze comments on a news portal. The main goal is to compare the ability of BERT, Doc2vec, and Doc2vec with pretrained word vectors from BERT to assess the relevance of comments to the content of an article from a news portal. Another goal is to use the text vector representations to detect anomalies via the Local outlier factor method. Experiments showed that the best model for text representation is BERT and that the pretrained word vectors have no positive impact on the results compared with Doc2vec without pretrained vectors. Moreover, the Local outlier factor can detect anomalous comments and users when using vectors from BERT, in contrast to the Doc2vec text representations, which are not good enough for anomaly detection and therefore often return incorrect results.

Media articles tracking and evaluation

Author
Peter Kanoš
Year
2018
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Mgr. Jan Starý, Ph.D.
Summary
This thesis deals with the implementation of an application for collecting articles and their versions over time from the Czech news servers iDnes.cz and Aktuality.cz. The articles are subsequently analyzed using Doc2Vec. The analysis focuses on changes over time and on comparing the similarity between article sections. The changes concern the titles, lead paragraphs (perexes), and body text of the articles. The thesis mainly examines relations between different factors, such as the time of publication of an article, its main topics, etc. The result of the thesis is an application written in Python.

Unsupervised machine translation between Czech and German language

Author
Ivana Kvapilíková
Year
2020
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Karel Klouda, Ph.D.
Summary
Recent research has shown that it is possible to design a model that learns to translate entirely from monolingual texts. Even though the translation quality still lags behind that of state-of-the-art models trained on texts translated by humans, this line of research opens new doors for low-resource language pairs. This thesis provides an overview of unsupervised techniques for machine translation applicable in low-resource conditions. We apply the most promising approaches and compare their performance on the Czech-German language pair. Since the proposed methods depend on vector representations of words in a cross-lingual space, we experiment with these representations to show how much language-neutral information they carry.

Web application for ordinal encoding of string variables in data files

Author
Miroslav Duka
Year
2014
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Daniel Dombek, Ph.D.

Web demonstration of basic statistical calculations based on mathematical software R and SAGE

Author
Jana Ernekerová
Year
2015
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Mgr. Rudolf Bohumil Blažek, Ph.D.
Summary
This thesis deals with options for integrating the open-source algebraic computation systems R and Sage into a web application. R was connected to the web application using the API provided by the OpenCPU project, and Sage was connected through the Sage Cell Server service. Both selected systems were successfully used in a web application built in PHP. The result is a simple web application for basic statistical calculations. The main contribution of this thesis is an analysis of the possibility of using R and Sage in web applications and their comparison in terms of ease of integration, efficiency, and practical applicability.

Analysis of discussion comments and their authors in social media

Author
Martin Koucký
Year
2020
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Karel Klouda, Ph.D.
Summary
It is possible that among social media users there exist unknown clusters of users and anomalous users. This work explores that possibility by analyzing users represented by their comments. We find suitable sources of data on social media sites and download them. Then, we propose vector representations of users based on their comments. Finally, we try to explain the clusters of users and the anomalous users using various attributes available on social media sites and through manual analysis. Our results did not prove the existence of clusters or anomalies among social media users, because there was no clear distinction between normal and anomalous users or between users of different clusters. This may have been caused by insufficient methods of representing users or by the manual analysis. However, it may also mean that there are no clusters of users or anomalies commenting in a similar way to be found.

Analysis and prediction of blood glucose dynamics using Machine learning techniques

Author
Ladislav Floriš
Year
2022
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Ivo Petr, Ph.D.
Summary
This bachelor thesis addresses the problem of predicting blood glucose (BG) levels of type 1 diabetes (T1D) patients. We first analyze BG dynamics and then research and evaluate suitable models for its prediction. We focus on models based on artificial neural networks and support vector machines. These models were experimentally evaluated on 30-minute, 1-hour, and 2-hour prediction horizons. The data used in this thesis was collected by one patient over 128 days in free-living conditions and contains BG levels, insulin doses, carbohydrate intake, and physical activity. Model performance was assessed using Root Mean Square Error (RMSE). Clarke error grid analysis was used to measure clinical accuracy. The best RMSE achieved was 17.06 mg/dl, 24.32 mg/dl, and 27.11 mg/dl for the 30-minute, 1-hour, and 2-hour prediction horizons, respectively. Our results show that it is possible to develop models for BG prediction which perform well in free-living conditions. Unlike most other papers in the academic literature on BG prediction, we used a longer dataset containing over 4 months' worth of data for a single patient. Lastly, we made this dataset publicly available for further research in this area.
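For readers unfamiliar with the RMSE metric mentioned in the summary, the following is a minimal sketch of how it is computed. The function name and the sample values are assumptions made up for illustration and are not taken from the thesis or its dataset.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error: the square root of the mean squared difference."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical measured and predicted BG levels in mg/dl (illustrative only).
measured = [100.0, 120.0]
forecast = [103.0, 116.0]
print(rmse(measured, forecast))
```

Because the errors are squared before averaging, RMSE is expressed in the same unit as the measurements (here mg/dl) and penalizes large individual errors more heavily than small ones.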

Web application supporting timetabling for part-time students

Author
Jiří Hanuš
Year
2013
Type
Bachelor thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Karel Klouda, Ph.D.

Master theses

Sentiment Analysis using Domain Specific Adapters

Author
Lukáš Langr
Year
2022
Type
Master thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
doc. Ing. Štěpán Starosta, Ph.D.
Summary
Natural language processing has become a domain of large pre-trained models that require a great deal of computing power to adjust to a custom task. In this work, a different transfer learning method based on domain-specific adapters is proposed for the task of sentiment analysis. The adapted models are compared to a fine-tuning baseline in multiple experimental scenarios; their performance is comparable to that of considerably larger models while being much less computationally intensive. This approach appears to be a viable alternative to large models in environments with limited computing power.

Learning methods for continuous-time hidden Markov models

Author
Lukáš Lopatovský
Year
2017
Type
Master thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Tomáš Šabata
Summary
The continuous-time hidden Markov model is promising not only for biomedical research. The lack of efficient learning algorithms has limited its use in the past. Recently, however, new efficient EM approaches have been presented. In this thesis, we examine and compare current state-of-the-art methods that are able to train models containing hundreds of hidden states. As part of this work, we developed a general-purpose continuous-time and discrete-time hidden Markov model library that efficiently implements the best-performing learning methods, is easy to use, and is available to everyone under an open-source license.

Suspicion of corruption rating of contracts published in the government contracts registry

Author
Jan Staněk
Year
2018
Type
Master thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Marek Sušický
Summary
This master's thesis describes the design of metrics for identifying suspicious contracts published in the register of contracts. It describes public data sources suitable for supplementing data from the register of contracts, data integration, and feature selection for anomaly detection. The designed metrics simplify the selection of contracts suitable for manual review.

The use of Relative Goodness of Fit Tests for training Generative Adversarial Networks

Author
Martin Scheubrein
Year
2021
Type
Master thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Magda Friedjungová, Ph.D.
Summary
Generative adversarial networks (GAN) are a class of deep learning methods which are usually applied to images or other high-dimensional data. With such data, it is difficult to decide if the distribution learnt by a model matches the distribution of source data, or to locate the differences. To measure those discrepancies, maximum mean discrepancy (MMD) or unnormalized mean embedding (UME) measures may be used. This thesis verifies that with proper parametrization, both measures reliably detect both global and local discrepancies in image data. Choice of kernel, its parameters, and in the case of UME the selection of test locations, are studied in detail. Interpretability of optimized test locations in the context of local difference discovery is verified. Finally, a novel method of early stopping based on MMD and UME measured between the network's output and testing data is proposed.

Deep Reinforcement Learning for Super Mario Bros

Author
Ondřej Schejbal
Year
2022
Type
Master thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Mgr. Petr Novák, Ph.D.
Summary
In this master's thesis, a fine-tuned reinforcement learning model capable of training an intelligent agent to play the Super Mario Bros. game was created. Its architecture is based on research into current state-of-the-art reinforcement learning techniques, in which the most relevant models for this type of task were compared with each other. In order to compare the models, the tools that allow a model to interact with the game were researched and described. Based on the comparison results, the most suitable approach was selected. Experiments applying various modifications to the selected model were carried out in order to find the most suitable modifications for the Super Mario Bros. game. The fine-tuned model was used to train an intelligent agent, whose performance was tested on the level it was trained on and also on two levels that it had never seen before. The agent performed well and showed good behavioral patterns, mainly on the level it was trained on; its performance on the unseen levels was understandably worse.

Curriculum Learning of Neural Networks

Author
Gary Fibiger
Year
2020
Type
Master thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Magda Friedjungová
Summary
Artificial neural networks are usually trained by observing samples from a training set in a random order. This approach is similar to biological organisms, but their learning process is hardly ever random. Human supervised learning utilizes a curriculum that leads the learning process. Many approaches were proposed to introduce a curriculum to artificial neural networks training in recent years. This thesis provides an overview of those approaches. Many of the approaches were implemented and experimentally evaluated. The results show that different approaches are favorable under different circumstances.

Recurrent Memory Models with Optimal Polynomial Projections

Author
Ondřej Naňka
Year
2021
Type
Master thesis
Supervisor
Ing. Daniel Vašata, Ph.D.
Reviewers
Ing. Karel Klouda, Ph.D.
Summary
The aim of this thesis is to research the practical usability of high-order polynomial projection operators, which compress signals by projecting them onto polynomial bases, for the implementation of recurrent neural networks. Experiments in the fields of sound classification and natural language processing are performed using the TensorFlow framework and also as a spiking neural network using the NengoDL simulator.