Ing. Dominik Soukup

soukudom@fit.cvut.cz
TH:A-954

Publications

Evaluation of the Limit of Detection in Network Dataset Quality Assessment with PerQoDA

Authors

Wasielewska, K.; Soukup, D.; Čejka, T.; Camacho, J.

Year

2023

Published

ECML PKDD 2022: Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Cham: Springer, 2023. p. 170-185. ISSN 1865-0929. ISBN 978-3-031-23632-7.

Type

Proceedings paper

DOI

10.1007/978-3-031-23633-4_13

Departments

Department of Digital Design
Department of Applied Mathematics

Annotation

Machine learning is recognised as a relevant approach to detecting attacks and other anomalies in network traffic. However, there are still no suitable network datasets that would enable effective detection. Preparing a network dataset is difficult, both for privacy reasons and because tools for assessing dataset quality are lacking. In a previous paper, we proposed a new method for data quality assessment based on permutation testing. This paper presents a parallel study on the limits of detection of such an approach. We focus on the problem of network flow classification and use well-known machine learning techniques. The experiments were performed using publicly available network datasets.
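
The general idea of permutation testing mentioned in the annotation can be illustrated with a minimal sketch. This is not the PerQoDA method itself: the nearest-centroid classifier, the synthetic two-class data, and the function names `nearest_centroid_accuracy` and `permutation_p_value` are all assumptions made for illustration. The sketch shuffles the labels many times and checks how often a classifier does as well on shuffled labels as on the true ones; a small p-value suggests the labels carry real information about the features.

```python
import random
from statistics import mean


def nearest_centroid_accuracy(X, y):
    """Training accuracy of a nearest-centroid classifier (illustrative stand-in)."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [mean(col) for col in zip(*rows)]

    def predict(x):
        # Assign the class whose centroid is closest in squared Euclidean distance.
        return min(
            classes,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return sum(predict(x) == label for x, label in zip(X, y)) / len(y)


def permutation_p_value(X, y, n_permutations=200, seed=0):
    """Compare accuracy on the true labels with accuracy on shuffled labels."""
    rng = random.Random(seed)
    observed = nearest_centroid_accuracy(X, y)
    shuffled = list(y)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the feature-label relationship
        if nearest_centroid_accuracy(X, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_good + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical data: two well-separated 2-D clusters standing in for flow features.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)]
X += [[data_rng.gauss(4, 1), data_rng.gauss(4, 1)] for _ in range(30)]
y = [0] * 30 + [1] * 30

observed, p_value = permutation_p_value(X, y)
print(f"accuracy on true labels: {observed:.2f}, permutation p-value: {p_value:.3f}")
```

In this toy setting the classes are easy to separate, so the p-value should come out small. On a dataset with noisy or uninformative labels, accuracy on shuffled labels would approach accuracy on the true labels and the p-value would grow.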

QoD: Ideas about Evaluating Quality of Datasets

Authors

Soukup, D.; Hynek, K.; Čejka, T.

Year

2020

Published

Proceedings of the 8th Prague Embedded Systems Workshop. Praha: Czech Technical University in Prague, 2020. p. 8-9. ISBN 978-80-01-06772-7.

Type

Proceedings paper

Departments

Department of Digital Design

Annotation

The importance of computer networks is rising every year: we connect more and more devices and applications, and our daily routines depend on connectivity. This also creates great potential for attackers, who can hide their activities in a complex network environment and steal valuable data. Without a solid dataset, an evaluation score misrepresents the real score in a production environment; proper datasets therefore play an essential role in the research and development of any ML-based classifier or detector. The main motivation for this paper is to find a way to evaluate the quality of any dataset in order to estimate whether it is good enough for ML experiments. To the best of our knowledge, only a few studies focus on the quality evaluation of datasets with network traffic. For the experiments, we selected datasets for DNS over HTTPS (DoH) detection and URL classification, problems that are already being worked on. All metrics are calculated at the dataset level. The impact of these metrics is evaluated on a Random Forest (RF) model. We present the results obtained on our datasets and ML detection modules. Finally, we discuss possible next steps in this research.