Academics from FIT CTU Publish Article in Nature Scientific Data

A team of scientists from the CESNET association and the Faculty of Information Technology at the Czech Technical University in Prague (FIT CTU), consisting of Ing. Karel Hynek, Ph.D., Ing. Jan Luxemburk, Ing. Jaroslav Pešek, Assoc. Prof. Ing. Tomáš Čejka, Ph.D., and Ing. Pavel Šiška, has created a unique dataset and published it in the prestigious journal Nature Scientific Data. The dataset is unprecedented in that it covers a full year of anonymized network traffic from the backbone lines of the national academic network. Until now, only datasets covering a few days or weeks had been available, because long-term collection is difficult and the volume of data is enormous.

The importance of machine learning models for detecting security threats in computer networks has long been recognized by both the scientific and professional communities. Researchers from CESNET are investigating the use of machine learning methods on network traffic in the project "Analysis of Encrypted Traffic Using Network Flows." Although several highly innovative and accurate machine learning detectors have already been developed within the project, their widespread deployment is still hindered by a number of hard problems. One of the most frequently mentioned is "data drift", a phenomenon in which a machine learning model trained on outdated data no longer reflects the current state of the network.

"Machine learning models often rely on data that gradually becomes outdated. Changes in network traffic due to new attacks or services can cause models to become less accurate or even fail completely," says researcher Karel Hynek. "That’s why we created this unique dataset, capturing network traffic over an entire year. There is no comparable dataset due to the complexity of its creation."

"We believe that datasets like this will help both Czech and international researchers in the field of network security. Only through research on the complex network traffic of large real-world networks can we improve threat detection algorithms to ensure they work reliably in practice," adds Tomáš Čejka.

Datasets in Everyday Life and Their Functioning

You may have encountered a situation where you tried to unlock your phone or log into your computer using facial recognition (such as Apple Face ID or Windows Hello), but the device simply didn’t recognize you. This can happen because the system was trained on your past appearance, which may have changed in the meantime: your face may be slightly swollen after a sleepless night, or a new hairstyle may frame your face differently. In such a case, data drift has occurred. The training data (your appearance) was outdated, so verification didn’t work properly.

However, biometric facial verification deals with data drift effectively through regular retraining. Every time the device successfully verifies your face, it updates its stored record of your appearance so that it will recognize you the next time. This usually works because our appearance changes relatively slowly. If the change is sudden (such as shaving off a beard), verification often fails, and a backup method, such as entering a password, must be used.
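The retraining idea described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not how Face ID or Windows Hello is actually implemented: the face is reduced to a short list of numbers, and the function names, the acceptance threshold, and the update rate are all hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify_and_update(template, sample, threshold=1.0, rate=0.5):
    """Return (accepted, new_template).

    threshold and rate are illustrative values, not taken from any real system.
    On a successful match the stored template moves a fraction `rate` of the
    way towards the new sample. On a failure the template is left unchanged
    and the caller is expected to fall back to another method, such as a
    password.
    """
    if distance(template, sample) > threshold:
        return False, template
    updated = [t + rate * (s - t) for t, s in zip(template, sample)]
    return True, updated


def run_sequence(template, samples):
    """Verify a sequence of samples, carrying the updated template forward."""
    results = []
    for sample in samples:
        accepted, template = verify_and_update(template, sample)
        results.append(accepted)
    return results, template


# Slow change: each sample differs only slightly from the previous one.
gradual = [[0.3 * i, 0.0] for i in range(1, 6)]
gradual_results, final_template = run_sequence([0.0, 0.0], gradual)

# Sudden change: the sample is far from the stored template.
sudden_accepted, _ = verify_and_update([0.0, 0.0], [3.0, 0.0])
```

In this sketch every sample in the slowly changing sequence is accepted, because the template keeps up with the change, while the sudden jump is rejected. A template that was never updated would also reject the last gradual samples, since they lie farther from the starting point than the threshold allows.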

The Importance of Datasets for Network Traffic Security

A similar problem arises in the field of cybersecurity. Unlike in most everyday situations, however, data drift in cybersecurity is usually sudden and unpredictable. Cybercriminals may discover new attack methods, or the deployment of new services on the network can dramatically change the character of the traffic. Even small updates to certificates can significantly alter network data and disrupt the functionality of machine learning models.
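One simple way to notice such a sudden change is to compare how traffic is distributed across categories in two time windows. The Python sketch below does this with the total variation distance. It is only an illustration and not the method used by the authors of the dataset: the service labels, the flow counts, and the alert threshold are all made-up assumptions.

```python
from collections import Counter


def distribution(labels):
    """Turn a list of category labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two categorical distributions.

    0.0 means the distributions are identical, 1.0 means they do not overlap.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drift_detected(reference, current, threshold=0.2):
    """Flag drift when the distance exceeds a (hypothetical) threshold."""
    distance = total_variation(distribution(reference), distribution(current))
    return distance > threshold


# Hypothetical flow labels; real traffic would need far richer features.
reference_week = ["web"] * 70 + ["mail"] * 20 + ["dns"] * 10
ordinary_week = ["web"] * 68 + ["mail"] * 21 + ["dns"] * 11
after_new_service = (
    ["web"] * 40 + ["mail"] * 10 + ["dns"] * 10 + ["new_service"] * 40
)

ordinary_drift = drift_detected(reference_week, ordinary_week)
sudden_drift = drift_detected(reference_week, after_new_service)
```

With these invented numbers, the ordinary week stays well below the threshold, while the week in which a new service appears is flagged. A check like this only signals that the data has changed. It does not say whether a detector trained on the older data still works, which is why long-term datasets are needed to study the effect.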

In cybersecurity, we typically do not have backup detection methods that work 100% of the time, which makes it crucial to study this phenomenon. Because suitable datasets were practically nonexistent, scientists have had limited options for this research until now. The newly published dataset finally makes it possible.

More in the article

The person responsible for the content of this page: Bc. Veronika Dvořáková