Researchers from CESNET have created and published a new dataset for studying how network traffic changes over time. This unique dataset represents a major step in the use of machine learning for threat detection in network traffic, and it has been published in the prestigious journal Nature Scientific Data.
CESNET researchers are investigating this domain in the security research project “Flow-based encrypted traffic analysis,” funded by the Ministry of the Interior of the Czech Republic. Although several highly innovative and accurate machine learning detectors have been developed during the project, their mass deployment is still hindered by several hard problems. One of them is so-called data drift: a machine learning model is trained on data that later become outdated and no longer reflect the current state.
Datasets in everyday life and how they work
You may have encountered a situation where you tried to log into your device using facial recognition (such as Apple Face ID or Windows Hello), but the device simply did not recognize you. This happened because the system was trained on your historical appearance, which may have changed; for example, your face was slightly swollen after a sleepless night, or you changed your hairstyle. In this case, data drift has occurred: the training data (your appearance) was out of date, and the verification did not work correctly.
Biometric facial verification, however, largely solves the data drift problem through regular re-training. Each time the device successfully verifies your face, it updates its model so that it recognizes you the next time. This usually works because our appearance changes relatively slowly. If there is a sudden change (like a shaved beard), verification can easily fail, and a backup method (like a password) is needed instead.
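The re-training loop described above can be illustrated with a minimal Python sketch. It is not how Face ID or Windows Hello work internally; the class name `AdaptiveVerifier`, the distance threshold, and the update rate are all hypothetical choices made for illustration. The sketch keeps a stored template and, after every successful match, blends the new sample into it.

```python
import math


class AdaptiveVerifier:
    """Toy verifier (hypothetical): keeps a template vector and
    nudges it toward every sample it accepts."""

    def __init__(self, template, threshold=1.0, rate=0.5):
        self.template = list(template)
        self.threshold = threshold  # largest Euclidean distance still accepted
        self.rate = rate            # how strongly an accepted sample moves the template

    def verify(self, sample):
        distance = math.dist(self.template, sample)
        if distance > self.threshold:
            # Too different from the template: reject and leave it unchanged.
            # A real system would now fall back to a password.
            return False
        # Successful match: "re-train" by blending the sample into the template.
        self.template = [
            (1 - self.rate) * t + self.rate * s
            for t, s in zip(self.template, sample)
        ]
        return True


# Gradual change: each sample is only slightly further from the original.
gradual = AdaptiveVerifier([0.0, 0.0])
gradual_results = [gradual.verify([x, 0.0]) for x in (0.5, 1.0, 1.5, 2.0)]

# Sudden change: the same final sample, but with no intermediate steps.
sudden = AdaptiveVerifier([0.0, 0.0])
sudden_result = sudden.verify([2.0, 0.0])
```

With these assumed values, every step of the gradual sequence is accepted because the template follows the change, while the same final sample presented without the intermediate steps is rejected.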
The importance of datasets for the security of network traffic
A similar problem arises in cybersecurity. Unlike in most everyday situations, data drift in cybersecurity is usually sudden and unpredictable. New attack vectors used by cybercriminals, or even minor updates to HTTPS/TLS certificates, can fundamentally disrupt the performance of machine learning models.
In cybersecurity, we typically have no backup detection method (the equivalent of a password as an alternative login), so it is critical to investigate this phenomenon. Because almost no datasets suitable for this research were available, researchers had limited options.
A year of network traffic in a groundbreaking dataset
Scientists from CESNET and the Faculty of Information Technology of the Czech Technical University in Prague, including Karel Hynek, Jan Luxemburk, Jaroslav Pešek, Tomáš Čejka, and Pavel Šiška, have published the unique dataset in the prestigious journal Nature Scientific Data. The dataset contains an entire year of anonymized network traffic from the backbone links of the national academic network. Until now, the scientific community has had only datasets capturing a few days or a week, because long-term collection is difficult and the overall data volume is large. A dataset covering such a long period is unprecedented and a key step in addressing challenges such as data drift and its negative impact on network traffic security.
The dataset helps to analyze the declining accuracy of existing algorithms and supports the development of new adaptive methods. As technology rapidly evolves, ongoing research and the implementation of effective solutions are crucial to protect against cyber threats and enhance digital security. In this context, CESNET is establishing itself as a leader in network security thanks to its cutting-edge research. The published dataset is one example of the high-quality results that enable the expert community to respond effectively to current and future challenges in cybersecurity.
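To make the idea of declining accuracy concrete, here is a small Python sketch of the kind of analysis a long-running dataset allows. Everything in it is an assumption made for illustration: the data are synthetic, the one-feature threshold classifier is a toy, and the names `train_threshold`, `accuracy`, and `make_period` are hypothetical. It compares a model trained once and then frozen with a model re-trained on the preceding period.

```python
def train_threshold(samples):
    """Fit a one-feature classifier: the midpoint between the two class means."""
    pos = [x for x, label in samples if label == 1]
    neg = [x for x, label in samples if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, samples):
    """Share of samples for which `x > threshold` matches the label."""
    correct = sum((x > threshold) == bool(label) for x, label in samples)
    return correct / len(samples)


def make_period(shift):
    """Synthetic period: class 0 around 1.0, class 1 around 3.0,
    with both classes moved by `shift` to imitate drift."""
    base = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]
    return [(x + shift, label) for x, label in base]


# Four consecutive periods with steadily growing drift (assumed values).
periods = [make_period(shift) for shift in (0.0, 0.4, 0.8, 1.2)]

# Frozen model: trained on the first period only.
frozen_threshold = train_threshold(periods[0])
frozen = [accuracy(frozen_threshold, p) for p in periods]

# Adaptive model: re-trained on the period just before the one it is tested on.
retrained = [
    accuracy(train_threshold(periods[max(i - 1, 0)]), p)
    for i, p in enumerate(periods)
]
```

In this synthetic setup the frozen model's accuracy falls as the drift grows, while the re-trained model keeps up. Real traffic would need a far richer model and evaluation, but the comparison across time periods has the same shape.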
The dataset is available via the journal Nature Scientific Data.