Mgr. Olha Jurečková

Classification and online clustering of zero-day malware

Authors

Jurečková, O.; Jureček, M.; Stamp, M.; Di Troia, F.; Lórencz, R.

Year

2024

Published in

Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 579-592. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-024-00513-5

Department

Department of Information Security

Abstract

A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples but also classified into malware families. To this end, it is necessary to investigate how existing malware families develop and to examine emerging families. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset: four in the training set and three additional new families in the test set. The features were extracted by static analysis of portable executable files for the Windows operating system. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving purity ranging from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.
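The routing step and the purity measure mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed confidence threshold, the function names `route_samples` and `purity`, and the placeholder family names are assumptions made for illustration only. The abstract states only that the decision is based on the classification score of the multilayer perceptron.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def route_samples(
    scores: Sequence[Sequence[float]],
    family_names: Sequence[str],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Split samples into confidently classified ones and ones left for clustering."""
    classified: List[Tuple[int, str]] = []  # (sample index, predicted family)
    to_cluster: List[int] = []              # indices handed to the clustering stage
    for i, row in enumerate(scores):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            classified.append((i, family_names[best]))
        else:
            to_cluster.append(i)
    return classified, to_cluster


def purity(cluster_ids: Sequence[int], true_labels: Sequence[str]) -> float:
    """Fraction of samples that belong to the majority label of their cluster."""
    by_cluster: Dict[int, Counter] = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)
```

In this sketch, a sample whose highest class score falls below the threshold is treated as a candidate for a new family; any clustering method (the paper uses a self-organizing map) can then be applied to the returned indices and evaluated with `purity`.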

Reducing overdefined systems of polynomial equations derived from small scale variants of the AES via data mining methods

Authors

Berušková, J.; Jureček, M.; Jurečková, O.

Year

2024

Published in

Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 885-900. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-024-00540-2

Department

Department of Information Security

Abstract

This paper deals with reducing the secret key computation time of small scale variants of the AES cipher using algebraic cryptanalysis, which is accelerated by data mining methods. This work is based on the known plaintext attack and aims to speed up the calculation of the secret key by processing the polynomial equations extracted from plaintext-ciphertext pairs. Specifically, we propose to transform the overdefined system of polynomial equations over GF(2) into a new system so that the computation of the Gröbner basis using the F4 algorithm takes less time than in the case of the original system. The main idea is to group similar polynomials into clusters, and for each cluster, sum the two most similar polynomials, resulting in simpler polynomials. We compare different data mining techniques for finding similar polynomials, such as clustering or locality-sensitive hashing (LSH). Experimental results show that, among the methods we consider in this work, the LSH technique yields the system of equations whose Gröbner basis can be calculated the fastest. Experimental results also show that the time to calculate the Gröbner basis for the transformed system of equations is significantly lower than for the original, non-transformed system. This paper demonstrates that reducing an overdefined system of equations reduces the computation time for finding a secret key.
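The idea of summing the two most similar polynomials in a cluster can be sketched as follows. This is an illustrative toy, not the method from the paper: it assumes Boolean polynomials (so each monomial is a set of variable indices), it uses Jaccard similarity on monomial sets as a stand-in similarity measure, and the helper names `poly`, `add`, `jaccard` and `sum_most_similar` are hypothetical. Over GF(2), adding two polynomials cancels every monomial they share, which is why the sum of two similar polynomials tends to be shorter.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Sequence

Monomial = FrozenSet[int]          # set of variable indices; empty set is the constant 1
Polynomial = FrozenSet[Monomial]   # set of distinct monomials with coefficient 1


def poly(*monomials: Iterable[int]) -> Polynomial:
    """Build a polynomial from monomials given as iterables of variable indices."""
    return frozenset(frozenset(m) for m in monomials)


def add(p: Polynomial, q: Polynomial) -> Polynomial:
    """Addition over GF(2): shared monomials cancel (symmetric difference)."""
    return p ^ q


def jaccard(p: Polynomial, q: Polynomial) -> float:
    """Assumed similarity measure: shared monomials divided by all monomials."""
    union = p | q
    return len(p & q) / len(union) if union else 1.0


def sum_most_similar(cluster: Sequence[Polynomial]) -> Polynomial:
    """Return the GF(2) sum of the most similar pair in a cluster (needs >= 2 items)."""
    a, b = max(combinations(cluster, 2), key=lambda pair: jaccard(*pair))
    return add(a, b)


# x0*x1 + x2 + 1  and  x0*x1 + x2 + x3  share two monomials, so their sum is x3 + 1.
p = poly((0, 1), (2,), ())
q = poly((0, 1), (2,), (3,))
r = poly((4,), (5,))
simplified = sum_most_similar([p, q, r])
```

How the resulting polynomial is placed back into the system is not specified in the abstract, so the sketch only returns the sum.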

A natural language processing approach to Malware classification

Authors

Mehta, R.; Jurečková, O.; Stamp, M.

Year

2023

Published in

Journal of Computer Virology and Hacking Techniques. 2023, 2023 1-12. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-023-00506-w

Department

Department of Information Security

Abstract

Many different machine learning and deep learning techniques have been successfully employed for malware detection and classification. Examples of popular learning techniques in the malware domain include Hidden Markov Models (HMM), Random Forests (RF), Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) networks. In this research, we consider a hybrid architecture, where HMMs are trained on opcode sequences, and the resulting hidden states of these trained HMMs are used as feature vectors in various classifiers. In this context, extracting the HMM hidden state sequences can be viewed as a form of feature engineering that is somewhat analogous to techniques that are commonly employed in Natural Language Processing (NLP). We find that this NLP-based approach outperforms other popular techniques on a challenging malware dataset, with an HMM-Random Forest model yielding the best results.

Parallel Instance Filtering for Malware Detection

Authors

Jureček, M.; Jurečková, O.

Year

2022

Published in

Proceedings of 2022 48th Euromicro Conference on Software Engineering and Advanced Applications. Los Alamitos: IEEE Computer Society, 2022. p. 13-20. ISBN 978-1-6654-6152-8.

Type

Conference paper

DOI

10.1109/SEAA56994.2022.00012

Department

Department of Information Security

Abstract

Machine learning algorithms are widely used in the area of malware detection. As the number of samples grows, training classification algorithms becomes increasingly expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process to each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast, since subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm significantly reduces the size of a training data set with only a slight decrease in accuracy. The PIF algorithm outperforms the existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.
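The partitioning step described in the abstract can be sketched under a few assumptions. The sketch takes the nearest enemy of an instance to be its closest instance with a different class label, uses Euclidean distance, and breaks ties by lowest index; the names `nearest_enemy` and `split_by_nearest_enemy` are hypothetical. The filtering applied to each subset is not described in the abstract and is deliberately left out.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def nearest_enemy(i: int, X: Sequence[Sequence[float]], y: Sequence[int]) -> int:
    """Index of the closest instance whose label differs from that of instance i."""
    enemies = (j for j in range(len(X)) if y[j] != y[i])
    return min(enemies, key=lambda j: math.dist(X[i], X[j]))


def split_by_nearest_enemy(
    X: Sequence[Sequence[float]], y: Sequence[int]
) -> Dict[int, List[int]]:
    """Group instance indices by the index of their nearest enemy.

    Every instance has exactly one nearest enemy, so the groups are
    non-overlapping and together cover the whole data set.
    """
    groups: Dict[int, List[int]] = defaultdict(list)
    for i in range(len(X)):
        groups[nearest_enemy(i, X, y)].append(i)
    return dict(groups)


# Toy data: five points on a line with two classes.
X = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0)]
y = [0, 0, 1, 1, 0]
subsets = split_by_nearest_enemy(X, y)
```

Because each subset depends only on its own members, the subsets could be passed to separate workers, which is the property the abstract credits for the speed of PIF. The brute-force search here is quadratic and is meant only to show the grouping.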

Yet Another Algebraic Cryptanalysis of Small Scale Variants of AES

Authors

Bielik, M.; Jureček, M.; Jurečková, O.; Lórencz, R.

Year

2022

Published in

Proceedings of the 19th International Conference on Security and Cryptography. Madeira: SciTePress, 2022. p. 415-427. ISSN 2184-7711. ISBN 978-989-758-590-6.

Type

Conference paper

DOI

10.5220/0011327900003283

Department

Department of Information Security

Abstract

This work presents new advances in algebraic cryptanalysis of small scale derivatives of AES. We model the cipher as a system of polynomial equations over GF(2), which involves only the variables of the initial key, and we subsequently attempt to solve this system using Gröbner bases. We show, for example, that one of the attacks can recover the secret key for one round of AES-128 in under one minute on a contemporary CPU. This attack requires only two known plaintexts and their corresponding ciphertexts. We also compare the performance of Gröbner bases to that of a SAT solver, and provide insight into the propagation of diffusion within the cipher.

Zlepšení klasifikace malwarových rodin pomocí naučené vzdálenosti pro nízké dimenze [Improving Malware Family Classification Using a Learned Distance for Low Dimensions]

Authors

Jurečková, O.

Year

2022

Published in

Sborník příspěvků PAD 2021 Počítačové architektury & diagnostika. Liberec: Technická univerzita v Liberci, 2022. p. 23-26. ISBN 978-80-7494-592-2.

Type

Conference paper

Department

Department of Information Security

Abstract

In this paper, we examine selected state-of-the-art distance metric learning techniques applied to the problem of malware family classification, focusing on low-dimensional representations of the input feature space. The goal of distance metric learning algorithms is to find the most appropriate distance parameters with respect to a given optimization criterion. The distance metric learning algorithms in our research learn from metadata contained in the headers of executable files in the Portable Executable file format. Several experiments were conducted on our dataset of 14,000 samples consisting of six prevalent malware families and benign files. The experimental results showed that good classification results can be achieved even with two-dimensional feature vectors.

Improving Classification of Malware Families using Learning a Distance Metric

Authors

Jureček, M.; Jurečková, O.; Lórencz, R.

Year

2021

Published in

Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 643-652. ISSN 2184-4356. ISBN 978-989-758-491-6.

Type

Conference paper

DOI

10.5220/0010326306430652

Department

Department of Information Security

Abstract

The objective of malware family classification is to assign a tested sample to the correct malware family. This paper concerns the application of selected state-of-the-art distance metric learning techniques to malware family classification. The goal of distance metric learning algorithms is to find the most appropriate distance metric parameters with respect to some optimization criteria. The distance metric learning algorithms considered in our research learn from metadata, mostly contained in the headers of executable files in the PE file format. Several experiments were conducted on a dataset of 14,000 samples consisting of six prevalent malware families and benign files. The experimental results showed that the average precision and recall of the k-Nearest Neighbors algorithm improved significantly when it used a distance learned on training data instead of a non-learned distance. The k-Nearest Neighbors classifier using the Mahalanobis distance metric learned by the Metric Learning for Kernel Regression method achieved an average precision and recall of 97.04% each. By comparison, Random Forest, which achieved the best classification results among the state-of-the-art ML algorithms considered in our experiments, reached an average precision of 96.44% and an average recall of 96.41%.
