Mgr. Olha Jurečková

Classification and online clustering of zero-day malware

Authors

Jurečková, O.; Jureček, M.; Stamp, M.; Di Troia, F.; Lórencz, R.

Year

2024

Published

Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 579-592. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-024-00513-5

Departments

Department of Information Security

Annotation

A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples but also classified into malware families. For this purpose, it is necessary to investigate how existing malware families develop and to examine emerging families. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. The features were extracted by static analysis of portable executable files for the Windows operating system. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.
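
The score-based routing and the purity measure mentioned in the annotation can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `route_samples` and `purity`, the threshold of 0.9, and the list-of-probabilities input format are hypothetical and are not taken from the paper.

```python
from collections import Counter


def route_samples(scores, threshold=0.9):
    """Split samples by the classifier's top class score.

    scores: list of per-class probability lists (e.g. MLP outputs).
    threshold: hypothetical cut-off, not the value used in the paper.
    """
    classified, to_cluster = [], []
    for i, probs in enumerate(scores):
        top = max(probs)
        if top >= threshold:
            # Confident prediction: assign to an existing family.
            classified.append((i, probs.index(top)))
        else:
            # Low confidence: possibly a new family, send to clustering.
            to_cluster.append(i)
    return classified, to_cluster


def purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return majority / len(true_labels)
```

In such a setup, the samples returned in `to_cluster` would be passed to a clustering method (a self-organizing map in the paper), and `purity` would be computed on the resulting cluster assignments.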

Reducing overdefined systems of polynomial equations derived from small scale variants of the AES via data mining methods

Authors

Berušková, J.; Jureček, M.; Jurečková, O.

Year

2024

Published

Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 885-900. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-024-00540-2

Departments

Department of Information Security

Annotation

This paper deals with reducing the secret key computation time of small scale variants of the AES cipher using algebraic cryptanalysis, which is accelerated by data mining methods. This work is based on the known plaintext attack and aims to speed up the calculation of the secret key by processing the polynomial equations extracted from plaintext-ciphertext pairs. Specifically, we propose to transform the overdefined system of polynomial equations over GF(2) into a new system so that the computation of the Gröbner basis using the F4 algorithm takes less time than in the case of the original system. The main idea is to group similar polynomials into clusters, and for each cluster, sum the two most similar polynomials, resulting in simpler polynomials. We compare different data mining techniques for finding similar polynomials, such as clustering or locality-sensitive hashing (LSH). Experimental results show that using the LSH technique, we get a system of equations for which we can calculate the Gröbner basis the fastest compared to the other methods that we consider in this work. Experimental results also show that the time to calculate the Gröbner basis for the transformed system of equations is significantly reduced compared to the case when the Gröbner basis was calculated from the original non-transformed system. This paper demonstrates that reducing an overdefined system of equations reduces the computation time for finding a secret key.
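
The idea of summing the two most similar polynomials in a cluster can be illustrated with a minimal sketch. The representation (a polynomial over GF(2) as a set of monomials, each monomial a set of variable names), the Jaccard similarity, and the names `jaccard` and `sum_most_similar` are assumptions made for this example; the paper's actual similarity measures and data structures may differ.

```python
from itertools import combinations


def jaccard(p, q):
    """Jaccard similarity of two nonzero polynomials viewed as monomial sets."""
    return len(p & q) / len(p | q)


def sum_most_similar(cluster):
    """Return the GF(2) sum of the two most similar polynomials in a cluster.

    cluster: list of frozensets of monomials; a monomial is a frozenset of
    variable names, and the empty frozenset stands for the constant 1.
    Over GF(2), addition cancels shared monomials, so the sum is the
    symmetric difference of the two monomial sets.
    """
    p, q = max(combinations(cluster, 2), key=lambda pair: jaccard(*pair))
    return p ^ q


# Toy example: p1 = xy + x + 1, p2 = xy + x + z, p3 = z
one = frozenset()
x, z, xy = frozenset({"x"}), frozenset({"z"}), frozenset({"x", "y"})
p1 = frozenset({xy, x, one})
p2 = frozenset({xy, x, z})
p3 = frozenset({z})
result = sum_most_similar([p1, p2, p3])  # p1 + p2 = z + 1
```

The more monomials two polynomials share, the more terms cancel in their sum, which is why similar pairs yield simpler polynomials.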

A natural language processing approach to Malware classification

Authors

Mehta, R.; Jurečková, O.; Stamp, M.

Year

2023

Published

Journal of Computer Virology and Hacking Techniques. 2023, 2023 1-12. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-023-00506-w

Departments

Department of Information Security

Annotation

Many different machine learning and deep learning techniques have been successfully employed for malware detection and classification. Examples of popular learning techniques in the malware domain include Hidden Markov Models (HMM), Random Forests (RF), Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) networks. In this research, we consider a hybrid architecture, where HMMs are trained on opcode sequences, and the resulting hidden states of these trained HMMs are used as feature vectors in various classifiers. In this context, extracting the HMM hidden state sequences can be viewed as a form of feature engineering that is somewhat analogous to techniques that are commonly employed in Natural Language Processing (NLP). We find that this NLP-based approach outperforms other popular techniques on a challenging malware dataset, with an HMM-Random Forest model yielding the best results.

Improving Classification of Malware Families using a Learned Distance Metric for Low Dimensions

Authors

Jurečková, O.

Year

2022

Published

Sborník příspěvků PAD 2021 Počítačové architektury & diagnostika. Liberec: Technická univerzita v Liberci, 2022. p. 23-26. ISBN 978-80-7494-592-2.

Type

Proceedings paper

Departments

Department of Information Security

Annotation

In this paper, we discuss selected state-of-the-art distance metric learning techniques that have been used for the malware family classification problem, focusing on low-dimensional representations of the input feature space. The goal of distance metric learning algorithms is to find the most suitable distance parameters with respect to a given optimization criterion. In our research, distance metric learning algorithms learn from the metadata contained in the executable file headers in the Portable Executable file format. Several experiments were performed on our dataset with 14,000 samples consisting of six prevalent malware families and benign files. Experimental results have shown that good classification results can be achieved even for two-dimensional symptom vectors.

Parallel Instance Filtering for Malware Detection

Authors

Jureček, M.; Jurečková, O.

Year

2022

Published

Proceedings of 2022 48th Euromicro Conference on Software Engineering and Advanced Applications. Los Alamitos: IEEE Computer Society, 2022. p. 13-20. ISBN 978-1-6654-6152-8.

Type

Proceedings paper

DOI

10.1109/SEAA56994.2022.00012

Departments

Department of Information Security

Annotation

Machine learning algorithms are widely used in the area of malware detection. As the number of samples grows, training classification algorithms becomes increasingly expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast since subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm reduces the size of a training data set significantly with only a slight decrease in accuracy. The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.
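
The partitioning step described above, grouping instances that share the same nearest enemy, can be sketched as follows. Here the nearest enemy of an instance is taken to be the closest instance with a different class label, and Euclidean distance is assumed; the function names are hypothetical, the brute-force search is for illustration only, and the filtering applied to each subset is not shown because the annotation does not specify it.

```python
import math
from collections import defaultdict


def nearest_enemy(i, X, y):
    """Index of the closest instance whose label differs from y[i]."""
    best, best_d = None, math.inf
    for j, (xj, yj) in enumerate(zip(X, y)):
        if yj != y[i]:
            d = math.dist(X[i], xj)  # Euclidean distance (an assumption)
            if d < best_d:
                best, best_d = j, d
    return best


def partition_by_nearest_enemy(X, y):
    """Group instance indices by the index of their nearest enemy.

    The groups are non-overlapping and cover the whole data set, so each
    could be filtered independently, e.g. by a pool of worker processes.
    """
    groups = defaultdict(list)
    for i in range(len(X)):
        groups[nearest_enemy(i, X, y)].append(i)
    return dict(groups)


# Toy data: two classes on a line.
X = [(0, 0), (1, 0), (5, 0), (6, 0)]
y = [0, 0, 1, 1]
groups = partition_by_nearest_enemy(X, y)
```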

Yet Another Algebraic Cryptanalysis of Small Scale Variants of AES

Authors

Bielik, M.; Jureček, M.; Jurečková, O.; Lórencz, R.

Year

2022

Published

Proceedings of the 19th International Conference on Security and Cryptography. Madeira: SciTePress, 2022. p. 415-427. ISSN 2184-7711. ISBN 978-989-758-590-6.

Type

Proceedings paper

DOI

10.5220/0011327900003283

Departments

Department of Information Security

Annotation

This work presents new advances in algebraic cryptanalysis of small scale derivatives of AES. We model the cipher as a system of polynomial equations over GF(2), which involves only the variables of the initial key, and we subsequently attempt to solve this system using Gröbner bases. We show, for example, that one of the attacks can recover the secret key for one round of AES-128 in under one minute on a contemporary CPU. This attack requires only two known plaintexts and their corresponding ciphertexts. We also compare the performance of Gröbner bases to a SAT solver, and provide an insight into the propagation of diffusion within the cipher.

Improving Classification of Malware Families using Learning a Distance Metric

Authors

Jureček, M.; Jurečková, O.; Lórencz, R.

Year

2021

Published

Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 643-652. ISSN 2184-4356. ISBN 978-989-758-491-6.

Type

Proceedings paper

DOI

10.5220/0010326306430652

Departments

Department of Information Security

Annotation

The objective of malware family classification is to assign a tested sample to the correct malware family. This paper concerns the application of selected state-of-the-art distance metric learning techniques to malware family classification. The goal of distance metric learning algorithms is to find the most appropriate distance metric parameters with respect to some optimization criteria. The distance metric learning algorithms considered in our research learn from metadata, mostly contained in the headers of executable files in the PE file format. Several experiments have been conducted on the dataset with 14,000 samples consisting of six prevalent malware families and benign files. The experimental results showed that the average precision and recall of the k-Nearest Neighbors algorithm using the distance learned on training data were improved significantly compared to when the non-learned distance was used. The k-Nearest Neighbors classifier using the Mahalanobis distance metric learned by the Metric Learning for Kernel Regression method achieved an average precision and recall of 97.04% each, compared to Random Forest, which reached an average precision of 96.44% and an average recall of 96.41% and achieved the best classification results among the state-of-the-art ML algorithms considered in our experiments.
