Classification and online clustering of zero-day malware
Authors
Year
2024
Published in
Journal of Computer Virology and Hacking Techniques. 2024, 2024 1-14. ISSN 2263-8733.
Type
Article
Department
Abstract
A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples but also classified into malware families. For this purpose, it is necessary to investigate how existing malware families develop and to examine emerging ones. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. The features were extracted by static analysis of portable executable files for the Windows operating system. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity ranging from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.
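The routing step and the purity measure mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the classification score is thresholded, so the rule "maximum class score at or above a threshold" and the value 0.8 are assumptions, and the function names `route_by_score` and `purity` are hypothetical.

```python
from collections import Counter


def route_by_score(scores, threshold):
    """Split sample indices into (classify, cluster) lists.

    Assumption: a sample is 'confident' when its highest class score
    reaches the threshold; everything else is sent to clustering.
    """
    classify, cluster = [], []
    for i, class_scores in enumerate(scores):
        if max(class_scores) >= threshold:
            classify.append(i)
        else:
            cluster.append(i)
    return classify, cluster


def purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority / len(true_labels)


# Toy scores for three samples over four known families (made-up numbers).
scores = [
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.90, 0.03, 0.02],
]
confident, uncertain = route_by_score(scores, threshold=0.8)

# Toy clustering of five samples into two clusters.
toy_purity = purity([0, 0, 1, 1, 1], ["a", "a", "a", "b", "b"])
```

In this toy run, samples 0 and 2 are routed to classification and sample 1 to clustering; purity rewards clusters dominated by a single family, which is why it tends to rise as the number of clusters grows.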
A natural language processing approach to Malware classification
Authors
Mehta, R.; Jurečková, O.; Stamp, M.
Year
2023
Published in
Journal of Computer Virology and Hacking Techniques. 2023, 2023 1-12. ISSN 2263-8733.
Type
Article
Department
Abstract
Many different machine learning and deep learning techniques have been successfully employed for malware detection and classification. Examples of popular learning techniques in the malware domain include Hidden Markov Models (HMM), Random Forests (RF), Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) networks. In this research, we consider a hybrid architecture, where HMMs are trained on opcode sequences, and the resulting hidden states of these trained HMMs are used as feature vectors in various classifiers. In this context, extracting the HMM hidden state sequences can be viewed as a form of feature engineering that is somewhat analogous to techniques that are commonly employed in Natural Language Processing (NLP). We find that this NLP-based approach outperforms other popular techniques on a challenging malware dataset, with an HMM-Random Forest model yielding the best results.
Parallel Instance Filtering for Malware Detection
Authors
Year
2022
Published in
Proceedings of 2022 48th Euromicro Conference on Software Engineering and Advanced Applications. Los Alamitos: IEEE Computer Society, 2022. p. 13-20. ISBN 978-1-6654-6152-8.
Type
Proceedings paper
Department
Abstract
Machine learning algorithms are widely used in the area of malware detection. As the number of samples grows, training classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and to apply a filtering process to each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast, since subsets are processed independently of each other using parallel computation.
We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm significantly reduces the size of a training data set with only a slight decrease in accuracy. The PIF algorithm outperforms the existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.
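The partitioning idea described in the abstract above (grouping instances that share the same nearest enemy) can be sketched as follows. This is only an illustration under stated assumptions: the nearest enemy is taken to be the closest instance with a different label under Euclidean distance, the function names are hypothetical, and the filtering applied inside each subset is not described in the abstract and is therefore omitted.

```python
import math
from collections import defaultdict


def nearest_enemy(i, X, y):
    """Index of the closest instance whose label differs from y[i].

    Assumption: Euclidean distance; ties go to the lowest index.
    """
    return min(
        (j for j in range(len(X)) if y[j] != y[i]),
        key=lambda j: math.dist(X[i], X[j]),
    )


def split_by_nearest_enemy(X, y):
    """Partition instance indices into subsets keyed by their nearest enemy."""
    groups = defaultdict(list)
    for i in range(len(X)):
        groups[nearest_enemy(i, X, y)].append(i)
    return dict(groups)


# Four toy points on a line: two per class.
X = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
y = [0, 0, 1, 1]
groups = split_by_nearest_enemy(X, y)
```

Every instance has exactly one nearest enemy, so the subsets are disjoint and cover the whole data set, which is what makes processing them independently (for example with a process pool) possible.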
Yet Another Algebraic Cryptanalysis of Small Scale Variants of AES
Authors
Year
2022
Published in
Proceedings of the 19th International Conference on Security and Cryptography. Madeira: SciTePress, 2022. p. 415-427. ISSN 2184-7711. ISBN 978-989-758-590-6.
Type
Proceedings paper
Department
Abstract
This work presents new advances in algebraic cryptanalysis of small scale derivatives of AES. We model the cipher as a system of polynomial equations over GF(2), which involves only the variables of the initial key, and we subsequently attempt to solve this system using Gröbner bases. We show, for example, that one of the attacks can recover the secret key for one round of AES-128 in under one minute on a contemporary CPU. This attack requires only two known plaintexts and their corresponding ciphertexts. We also compare the performance of Gröbner bases to a SAT solver, and provide insight into the propagation of diffusion within the cipher.
Improving the Classification of Malware Families Using a Learned Distance for Low Dimensions (Zlepšení klasifikace malwarových rodin pomocí naučené vzdálenosti pro nízké dimenze)
Authors
Year
2022
Published in
Sborník příspěvků PAD 2021 Počítačové architektury & diagnostika. Liberec: Technická univerzita v Liberci, 2022. p. 23-26. ISBN 978-80-7494-592-2.
Type
Proceedings paper
Department
Abstract
In this paper, we deal with selected state-of-the-art distance learning techniques that have been applied to the problem of malware family classification, focusing on low-dimensional representations of the input feature space. The goal of distance learning algorithms is to find the most suitable distance parameters with respect to a given optimization criterion. In our research, the distance learning algorithms learn from metadata contained in the headers of executable files in the Portable Executable file format. Several experiments were conducted on our dataset of 14,000 samples consisting of six prevalent malware families and benign files. The experimental results showed that good classification results can be achieved even for two-dimensional feature vectors.
Improving Classification of Malware Families using Learning a Distance Metric
Authors
Year
2021
Published in
Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 643-652. ISSN 2184-4356. ISBN 978-989-758-491-6.
Type
Proceedings paper
Department
Abstract
The objective of malware family classification is to assign a tested sample to the correct malware family. This paper concerns the application of selected state-of-the-art distance metric learning techniques to malware family classification. The goal of distance metric learning algorithms is to find the most appropriate distance metric parameters with respect to some optimization criteria. The distance metric learning algorithms considered in our research learn from metadata, mostly contained in the headers of executable files in the PE file format. Several experiments have been conducted on a dataset of 14,000 samples consisting of six prevalent malware families and benign files. The experimental results showed that the average precision and recall of the k-Nearest Neighbors algorithm using the distance learned on training data improved significantly compared with the non-learned distance. The k-Nearest Neighbors classifier using the Mahalanobis distance metric learned by the Metric Learning for Kernel Regression method achieved an average precision and recall of 97.04% each, compared to Random Forest, which achieved the best classification results among the state-of-the-art ML algorithms considered in our experiments, with an average precision of 96.44% and an average recall of 96.41%.
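To make the role of a learned Mahalanobis distance in k-Nearest Neighbors concrete, here is a minimal sketch. It does not reproduce the paper's method: the matrix `M` below is a hand-picked, hypothetical positive semidefinite matrix standing in for one produced by a metric learning algorithm, and the data points, labels, and function names are invented for illustration. The distance used is d(x, y) = sqrt((x - y)^T M (x - y)).

```python
import math
from collections import Counter


def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)); M is assumed positive semidefinite."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(M[r][c] * d[c] for c in range(len(d))) for r in range(len(d))]
    return math.sqrt(sum(d[r] * Md[r] for r in range(len(d))))


def knn_predict(query, X, labels, M, k=3):
    """Majority label among the k training points closest to query under M."""
    nearest = sorted(range(len(X)), key=lambda i: mahalanobis(query, X[i], M))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Toy data: the first feature separates the classes, the second is mostly noise.
X = [(0.0, 0.0), (0.2, 5.0), (1.0, 0.1), (1.1, 4.0)]
labels = ["A", "A", "B", "B"]
query = (0.1, 4.0)

identity = [[1.0, 0.0], [0.0, 1.0]]   # plain Euclidean distance
M = [[1.0, 0.0], [0.0, 0.01]]         # hypothetical 'learned' matrix: down-weights feature 2

pred_euclidean = knn_predict(query, X, labels, identity, k=1)
pred_learned = knn_predict(query, X, labels, M, k=1)
```

With the identity matrix the noisy second feature dominates and the query is assigned to "B"; with the down-weighted matrix the nearest neighbor becomes a class "A" point. A metric learning algorithm aims to find such a matrix from training data rather than by hand.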