Mgr. Martin Jureček, Ph.D.

Publikace

A comparison of adversarial malware generators

Autoři
Louthánová, P.; Kozák, M.; Jureček, M.; Stamp, M.; Di Troia, F.
Rok
2024
Publikováno
Journal of Computer Virology and Hacking Techniques. 2024, 2024 1-17. ISSN 2263-8733.
Typ
Článek
Anotace
Machine learning has proven to be a valuable tool for automated malware detection, but machine learning systems have also been shown to be subject to adversarial attacks. This paper summarizes and compares related work on generating adversarial malware samples, specifically malicious Windows Portable Executable files. In contrast with previous research, we not only compare generators of adversarial malware examples theoretically, but we also provide an experimental comparison and evaluation for practical usability. We use gradient-based, evolutionary-based, and reinforcement-based approaches to create adversarial samples, which we test against selected antivirus products. The results show that applying optimized modifications to previously detected malware can lead to incorrect classification of the file as benign. Moreover, generated malicious samples can be effectively employed against detection models other than those used to produce them, and combinations of methods can construct new instances that avoid detection. Based on our findings, the Gym-malware generator, which uses reinforcement learning, has the greatest practical potential. This generator has the fastest average sample production time of 5.73 s and the highest average evasion rate of 44.11%. Using the Gym-malware generator in combination with itself further improved the evasion rate to 58.35%. However, other tested methods scored significantly lower in our experiments than reported in the original publications, highlighting the importance of a standardized evaluation environment.

Classification and online clustering of zero-day malware

Autoři
Jurečková, O.; Jureček, M.; Stamp, M.; Di Troia, F.; Lórencz, R.
Rok
2024
Publikováno
Journal of Computer Virology and Hacking Techniques. 2024, 2024 1-14. ISSN 2263-8733.
Typ
Článek
Anotace
A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples, but also classified into malware families. For this purpose, investigating how existing malware families are developed and examining emerging families need to be explored. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. The features were extracted by static analysis of portable executable files for the Windows operating system. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.

Creating valid adversarial examples of malware

Autoři
Kozák, M.; Jureček, M.; Stamp, M.; Di Troia, F.
Rok
2024
Publikováno
Journal of Computer Virology and Hacking Techniques. 2024, 2024 1-15. ISSN 2263-8733.
Typ
Článek
Anotace
Because of its world-class results, machine learning (ML) is becoming increasingly popular as a go-to solution for many tasks. As a result, antivirus developers are incorporating ML models into their toolchains. While these models improve malware detection capabilities, they also carry the disadvantage of being susceptible to adversarial attacks. Although this vulnerability has been demonstrated for many models in white-box settings, a black-box scenario is more applicable in practice for the domain of malware detection. We present a method of creating adversarial malware examples using reinforcement learning algorithms. The reinforcement learning agents utilize a set of functionality-preserving modifications, thus creating valid adversarial examples. Using the proximal policy optimization (PPO) algorithm, we achieved an evasion rate of 53.84% against the gradient-boosted decision tree (GBDT) detector. The PPO agent previously trained against the GBDT classifier scored an evasion rate of 11.41% against the neural network-based classifier MalConv and an average evasion rate of 2.31% against top antivirus programs. Furthermore, we discovered that random application of our functionality-preserving portable executable modifications successfully evades leading antivirus engines, with an average evasion rate of 11.65%. These findings indicate that ML-based models used in malware detection systems are sensitive to adversarial attacks and that better safeguards need to be taken to protect these systems.

Generation of Adversarial Malware and Benign Examples Using Reinforcement Learning

Autoři
Kozák, M.; Jureček, M.; Lórencz, R.
Rok
2022
Publikováno
Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. p. 3-25. Advances in Information Security. vol. 54. ISSN 1568-2633. ISBN 978-3-030-97086-4.
Typ
Kapitola v knize
Anotace
Machine learning is becoming increasingly popular among antivirus developers as a key factor in defence against malware. While machine learning is achieving state-of-the-art results in many areas, it also has drawbacks exploited by many with white-box attacks. Although the white-box scenario is possible in mal- ware detection, the detailed structure of antivirus is often unknown. Consequently, we focused on a pure black-box setup where no information apart from the predicted label is known to the attacker, not even the feature space or predicted score. We implemented our exploratory integrity attack using a reinforcement learning approach on a dataset of portable executable binaries. We tested multiple agent configurations while targeting LightGBM and MalConv classifiers. We achieved an evasion rate of 68.64% and 13.32% against LightGBM and MalConv classifiers, respectively. Besides traditional modelling of malware adversarial samples, we present a setup for creating benign files that can increase the targeted classifier’s false positive rate. This problem was considerably more challenging for our reinforcement learning agents, with an evasion rate of 3.45% and 36.62% against LightGBM and MalConv classifier, respectively. To understand how these attacks transfer from classifiers based purely on machine learning to real-world anti- malware software, we tested the same modified files against seven well-known antiviruses. We achieved an evasion rate of up to 47.09% in malware and 14.29% in benign adversarial attacks.

Interpretability of Machine Learning-Based Results of Malware Detection Using a Set of Rules

Autoři
Dolejš, J.; Jureček, M.
Rok
2022
Publikováno
Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. p. 107-136. Advances in Information Security. vol. 54. ISSN 1568-2633. ISBN 978-3-030-97086-4.
Typ
Kapitola v knize
Anotace
Machine learning plays an indispensable role in modern malware detection; it provides malware researchers with quick and reliable results. On the other hand, the results can be hard to understand as to why a model classified a given file as malicious or benign. This paper focuses on the interpretability of machine learning models’ results using decision lists generated by two rule-based classifiers, I-REP and RIPPER. We use the EMBER dataset, which contains features extracted through static analysis from Portable Executable files, to train various machine learning models. We extract decision lists from the machine learning models’ results using our implementation of I-REP and RIPPER. By taking into account accuracies, true positive and false positive rates of the decision lists, we reason whether the generated decision lists make a good representation of the results. To comprehend the interpretability of the machine learning models, we define Human Most Understandable Model and Interpretability Entropy. This allows us to measure and compare the interpretability among the models. The most interpretable machine learning model by RIPPER was Gaussian Naïve Bayes. Results show that RIPPER is relatively successful at interpreting other machine learning models; however, it needs some improvements to increase true positive rate.

Parallel Instance Filtering for Malware Detection

Rok
2022
Publikováno
Proceedings of 2022 48th Euromicro Conference on Software Engineering and Advanced Applications. Los Alamitos: IEEE Computer Society, 2022. p. 13-20. ISBN 978-1-6654-6152-8.
Typ
Stať ve sborníku
Anotace
Machine learning algorithms are widely used in the area of malware detection. With the growth of sample amounts, training of classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast since subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm reduces the size of a training data set significantly with the only slightly decreased accuracy. The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.

Yet Another Algebraic Cryptanalysis of Small Scale Variants of AES

Autoři
Rok
2022
Publikováno
Proceedings of the 19th International Conference on Security and Cryptography. Madeira: SciTePress, 2022. p. 415-427. ISSN 2184-7711. ISBN 978-989-758-590-6.
Typ
Stať ve sborníku
Anotace
This work presents new advances in algebraic cryptanalysis of small scale derivatives of AES. We model the cipher as a system of polynomial equations over GF(2), which involves only the variables of the initial key, and we subsequently attempt to solve this system using Gröbner bases. We show, for example, that one of the attacks can recover the secret key for one round of AES-128 under one minute on a contemporary CPU. This attack requires only two known plaintexts and their corresponding ciphertexts. We also compare the performance of Gröbner bases to a SAT solver, and provide an insight into the propagation of diffusion within the cipher.

Application of Distance Metric Learning to Automated Malware Detection

Rok
2021
Publikováno
IEEE Access. 2021, 2021(9), 96151-96165. ISSN 2169-3536.
Typ
Článek
Anotace
Distance metric learning aims to find the most appropriate distance metric parameters to improve similarity-based models such as k -Nearest Neighbors or k -Means. In this paper, we apply distance metric learning to the problem of malware detection. We focus on two tasks: (1) to classify malware and benign files with a minimal error rate, (2) to detect as much malware as possible while maintaining a low false positive rate. We propose a malware detection system using Particle Swarm Optimization that finds the feature weights to optimize the similarity measure. We compare the performance of the approach with three state-of-the-art distance metric learning techniques. We find that metrics trained in this way lead to significant improvements in the k -Nearest Neighbors classification. We conducted and evaluated experiments with more than 150,000 Windows-based malware and benign samples. Features consisted of metadata contained in the headers of executable files in the portable executable file format. Our experimental results show that our malware detection system based on distance metric learning achieves a 1.09 % error rate at 0.74 % false positive rate (FPR) and outperforms all machine learning algorithms considered in the experiment. Considering the second task related to keeping minimal FPR, we achieved a 1.15 % error rate at only 0.13 % FPR.

Improving Classification of Malware Families using Learning a Distance Metric

Rok
2021
Publikováno
Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 643-652. ISSN 2184-4356. ISBN 978-989-758-491-6.
Typ
Stať ve sborníku
Anotace
The objective of malware family classification is to assign a tested sample to the correct malware family. This paper concerns the application of selected state-of-the-art distance metric learning techniques to malware families classification. The goal of distance metric learning algorithms is to find the most appropriate distance metric parameters concerning some optimization criteria. The distance metric learning algorithms considered in our research learn from metadata, mostly contained in the headers of executable files in the PE file format. Several experiments have been conducted on the dataset with 14,000 samples consisting of six prevalent malware families and benign files. The experimental results showed that the average precision and recall of the k-Nearest Neighbors algorithm using the distance learned on training data were improved significantly comparing when the non-learned distance was used. The k-Nearest Neighbors classifier using the Mahalanobis distance metric learned by the Metric Learning for Kernel Regression method achieved average precision and recall, both of 97.04% compared to Random Forest with a 96.44% of average precision and 96.41% of average recall, which achieved the best classification results among the state-of-the-art ML algorithms considered in our experiments.

Representation of PE Files using LSTM Networks

Autoři
Jureček, M.; Kozák, M.
Rok
2021
Publikováno
Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 516-525. ISSN 2184-4356. ISBN 978-989-758-491-6.
Typ
Stať ve sborníku
Anotace
An ever-growing number of malicious attacks on IT infrastructures calls for new and efficient methods of protection. In this paper, we focus on malware detection using the Long Short-Term Memory (LSTM) as a preprocessing tool to increase the classification accuracy of machine learning algorithms. To represent the malicious and benign programs, we used features extracted from files in the PE file format. We created a large dataset on which we performed common feature preparation and feature selection techniques. With the help of various LSTM and Bidirectional LSTM (BLSTM) network architectures, we further transformed the collected features and trained other supervised ML algorithms on both transformed and vanilla datasets. Transformation by deep (4 hidden layers) versions of LSTM and BLSTM networks performed well and decreased the error rate of several state-of-the-art machine learning algorithms significantly. For each machine learning algorithm considered in our experiments, the LSTM-based transformation of the feature space results in decreasing the corresponding error rate by more than 58.60 %, in comparison when the feature space was not transformed using LSTM network.

Distance Metric Learning using Particle Swarm Optimization to Improve Static Malware Detection

Rok
2020
Publikováno
Proceedings of the 6th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2020. p. 725-732. ISSN 2184-4356. ISBN 978-989-758-399-5.
Typ
Stať ve sborníku
Anotace
Distance metric learning is concerned with finding appropriate parameters of distance function with respect to a particular task. In this work, we present a malware detection system based on static analysis. We use k-nearest neighbors (KNN) classifier with weighted heterogeneous distance function that can handle nominal and numeric features extracted from portable executable file format. Our proposed approach attempts to specify the weights of the features using particle swarm optimization algorithm. The experimental results indicate that KNN with the weighted distance function improves classification accuracy significantly.

Automatická detekcia škodlivého kódu

Autoři
Rok
2019
Publikováno
Sborník příspěvků PAD 2019 – elektronická verze. Praha: AMCA spol. s r.o., 2019. p. 35-38. ISBN 978-80-88214-20-5.
Typ
Stať ve sborníku
Anotace
Problém automatickej detekcie malware predstavuje v súčasnosti pre antivírové spoločnosti veľké výzvy. Každý deň sa vygeneruje čoraz väčšie množstvo nových neznámych vzoriek škodlivého kódu, ktoré je potrebné včas detekovať. Pretože manuálna analýza vzoriek z dôvodu ohromného množstva nie je možná, je nutné využiť automatickú detekciu malware. Algoritmy strojového učenia sa ukazujú byť vhodným nástrojom, ktorý je do určitej miery schopný adaptovať sa a detekovať aj nové, doteraz neznáme vzorky.

Side-Channel Attack on the A5/1 Stream Cipher

Rok
2019
Publikováno
Proceedings of the 22nd Euromicro Conference on Digital Systems Design. Los Alamitos, CA: IEEE Computer Soc., 2019. p. 633-638. ISBN 978-1-7281-2862-7.
Typ
Stať ve sborníku
Anotace
In this paper we present cryptanalysis of the A5/1 stream cipher used in GSM mobile phones. Our attack is based on power analysis where we assume that the power consumption while clocking 3 LFSRs is different than when clocking 2 LFSRs. We demonstrate a simple power analysis (SPA) attack and discuss existing differential power analysis (DPA). We present the attack for recovering secret key based on the information on clocking bits of LFSRs that was deduced from power analysis. The attack has a 100% success rate, requires minimal storage and it does not requires any single bit of a keystream. An average time complexity of our attack based on SPA is around 233 where the computation unit is a resolution of system of linear equations over the Z2. Recovering the secret key using information from the DPA has a constant complexity.

Malware Detection Using a Heterogeneous Distance Function

Rok
2018
Publikováno
Computing and Informatics. 2018, 37(3), 759-780. ISSN 1335-9150.
Typ
Článek
Anotace
Classication of automatically generated malware is an active research area. The amount of new malware is growing exponentially and since manual in- vestigation is not possible, automated malware classication is necessary. In this paper, we present a static malware detection system for the detection of unknown malicious programs which is based on combination of the weighted k-nearest neigh- bors classier and the statistical scoring technique from. We have extracted the most relevant features from portable executable (PE) le format using gain ratio and have designed a heterogeneous distance function that can handle both linear and nominal features. Our proposed detection method was evaluated on a dataset with tens of thousands of malicious and benign samples and the experimental re- sults show that the accuracy of our classier is 98.80%. In addition, preliminary results indicate that the proposed similarity metric on our feature space could be used for clustering malware into families.