Mgr. Martin Jureček, Ph.D.

A comparison of adversarial malware generators

Authors

Louthánová, P.; Kozák, M.; Jureček, M.; Stamp, M.; Di Troia, F.

Year

2024

Published

Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 623-639. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-024-00519-z

Departments

Department of Information Security

Annotation

Machine learning has proven to be a valuable tool for automated malware detection, but machine learning systems have also been shown to be subject to adversarial attacks. This paper summarizes and compares related work on generating adversarial malware samples, specifically malicious Windows Portable Executable files. In contrast with previous research, we not only compare generators of adversarial malware examples theoretically, but we also provide an experimental comparison and evaluation for practical usability. We use gradient-based, evolutionary-based, and reinforcement-based approaches to create adversarial samples, which we test against selected antivirus products. The results show that applying optimized modifications to previously detected malware can lead to incorrect classification of the file as benign. Moreover, generated malicious samples can be effectively employed against detection models other than those used to produce them, and combinations of methods can construct new instances that avoid detection. Based on our findings, the Gym-malware generator, which uses reinforcement learning, has the greatest practical potential. This generator has the fastest average sample production time of 5.73 s and the highest average evasion rate of 44.11%. Using the Gym-malware generator in combination with itself further improved the evasion rate to 58.35%. However, other tested methods scored significantly lower in our experiments than reported in the original publications, highlighting the importance of a standardized evaluation environment.

Classification and online clustering of zero-day malware

Authors

Jurečková, O.; Jureček, M.; Stamp, M.; Di Troia, F.; Lórencz, R.

Year

2024

Published

Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 579-592. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-024-00513-5

Departments

Department of Information Security

Annotation

A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples, but also classified into malware families. For this purpose, investigating how existing malware families are developed and examining emerging families need to be explored. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. The features were extracted by static analysis of portable executable files for the Windows operating system. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.

Creating valid adversarial examples of malware

Authors

Kozák, M.; Jureček, M.; Stamp, M.; Di Troia, F.

Year

2024

Published

Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 607-621. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-024-00516-2

Departments

Department of Information Security

Annotation

Because of its world-class results, machine learning (ML) is becoming increasingly popular as a go-to solution for many tasks. As a result, antivirus developers are incorporating ML models into their toolchains. While these models improve malware detection capabilities, they also carry the disadvantage of being susceptible to adversarial attacks. Although this vulnerability has been demonstrated for many models in white-box settings, a black-box scenario is more applicable in practice for the domain of malware detection. We present a method of creating adversarial malware examples using reinforcement learning algorithms. The reinforcement learning agents utilize a set of functionality-preserving modifications, thus creating valid adversarial examples. Using the proximal policy optimization (PPO) algorithm, we achieved an evasion rate of 53.84% against the gradient-boosted decision tree (GBDT) detector. The PPO agent previously trained against the GBDT classifier scored an evasion rate of 11.41% against the neural network-based classifier MalConv and an average evasion rate of 2.31% against top antivirus programs. Furthermore, we discovered that random application of our functionality-preserving portable executable modifications successfully evades leading antivirus engines, with an average evasion rate of 11.65%. These findings indicate that ML-based models used in malware detection systems are sensitive to adversarial attacks and that better safeguards need to be taken to protect these systems.

Reducing overdefined systems of polynomial equations derived from small scale variants of the AES via data mining methods

Authors

Berušková, J.; Jureček, M.; Jurečková, O.

Year

2024

Published

Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 885-900. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-024-00540-2

Departments

Department of Information Security

Annotation

This paper deals with reducing the secret key computation time of small scale variants of the AES cipher using algebraic cryptanalysis, which is accelerated by data mining methods. This work is based on the known plaintext attack and aims to speed up the calculation of the secret key by processing the polynomial equations extracted from plaintext-ciphertext pairs. Specifically, we propose to transform the overdefined system of polynomial equations over GF(2) into a new system so that the computation of the Gröbner basis using the F4 algorithm takes less time than in the case of the original system. The main idea is to group similar polynomials into clusters, and for each cluster, sum the two most similar polynomials, resulting in simpler polynomials. We compare different data mining techniques for finding similar polynomials, such as clustering or locality-sensitive hashing (LSH). Experimental results show that using the LSH technique, we get a system of equations for which we can calculate the Gröbner basis the fastest compared to the other methods that we consider in this work. Experimental results also show that the time to calculate the Gröbner basis for the transformed system of equations is significantly reduced compared to the case when the Gröbner basis was calculated from the original non-transformed system. This paper demonstrates that reducing an overdefined system of equations reduces the computation time for finding a secret key.

Social media bot detection using Dropout-GAN

Authors

Shukla, A.; Jureček, M.; Stamp, M.

Year

2024

Published

Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 669-680. ISSN 2263-8733.

Type

Article

DOI

10.1007/s11416-024-00521-5

Departments

Department of Information Security

Annotation

Bot activity on social media platforms is a pervasive problem, undermining the credibility of online discourse and potentially leading to cybercrime. We propose an approach to bot detection using Generative Adversarial Networks (GAN). We discuss how we overcome the issue of mode collapse by utilizing multiple discriminators to train against one generator, while decoupling the discriminator to perform social media bot detection and utilizing the generator for data augmentation. In terms of classification accuracy, our approach outperforms the state-of-the-art techniques in this field. We also show how the generator in the GAN can be used to evade such a classification technique.

Combining Generators of Adversarial Malware Examples to Increase Evasion Rate

Authors

Kozák, M.; Jureček, M.

Year

2023

Published

Proceedings of the 20th International Conference on Security and Cryptography. Porto: SciTePress - Science and Technology Publications, 2023. p. 778-786. ISSN 2184-7711. ISBN 978-989-758-666-8.

Type

Proceedings paper

DOI

10.5220/0012127700003555

Departments

Department of Information Security

Annotation

Antivirus developers are increasingly embracing machine learning as a key component of malware defense. While machine learning achieves cutting-edge outcomes in many fields, it also has weaknesses that are exploited by several adversarial attack techniques. Many authors have presented both white-box and black-box generators of adversarial malware examples capable of bypassing malware detectors with varying success. We propose to combine contemporary generators in order to increase their potential. Combining different generators can create more sophisticated adversarial examples that are more likely to evade anti-malware tools. We demonstrated this technique on five well-known generators and recorded promising results. The best-performing combination of AMG-random and MAB-Malware generators achieved an average evasion rate of 15.9% against top-tier antivirus products. This represents an average improvement of more than 36% and 627% over using only the AMG-random and MAB-Malware generators, respectively. The generator that benefited the most from having another generator follow its procedure was the FGSM injection attack, which improved the evasion rate on average between 91.97% and 1,304.73%, depending on the second generator used. These results demonstrate that combining different generators can significantly improve their effectiveness against leading antivirus programs.

Generation of Adversarial Malware and Benign Examples Using Reinforcement Learning

Authors

Kozák, M.; Jureček, M.; Lórencz, R.

Year

2022

Published

Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. p. 3-25. Advances in Information Security. vol. 54. ISSN 1568-2633. ISBN 978-3-030-97086-4.

Type

Book chapter

DOI

10.1007/978-3-030-97087-1_1

Departments

Department of Information Security

Annotation

Machine learning is becoming increasingly popular among antivirus developers as a key factor in defence against malware. While machine learning is achieving state-of-the-art results in many areas, it also has drawbacks exploited by many with white-box attacks. Although the white-box scenario is possible in mal- ware detection, the detailed structure of antivirus is often unknown. Consequently, we focused on a pure black-box setup where no information apart from the predicted label is known to the attacker, not even the feature space or predicted score. We implemented our exploratory integrity attack using a reinforcement learning approach on a dataset of portable executable binaries. We tested multiple agent configurations while targeting LightGBM and MalConv classifiers. We achieved an evasion rate of 68.64% and 13.32% against LightGBM and MalConv classifiers, respectively. Besides traditional modelling of malware adversarial samples, we present a setup for creating benign files that can increase the targeted classifier’s false positive rate. This problem was considerably more challenging for our reinforcement learning agents, with an evasion rate of 3.45% and 36.62% against LightGBM and MalConv classifier, respectively. To understand how these attacks transfer from classifiers based purely on machine learning to real-world anti- malware software, we tested the same modified files against seven well-known antiviruses. We achieved an evasion rate of up to 47.09% in malware and 14.29% in benign adversarial attacks.

Interpretability of Machine Learning-Based Results of Malware Detection Using a Set of Rules

Authors

Dolejš, J.; Jureček, M.

Year

2022

Published

Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. p. 107-136. Advances in Information Security. vol. 54. ISSN 1568-2633. ISBN 978-3-030-97086-4.

Type

Book chapter

DOI

10.1007/978-3-030-97087-1_5

Departments

Department of Information Security

Annotation

Machine learning plays an indispensable role in modern malware detection; it provides malware researchers with quick and reliable results. On the other hand, the results can be hard to understand as to why a model classified a given file as malicious or benign. This paper focuses on the interpretability of machine learning models’ results using decision lists generated by two rule-based classifiers, I-REP and RIPPER. We use the EMBER dataset, which contains features extracted through static analysis from Portable Executable files, to train various machine learning models. We extract decision lists from the machine learning models’ results using our implementation of I-REP and RIPPER. By taking into account accuracies, true positive and false positive rates of the decision lists, we reason whether the generated decision lists make a good representation of the results. To comprehend the interpretability of the machine learning models, we define Human Most Understandable Model and Interpretability Entropy. This allows us to measure and compare the interpretability among the models. The most interpretable machine learning model by RIPPER was Gaussian Naïve Bayes. Results show that RIPPER is relatively successful at interpreting other machine learning models; however, it needs some improvements to increase true positive rate.

Parallel Instance Filtering for Malware Detection

Authors

Jureček, M.; Jurečková, O.

Year

2022

Published

Proceedings of 2022 48th Euromicro Conference on Software Engineering and Advanced Applications. Los Alamitos: IEEE Computer Society, 2022. p. 13-20. ISBN 978-1-6654-6152-8.

Type

Proceedings paper

DOI

10.1109/SEAA56994.2022.00012

Departments

Department of Information Security

Annotation

Machine learning algorithms are widely used in the area of malware detection. With the growth of sample amounts, training of classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast since subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm reduces the size of a training data set significantly with the only slightly decreased accuracy. The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.

Yet Another Algebraic Cryptanalysis of Small Scale Variants of AES

Authors

Bielik, M.; Jureček, M.; Jurečková, O.; Lórencz, R.

Year

2022

Published

Proceedings of the 19th International Conference on Security and Cryptography. Madeira: SciTePress, 2022. p. 415-427. ISSN 2184-7711. ISBN 978-989-758-590-6.

Type

Proceedings paper

DOI

10.5220/0011327900003283

Departments

Department of Information Security

Annotation

This work presents new advances in algebraic cryptanalysis of small scale derivatives of AES. We model the cipher as a system of polynomial equations over GF(2), which involves only the variables of the initial key, and we subsequently attempt to solve this system using Gröbner bases. We show, for example, that one of the attacks can recover the secret key for one round of AES-128 under one minute on a contemporary CPU. This attack requires only two known plaintexts and their corresponding ciphertexts. We also compare the performance of Gröbner bases to a SAT solver, and provide an insight into the propagation of diffusion within the cipher.

Application of Distance Metric Learning to Automated Malware Detection

Authors

Jureček, M.; Lórencz, R.

Year

2021

Published

IEEE Access. 2021, 2021(9), 96151-96165. ISSN 2169-3536.

Type

Article

DOI

10.1109/ACCESS.2021.3094064

Departments

Department of Information Security

Annotation

Distance metric learning aims to find the most appropriate distance metric parameters to improve similarity-based models such as k -Nearest Neighbors or k -Means. In this paper, we apply distance metric learning to the problem of malware detection. We focus on two tasks: (1) to classify malware and benign files with a minimal error rate, (2) to detect as much malware as possible while maintaining a low false positive rate. We propose a malware detection system using Particle Swarm Optimization that finds the feature weights to optimize the similarity measure. We compare the performance of the approach with three state-of-the-art distance metric learning techniques. We find that metrics trained in this way lead to significant improvements in the k -Nearest Neighbors classification. We conducted and evaluated experiments with more than 150,000 Windows-based malware and benign samples. Features consisted of metadata contained in the headers of executable files in the portable executable file format. Our experimental results show that our malware detection system based on distance metric learning achieves a 1.09 % error rate at 0.74 % false positive rate (FPR) and outperforms all machine learning algorithms considered in the experiment. Considering the second task related to keeping minimal FPR, we achieved a 1.15 % error rate at only 0.13 % FPR.

Improving Classification of Malware Families using Learning a Distance Metric

Authors

Jureček, M.; Jurečková, O.; Lórencz, R.

Year

2021

Published

Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 643-652. ISSN 2184-4356. ISBN 978-989-758-491-6.

Type

Proceedings paper

DOI

10.5220/0010326306430652

Departments

Department of Information Security

Annotation

The objective of malware family classification is to assign a tested sample to the correct malware family. This paper concerns the application of selected state-of-the-art distance metric learning techniques to malware families classification. The goal of distance metric learning algorithms is to find the most appropriate distance metric parameters concerning some optimization criteria. The distance metric learning algorithms considered in our research learn from metadata, mostly contained in the headers of executable files in the PE file format. Several experiments have been conducted on the dataset with 14,000 samples consisting of six prevalent malware families and benign files. The experimental results showed that the average precision and recall of the k-Nearest Neighbors algorithm using the distance learned on training data were improved significantly comparing when the non-learned distance was used. The k-Nearest Neighbors classifier using the Mahalanobis distance metric learned by the Metric Learning for Kernel Regression method achieved average precision and recall, both of 97.04% compared to Random Forest with a 96.44% of average precision and 96.41% of average recall, which achieved the best classification results among the state-of-the-art ML algorithms considered in our experiments.

Representation of PE Files using LSTM Networks

Authors

Jureček, M.; Kozák, M.

Year

2021

Published

Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 516-525. ISSN 2184-4356. ISBN 978-989-758-491-6.

Type

Proceedings paper

DOI

10.5220/0010257105160525

Departments

Department of Theoretical Computer Science
Department of Information Security

Annotation

An ever-growing number of malicious attacks on IT infrastructures calls for new and efficient methods of protection. In this paper, we focus on malware detection using the Long Short-Term Memory (LSTM) as a preprocessing tool to increase the classification accuracy of machine learning algorithms. To represent the malicious and benign programs, we used features extracted from files in the PE file format. We created a large dataset on which we performed common feature preparation and feature selection techniques. With the help of various LSTM and Bidirectional LSTM (BLSTM) network architectures, we further transformed the collected features and trained other supervised ML algorithms on both transformed and vanilla datasets. Transformation by deep (4 hidden layers) versions of LSTM and BLSTM networks performed well and decreased the error rate of several state-of-the-art machine learning algorithms significantly. For each machine learning algorithm considered in our experiments, the LSTM-based transformation of the feature space results in decreasing the corresponding error rate by more than 58.60 %, in comparison when the feature space was not transformed using LSTM network.

Distance Metric Learning using Particle Swarm Optimization to Improve Static Malware Detection

Authors

Jureček, M.; Lórencz, R.

Year

2020

Published

Proceedings of the 6th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2020. p. 725-732. ISSN 2184-4356. ISBN 978-989-758-399-5.

Type

Proceedings paper

DOI

10.5220/0009180807250732

Departments

Department of Information Security

Annotation

Distance metric learning is concerned with finding appropriate parameters of distance function with respect to a particular task. In this work, we present a malware detection system based on static analysis. We use k-nearest neighbors (KNN) classifier with weighted heterogeneous distance function that can handle nominal and numeric features extracted from portable executable file format. Our proposed approach attempts to specify the weights of the features using particle swarm optimization algorithm. The experimental results indicate that KNN with the weighted distance function improves classification accuracy significantly.

Automatic malicious code detection

Authors

Jureček, M.

Year

2019

Published

Sborník příspěvků PAD 2019 – elektronická verze. Praha: AMCA spol. s r.o., 2019. p. 35-38. ISBN 978-80-88214-20-5.

Type

Proceedings paper

Departments

Department of Information Security

Annotation

The problem of automatic malware detection is nowadays big challenges for antivirus companies. Every day an increasing number of new unknown samples are generated malicious code that needs to be detected in a timely manner. because manual analysis of samples due to the enormous amount is not maybe, you need to use automatic malware detection. algorithms machine learning is proving to be a suitable tool that is up to able to adapt and detect even new ones, to date unknown samples.

Side-Channel Attack on the A5/1 Stream Cipher

Authors

Jureček, M.; Buček, J.; Lórencz, R.

Year

2019

Published

Proceedings of the 22nd Euromicro Conference on Digital Systems Design. Los Alamitos, CA: IEEE Computer Soc., 2019. p. 633-638. ISBN 978-1-7281-2862-7.

Type

Proceedings paper

DOI

10.1109/DSD.2019.00099

Departments

Department of Information Security

Annotation

In this paper we present cryptanalysis of the A5/1 stream cipher used in GSM mobile phones. Our attack is based on power analysis where we assume that the power consumption while clocking 3 LFSRs is different than when clocking 2 LFSRs. We demonstrate a simple power analysis (SPA) attack and discuss existing differential power analysis (DPA). We present the attack for recovering secret key based on the information on clocking bits of LFSRs that was deduced from power analysis. The attack has a 100% success rate, requires minimal storage and it does not requires any single bit of a keystream. An average time complexity of our attack based on SPA is around 233 where the computation unit is a resolution of system of linear equations over the Z2. Recovering the secret key using information from the DPA has a constant complexity.

Malware Detection Using a Heterogeneous Distance Function

Authors

Jureček, M.; Lórencz, R.

Year

2018

Published

Computing and Informatics. 2018, 37(3), 759-780. ISSN 1335-9150.

Type

Article

DOI

10.4149/cai_2018_3_759

Departments

Department of Information Security

Annotation

Classication of automatically generated malware is an active research area. The amount of new malware is growing exponentially and since manual in- vestigation is not possible, automated malware classication is necessary. In this paper, we present a static malware detection system for the detection of unknown malicious programs which is based on combination of the weighted k-nearest neigh- bors classier and the statistical scoring technique from. We have extracted the most relevant features from portable executable (PE) le format using gain ratio and have designed a heterogeneous distance function that can handle both linear and nominal features. Our proposed detection method was evaluated on a dataset with tens of thousands of malicious and benign samples and the experimental re- sults show that the accuracy of our classier is 98.80%. In addition, preliminary results indicate that the proposed similarity metric on our feature space could be used for clustering malware into families.