Mgr. Martin Jureček, Ph.D.

Publications

Classification and online clustering of zero-day malware

Authors
Jurečková, O.; Jureček, M.; Stamp, M.; Di Troia, F.; Lórencz, R.
Year
2024
Published
Journal of Computer Virology and Hacking Techniques. 2024, 2024 1-14. ISSN 2263-8733.
Type
Article
Annotation
A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples, but also classified into malware families. For this purpose, investigating how existing malware families are developed and examining emerging families need to be explored. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. The features were extracted by static analysis of portable executable files for the Windows operating system. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.

Generation of Adversarial Malware and Benign Examples Using Reinforcement Learning

Authors
Kozák, M.; Jureček, M.; Lórencz, R.
Year
2022
Published
Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. p. 3-25. Advances in Information Security. vol. 54. ISSN 1568-2633. ISBN 978-3-030-97086-4.
Type
Book chapter
Annotation
Machine learning is becoming increasingly popular among antivirus developers as a key factor in defence against malware. While machine learning is achieving state-of-the-art results in many areas, it also has drawbacks exploited by many with white-box attacks. Although the white-box scenario is possible in mal- ware detection, the detailed structure of antivirus is often unknown. Consequently, we focused on a pure black-box setup where no information apart from the predicted label is known to the attacker, not even the feature space or predicted score. We implemented our exploratory integrity attack using a reinforcement learning approach on a dataset of portable executable binaries. We tested multiple agent configurations while targeting LightGBM and MalConv classifiers. We achieved an evasion rate of 68.64% and 13.32% against LightGBM and MalConv classifiers, respectively. Besides traditional modelling of malware adversarial samples, we present a setup for creating benign files that can increase the targeted classifier’s false positive rate. This problem was considerably more challenging for our reinforcement learning agents, with an evasion rate of 3.45% and 36.62% against LightGBM and MalConv classifier, respectively. To understand how these attacks transfer from classifiers based purely on machine learning to real-world anti- malware software, we tested the same modified files against seven well-known antiviruses. We achieved an evasion rate of up to 47.09% in malware and 14.29% in benign adversarial attacks.

Interpretability of Machine Learning-Based Results of Malware Detection Using a Set of Rules

Authors
Dolejš, J.; Jureček, M.
Year
2022
Published
Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. p. 107-136. Advances in Information Security. vol. 54. ISSN 1568-2633. ISBN 978-3-030-97086-4.
Type
Book chapter
Annotation
Machine learning plays an indispensable role in modern malware detection; it provides malware researchers with quick and reliable results. On the other hand, the results can be hard to understand as to why a model classified a given file as malicious or benign. This paper focuses on the interpretability of machine learning models’ results using decision lists generated by two rule-based classifiers, I-REP and RIPPER. We use the EMBER dataset, which contains features extracted through static analysis from Portable Executable files, to train various machine learning models. We extract decision lists from the machine learning models’ results using our implementation of I-REP and RIPPER. By taking into account accuracies, true positive and false positive rates of the decision lists, we reason whether the generated decision lists make a good representation of the results. To comprehend the interpretability of the machine learning models, we define Human Most Understandable Model and Interpretability Entropy. This allows us to measure and compare the interpretability among the models. The most interpretable machine learning model by RIPPER was Gaussian Naïve Bayes. Results show that RIPPER is relatively successful at interpreting other machine learning models; however, it needs some improvements to increase true positive rate.

Parallel Instance Filtering for Malware Detection

Year
2022
Published
Proceedings of 2022 48th Euromicro Conference on Software Engineering and Advanced Applications. Los Alamitos: IEEE Computer Society, 2022. p. 13-20. ISBN 978-1-6654-6152-8.
Type
Proceedings paper
Annotation
Machine learning algorithms are widely used in the area of malware detection. With the growth of sample amounts, training of classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast since subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm reduces the size of a training data set significantly with the only slightly decreased accuracy. The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.

Yet Another Algebraic Cryptanalysis of Small Scale Variants of AES

Authors
Year
2022
Published
Proceedings of the 19th International Conference on Security and Cryptography. Madeira: SciTePress, 2022. p. 415-427. ISSN 2184-7711. ISBN 978-989-758-590-6.
Type
Proceedings paper
Annotation
This work presents new advances in algebraic cryptanalysis of small scale derivatives of AES. We model the cipher as a system of polynomial equations over GF(2), which involves only the variables of the initial key, and we subsequently attempt to solve this system using Gröbner bases. We show, for example, that one of the attacks can recover the secret key for one round of AES-128 under one minute on a contemporary CPU. This attack requires only two known plaintexts and their corresponding ciphertexts. We also compare the performance of Gröbner bases to a SAT solver, and provide an insight into the propagation of diffusion within the cipher.

Application of Distance Metric Learning to Automated Malware Detection

Year
2021
Published
IEEE Access. 2021, 2021(9), 96151-96165. ISSN 2169-3536.
Type
Article
Annotation
Distance metric learning aims to find the most appropriate distance metric parameters to improve similarity-based models such as k -Nearest Neighbors or k -Means. In this paper, we apply distance metric learning to the problem of malware detection. We focus on two tasks: (1) to classify malware and benign files with a minimal error rate, (2) to detect as much malware as possible while maintaining a low false positive rate. We propose a malware detection system using Particle Swarm Optimization that finds the feature weights to optimize the similarity measure. We compare the performance of the approach with three state-of-the-art distance metric learning techniques. We find that metrics trained in this way lead to significant improvements in the k -Nearest Neighbors classification. We conducted and evaluated experiments with more than 150,000 Windows-based malware and benign samples. Features consisted of metadata contained in the headers of executable files in the portable executable file format. Our experimental results show that our malware detection system based on distance metric learning achieves a 1.09 % error rate at 0.74 % false positive rate (FPR) and outperforms all machine learning algorithms considered in the experiment. Considering the second task related to keeping minimal FPR, we achieved a 1.15 % error rate at only 0.13 % FPR.

Improving Classification of Malware Families using Learning a Distance Metric

Year
2021
Published
Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 643-652. ISSN 2184-4356. ISBN 978-989-758-491-6.
Type
Proceedings paper
Annotation
The objective of malware family classification is to assign a tested sample to the correct malware family. This paper concerns the application of selected state-of-the-art distance metric learning techniques to malware families classification. The goal of distance metric learning algorithms is to find the most appropriate distance metric parameters concerning some optimization criteria. The distance metric learning algorithms considered in our research learn from metadata, mostly contained in the headers of executable files in the PE file format. Several experiments have been conducted on the dataset with 14,000 samples consisting of six prevalent malware families and benign files. The experimental results showed that the average precision and recall of the k-Nearest Neighbors algorithm using the distance learned on training data were improved significantly comparing when the non-learned distance was used. The k-Nearest Neighbors classifier using the Mahalanobis distance metric learned by the Metric Learning for Kernel Regression method achieved average precision and recall, both of 97.04% compared to Random Forest with a 96.44% of average precision and 96.41% of average recall, which achieved the best classification results among the state-of-the-art ML algorithms considered in our experiments.

Representation of PE Files using LSTM Networks

Authors
Jureček, M.; Kozák, M.
Year
2021
Published
Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 516-525. ISSN 2184-4356. ISBN 978-989-758-491-6.
Type
Proceedings paper
Annotation
An ever-growing number of malicious attacks on IT infrastructures calls for new and efficient methods of protection. In this paper, we focus on malware detection using the Long Short-Term Memory (LSTM) as a preprocessing tool to increase the classification accuracy of machine learning algorithms. To represent the malicious and benign programs, we used features extracted from files in the PE file format. We created a large dataset on which we performed common feature preparation and feature selection techniques. With the help of various LSTM and Bidirectional LSTM (BLSTM) network architectures, we further transformed the collected features and trained other supervised ML algorithms on both transformed and vanilla datasets. Transformation by deep (4 hidden layers) versions of LSTM and BLSTM networks performed well and decreased the error rate of several state-of-the-art machine learning algorithms significantly. For each machine learning algorithm considered in our experiments, the LSTM-based transformation of the feature space results in decreasing the corresponding error rate by more than 58.60 %, in comparison when the feature space was not transformed using LSTM network.

Distance Metric Learning using Particle Swarm Optimization to Improve Static Malware Detection

Year
2020
Published
Proceedings of the 6th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2020. p. 725-732. ISSN 2184-4356. ISBN 978-989-758-399-5.
Type
Proceedings paper
Annotation
Distance metric learning is concerned with finding appropriate parameters of distance function with respect to a particular task. In this work, we present a malware detection system based on static analysis. We use k-nearest neighbors (KNN) classifier with weighted heterogeneous distance function that can handle nominal and numeric features extracted from portable executable file format. Our proposed approach attempts to specify the weights of the features using particle swarm optimization algorithm. The experimental results indicate that KNN with the weighted distance function improves classification accuracy significantly.

Automatic malicious code detection

Authors
Year
2019
Published
Sborník příspěvků PAD 2019 – elektronická verze. Praha: AMCA spol. s r.o., 2019. p. 35-38. ISBN 978-80-88214-20-5.
Type
Proceedings paper
Annotation
The problem of automatic malware detection is nowadays big challenges for antivirus companies. Every day an increasing number of new unknown samples are generated malicious code that needs to be detected in a timely manner. because manual analysis of samples due to the enormous amount is not maybe, you need to use automatic malware detection. algorithms machine learning is proving to be a suitable tool that is up to able to adapt and detect even new ones, to date unknown samples.

Side-Channel Attack on the A5/1 Stream Cipher

Year
2019
Published
Proceedings of the 22nd Euromicro Conference on Digital Systems Design. Los Alamitos, CA: IEEE Computer Soc., 2019. p. 633-638. ISBN 978-1-7281-2862-7.
Type
Proceedings paper
Annotation
In this paper we present cryptanalysis of the A5/1 stream cipher used in GSM mobile phones. Our attack is based on power analysis where we assume that the power consumption while clocking 3 LFSRs is different than when clocking 2 LFSRs. We demonstrate a simple power analysis (SPA) attack and discuss existing differential power analysis (DPA). We present the attack for recovering secret key based on the information on clocking bits of LFSRs that was deduced from power analysis. The attack has a 100% success rate, requires minimal storage and it does not requires any single bit of a keystream. An average time complexity of our attack based on SPA is around 233 where the computation unit is a resolution of system of linear equations over the Z2. Recovering the secret key using information from the DPA has a constant complexity.

Malware Detection Using a Heterogeneous Distance Function

Year
2018
Published
Computing and Informatics. 2018, 37(3), 759-780. ISSN 1335-9150.
Type
Article
Annotation
Classication of automatically generated malware is an active research area. The amount of new malware is growing exponentially and since manual in- vestigation is not possible, automated malware classication is necessary. In this paper, we present a static malware detection system for the detection of unknown malicious programs which is based on combination of the weighted k-nearest neigh- bors classier and the statistical scoring technique from. We have extracted the most relevant features from portable executable (PE) le format using gain ratio and have designed a heterogeneous distance function that can handle both linear and nominal features. Our proposed detection method was evaluated on a dataset with tens of thousands of malicious and benign samples and the experimental re- sults show that the accuracy of our classier is 98.80%. In addition, preliminary results indicate that the proposed similarity metric on our feature space could be used for clustering malware into families.