The purpose of this text is to kick off a series of posts about a field I have been studying: the use of Machine Learning in Information Security.
I will skip the introduction to Machine Learning and how it works, because there are already plenty of resources that explain it.
Machine Learning algorithms fall into two broad types: supervised (when the data is labeled) and unsupervised (when it is not). This post works with supervised learning, so it uses labeled data.
The labeled data used in this experiment was divided into benign files and malware: 991 known benign files and 428 known malware samples. All of the files are Windows binaries, and the benign set includes the binaries that ship in the Windows folder.
Among supervised learning algorithms, the best known are Naive Bayes, k-nearest neighbors (kNN), Support Vector Machines (SVM) and Decision Trees. We’ll work with Decision Trees.
Decision Trees were chosen because we will use the Random Forest classifier, which consists of a large number of decision trees that operate as an ensemble.
The classifier works with the strings extracted from the files. Since there are many distinct strings, a technique called feature hashing (FeatureHash), also known as the hashing trick, was used: it maps the strings into a feature vector of fixed, limited size. In this experiment the size was limited to 20,000 features.
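To make the idea concrete, here is a minimal sketch of string extraction followed by the hashing trick, written with the Python standard library only. It is not the code from the experiment: the function names, the minimum string length of 5 and the use of SHA-256 as the hash function are my assumptions for illustration. Libraries such as scikit-learn ship their own feature hasher, which uses a different hash function.

```python
import hashlib
import re


def extract_strings(data, min_length=5):
    """Return runs of printable ASCII characters at least min_length long.

    min_length=5 is an assumed value, not taken from the experiment.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


def hash_features(strings, n_features=20000):
    """Map any number of strings into a count vector of fixed length.

    Each string is hashed to a bucket index; the bucket counts how many
    strings landed there. Different strings may collide in one bucket.
    """
    vector = [0] * n_features
    for text in strings:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_features
        vector[bucket] += 1
    return vector


# Hypothetical bytes standing in for the contents of a binary file.
sample = b"\x00\x01hello world\x00abc\x00GetProcAddress\xff"
strings = extract_strings(sample)
vector = hash_features(strings)

print(strings)                    # "abc" is dropped because it is too short
print(len(vector), sum(vector))   # vector length is fixed; total count equals number of strings
```

Whatever the number of strings in a file, the resulting vector always has 20,000 entries, which is what lets the classifier treat every binary as a row of the same width.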
We can then train the Random Forest classifier by passing it the hashed features of the training examples (the binaries) and their labels (benign or malware). Once training is finished, the detector and the hasher are saved so they can be used on future binaries.
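The train-then-save step can be sketched with a toy stand-in for a Random Forest. To keep it runnable with the standard library only, each "tree" below is a one-split decision stump trained on a bootstrap sample with one randomly chosen feature, which is far simpler than a real Random Forest. Everything here is hypothetical: the function names, the 4-bucket toy data, the file name and the choice of 64 trees. The post does not state the forest size, although the scores shown later happen to be multiples of 1/64, which would be consistent with 64 voting trees.

```python
import os
import pickle
import random
import tempfile


def majority(labels, default):
    """Most common label in a list of 0/1 labels, or default if it is empty."""
    if not labels:
        return default
    return int(2 * sum(labels) >= len(labels))


def train_stump(X, y, rng):
    """Fit a one-split tree on a bootstrap sample, using one random feature."""
    n = len(X)
    rows = [rng.randrange(n) for _ in range(n)]    # sample rows with replacement
    feature = rng.randrange(len(X[0]))             # random feature for this stump
    fallback = majority([y[i] for i in rows], 0)
    with_feature = [y[i] for i in rows if X[i][feature] > 0]
    without_feature = [y[i] for i in rows if X[i][feature] == 0]
    return (feature, majority(with_feature, fallback), majority(without_feature, fallback))


def train_forest(X, y, n_trees=64, seed=0):
    """Train n_trees stumps; 64 is an assumed value."""
    rng = random.Random(seed)
    return [train_stump(X, y, rng) for _ in range(n_trees)]


def malware_score(forest, x):
    """Fraction of stumps that vote 'malware' (label 1) for feature vector x."""
    votes = sum(hit if x[f] > 0 else miss for f, hit, miss in forest)
    return votes / len(forest)


def save_detector(path, forest, n_features):
    """Store the detector together with the hasher setting it was trained with."""
    with open(path, "wb") as handle:
        pickle.dump({"forest": forest, "n_features": n_features}, handle)


def load_detector(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical toy data: 4 hash buckets instead of 20000.
X = [[1, 1, 0, 0]] * 4 + [[0, 0, 1, 1]] * 4
y = [0] * 4 + [1] * 4    # 0 = benign, 1 = malware

forest = train_forest(X, y)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "detector.pkl")
    save_detector(path, forest, n_features=4)
    saved = load_detector(path)

print(malware_score(saved["forest"], [0, 0, 1, 1]))   # malware-like vector
print(malware_score(saved["forest"], [1, 1, 0, 0]))   # benign-like vector
```

Saving the hasher setting next to the detector matters because a new binary has to be hashed into exactly the same number of buckets the detector was trained on.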
After that, we can test the trained detector with new binaries. Since this is our own implementation, we can choose the threshold at which a file is classified as malware rather than benign. The chosen threshold was 65%.
First, we can test with a known benign file, the PuTTY executable.
Extracted 5619 strings from data/ch8/scan_file_path/putty_w64.exe
It appears this file is benign. [0.640625]
The file was classified as benign, with a 64% probability of being malicious: close to, but below, the 65% threshold we defined. Now let’s test a file that was downloaded from a phishing e-mail.
Extracted 7217 strings from data/ch8/scan_file_path/OneDrive.bin
It appears this file is malicious! [0.734375]
This time the file was classified as malicious, with a probability of 73%, higher than the 65% threshold we defined.
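The decision rule applied to the two scores above can be sketched in a few lines. The function name is hypothetical, and whether a score exactly equal to the threshold counts as malicious is my assumption; only the 65% threshold, the two scores and the message wording come from the runs shown above.

```python
def verdict(malware_probability, threshold=0.65):
    """Turn the detector's malware probability into the message printed above.

    Treating a score equal to the threshold as malicious is an assumption.
    """
    if malware_probability >= threshold:
        return "It appears this file is malicious! [{}]".format(malware_probability)
    return "It appears this file is benign. [{}]".format(malware_probability)


print(verdict(0.640625))   # score reported for putty_w64.exe
print(verdict(0.734375))   # score reported for OneDrive.bin
```

Raising the threshold reduces false alarms on benign files at the cost of missing more malware, and lowering it does the opposite, so the 65% value is a trade-off rather than a property of the classifier.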
As you can see, using only the strings extracted from binaries is not sufficient if we want to reliably classify binaries as benign or malware (a known benign file came within about a point of the threshold), but it works as a simple example to explain the basics of how the classification works. We need to think about extracting more information if we want to build a more robust malware detector with Machine Learning.
You may have noticed that this post does not include the code of the experiment itself. This is because it’s just a simple introduction and, as I mentioned earlier, the start of a series of posts that I’ll write. Another reason is that in this post I just reproduced an experiment that can be found in Saxe and Sanders’ book Malware Data Science, published by No Starch Press, and the code can be downloaded from the book’s website.