Today, thanks to high-quality open source software that handles the heavy mathematical lifting of implementing machine learning systems, anyone who knows basic Python and understands the key concepts can use machine learning.
In this chapter, I show you how to build machine learning malware detection systems using scikit-learn, the most popular—and the best, in my opinion—open source machine learning package available. This chapter contains a lot of sample code. You can find the major code blocks in the directory malware_data_science/ch8/code and the corresponding sample data in malware_data_science/ch8/data, both in the code and data accompanying this book and on its virtual machine.
By following along with the text, examining the sample code, and trying out the provided examples, you should be comfortable building and evaluating your own machine learning systems by the end of the chapter. You learn to build a general malware detector and gain the tools you need to build detectors for specific malware families. The skills you gain here will have broad application, allowing you to apply machine learning to other security problems, such as detecting malicious emails or suspicious network streams.
First, you learn the terminology and concepts you need to know before using scikit-learn. Then, you use scikit-learn to implement a basic decision tree detector based on the decision tree concepts you learned in Chapter 6. Next, you learn how to integrate feature extraction code with scikit-learn to build a real-world malware detector that uses real-world feature extraction and a random forest approach to detect malware. Finally, you learn how to use scikit-learn to evaluate machine learning systems with the sample random forest detector.
Let’s review some terminology first. The open source library scikit-learn (sklearn for short) has become popular in the machine learning community because it’s both powerful and easy to use. Many data scientists within the computer security community and in other fields use the library, and many rely on it as their main toolbox for performing machine learning tasks. Although sklearn is constantly being updated with new machine learning approaches, it provides a consistent programming interface that makes using these approaches simple.
Like many machine learning frameworks, sklearn requires training data in vector form. Vectors are arrays of numbers where each index in the array corresponds to a single feature of the training example software binary. For example, if the two features of software binaries our machine learning detector uses are is compressed and contains encrypted data, then our feature vector for a training example binary could be [0,1]. Here, the first index in the vector would represent whether or not the binary is compressed, with the zero indicating “no,” and the second index would represent whether or not the binary contains encrypted data, with the one indicating “yes.”
Vectors can be awkward to work with because you have to remember what feature each index maps to. Fortunately, sklearn provides helper code that translates other data representations to vector form. For example, you can use sklearn’s DictVectorizer class to transform dictionary representations of your training data (for instance, {"is compressed":1,"contains encrypted data":0}) into the vector representation that sklearn operates on. Note that DictVectorizer chooses the index ordering itself (it sorts feature names), so this dictionary becomes [0,1]: index 0 holds contains encrypted data and index 1 holds is compressed. Later, you can use the DictVectorizer to recover the mapping between the vector’s indices and the original feature names.
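Here’s a minimal sketch of that round trip, using the two toy features from above (the variable names are mine, not from this chapter’s sample code):

from sklearn.feature_extraction import DictVectorizer

# toy feature dictionaries for two hypothetical binaries
examples = [
    {"is compressed": 1, "contains encrypted data": 0},
    {"is compressed": 0, "contains encrypted data": 1},
]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(examples)  # one row per binary
print(X)                                # [[0. 1.], [1. 0.]]
print(vectorizer.get_feature_names())   # index-to-feature-name mapping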
To train an sklearn-based detector, you need to pass in two separate objects to sklearn: feature vectors (as described previously) and a label vector. A label vector contains one number per training example, which corresponds, in our case, to whether or not the example is malware or benignware. For instance, if we pass three training examples to sklearn, and then pass the label vector [0,1,0], we’re telling sklearn that the first sample is benignware, the second sample is malware, and the third is benignware. By convention, machine learning engineers use a capital X variable to represent the training data and a lowercase y variable to represent the labels. The difference in case reflects the convention in mathematics of capitalizing variables that represent matrices (which we can think of as arrays of vectors) and lowercasing variables that represent individual vectors. You’ll see this convention used in machine learning sample code online, and I use this convention for the remainder of this book to get you comfortable with it.
The sklearn framework uses other terminology that you might find new as well. Instead of calling machine learning–based detectors “detectors,” sklearn calls them “classifiers.” In this context, the term classifier simply means a machine learning system that categorizes things into two or more categories. Therefore, a detector (the term I use throughout this book) is a special type of classifier that places things into two categories, like malware and benignware. Also, instead of using the term training, sklearn’s documentation and API often use the term fit. For example, you’ll see a sentence like “fit a machine learning classifier using training examples,” which is equivalent to saying “train a machine learning detector using training examples.”
Finally, instead of using the term detect in the context of classifiers, sklearn uses the term predict. This term is used in sklearn’s framework, and in the machine learning community more generally, whenever a machine learning system is used to perform a task, whether to predict the value of a stock a week from now or detect whether an unknown binary is malware.
Now that you’re familiar with sklearn’s technical terminology, let’s create a simple decision tree along the lines of what we discussed in Chapter 6, using the sklearn framework. Recall that decision trees play a “20 questions” type of game in which they ask a series of questions about input vectors to arrive at a decision concerning whether these vectors are malicious or benign. We walk through building a decision tree classifier, step by step, and then explore an example of a complete program. Listing 8-1 shows how to import the requisite modules from sklearn.
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer
Listing 8-1: Importing sklearn modules
The first module we import, tree, is sklearn’s decision tree module. The second module, feature_extraction, is sklearn’s helper module from which we import the DictVectorizer class. The DictVectorizer class conveniently translates the training data provided in readable, dictionary form to the vector representation that sklearn requires to actually train machine learning detectors.
After we import the modules we need from sklearn, we instantiate the requisite sklearn classes, as shown in Listing 8-2.
classifier = ➊tree.DecisionTreeClassifier()
vectorizer = ➋DictVectorizer(sparse=➌False)
Listing 8-2: Initializing the decision tree classifier and vectorizer
The first class we instantiate, DecisionTreeClassifier ➊, represents our detector. Although sklearn provides a number of parameters that control exactly how our decision tree will work, here we don’t select any parameters so that we’re using sklearn’s default decision tree settings.
The next class we instantiate is DictVectorizer ➋. We set sparse to False ➌ in the constructor, telling sklearn that we do not want it to use sparse vectors, which save memory but are complicated to work with. Because sklearn’s decision tree module can’t use sparse vectors, we turn this feature off.
Now that we have instantiated our classes, we can initialize some sample training data, as shown in Listing 8-3.
# declare toy training data
➊ training_examples = [
    {'packed':1,'contains_encrypted':0},
    {'packed':0,'contains_encrypted':0},
    {'packed':1,'contains_encrypted':1},
    {'packed':1,'contains_encrypted':0},
    {'packed':0,'contains_encrypted':1},
    {'packed':1,'contains_encrypted':0},
    {'packed':0,'contains_encrypted':0},
    {'packed':0,'contains_encrypted':0},
]
➋ ground_truth = [1,1,1,1,0,0,0,0]
Listing 8-3: Declaring training and label vectors
In this example, we initialize two structures—feature vectors and a label vector—that together comprise our training data. The feature vectors, assigned to the training_examples variable ➊, are given in dictionary form. As you can see, we’re using two simple features. The first is packed, which represents whether a given file is packed, and the second is contains_encrypted, which represents whether the file contains encrypted data. The label vector, which is assigned to the ground_truth variable ➋, represents whether each training example is malicious or benign. In this book, and in general among security data scientists, 0 always stands for benign and 1 always stands for malicious. In this case, the label vector declares that the first four feature vectors are malicious and the second four are benign.
Now that we’ve declared our training vectors and label vector, let’s train our decision tree model by calling the decision tree class instance’s fit method, as shown in Listing 8-4.
# initialize the vectorizer with the training data
➊ vectorizer.fit(training_examples)
# transform the training examples to vector form
➋ X = vectorizer.transform(training_examples)
y = ground_truth # call ground truth 'y', by convention
Listing 8-4: Initializing the vectorizer class with training data
The code in Listing 8-4 first initializes the vectorizer class that we instantiated in Listing 8-2 by calling its fit method ➊. Here, the fit method tells sklearn to create a mapping between the packed and contains_encrypted features and indices in the feature vectors. Then we transform our dictionary-based feature vectors into numerical vector form by calling the vectorizer class’s transform method ➋. Recall that we assign our feature vectors to a variable called X and our label vector to a variable called y, which is the naming convention in the machine learning community.
Now that we’re all set up with our training data, we can train our decision tree detector by calling the fit method on the decision tree classifier instance, like this:
# train the classifier (a.k.a. 'fit' the classifier)
classifier.fit(X,y)
As you can see, training the sklearn detector is as simple as that. But behind the scenes, sklearn is going through the algorithmic process of identifying a good decision tree for accurately detecting whether new software is malicious or benign, along the lines of the algorithm we discussed in the previous chapter.
Now that we’ve trained the detector, let’s use the code in Listing 8-5 to detect whether a binary is malicious or benign.
test_example = ➊{'packed':1,'contains_encrypted':0}
test_vector = ➋vectorizer.transform(test_example)
➌ print classifier.predict(test_vector) # prints [1]
Listing 8-5: Determining whether a binary is malicious
Here, we instantiate a dictionary-based feature vector for a hypothetical software binary ➊, translate it to numerical vector form using vectorizer ➋, which we declared earlier in our code, and then run the decision tree detector we built ➌ to determine whether or not the binary is malicious. You’ll see when we run the code that the classifier “thinks” that the new binary is malicious (because it gives a “1” as its output), and you’ll see why this is the case when we visualize our decision tree.
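If you’d like to see how confident the classifier is rather than just its yes-or-no verdict, you can ask it for a probability instead. Here’s a brief sketch, reusing the classifier and test_vector from above (the exact numbers depend on the tree sklearn learns from our toy data):

# columns are ordered [P(benign), P(malicious)]
print(classifier.predict_proba(test_vector))  # e.g. [[ 0.33  0.67]]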
We can visualize the decision tree that sklearn has automatically created based on our training data, as shown in Listing 8-6.
# visualize the decision tree
with open(➊"classifier.dot","w") as output_file:
    ➋ tree.export_graphviz(
        classifier,
        feature_names=vectorizer.get_feature_names(),
        out_file=output_file
    )
import os
os.system("dot classifier.dot -Tpng -o classifier.png")
Listing 8-6: Creating an image file of the decision tree using GraphViz
Here, we open a file called classifier.dot ➊ and call the export_graphviz() function that sklearn’s tree module provides ➋ to write a GraphViz .dot network representation of our decision tree to that file. Then we use the GraphViz dot command line program to create an image file that visualizes the decision tree, in a form that corresponds to what you learned about decision trees in Chapter 6. When you run this, you should get an output image file called classifier.png that looks like Figure 8-1.
Figure 8-1: Decision tree visualization
Although this decision tree visualization should be familiar from Chapter 6, it contains some new vocabulary. The first line in each box contains the name of the feature about which the node asks a question (in machine learning parlance, we say that the node “splits on” this feature). For example, the first node splits on the feature “packed”: if a binary is not packed, we move along the left arrow; otherwise, we move along the right arrow.
The second line of text in each box gives that node’s gini index, which measures how mixed the malware and benignware training examples that match that node are. The lower the gini index, the more skewed the samples that match that node are toward either benignware or malware. A low gini index in a node is good, because the more the training examples skew toward either malware or benignware, the more sure we are about whether new test examples that reach that node are malware or benignware. The third line in each box just gives the number of training examples that matched that node.
You’ll notice that in the leaf nodes of the tree, the text in the box is different. These nodes don’t “ask a question”; instead, they provide an answer to the question “is this binary malicious or benign?” For example, in the leftmost leaf node, we have “value = [2. 1.],” which means that two benign training examples matched this node (not packed and not encrypted) and one malware training example matched the node. That is, if we reach this node, we’d assign a probability of 33 percent to the binary being malware (1 malware sample / 3 total samples = 33 percent). The gini value in these boxes measures how mixed the training examples that reached that leaf are between malware and benignware. As you can see, it can be useful to inspect visualizations of decision trees generated by sklearn to understand how our decision trees are making detections.
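If you want to verify the gini values in the figure yourself, the computation is simple enough to do by hand. Here’s a quick sketch of the formula (my own helper, not sklearn code):

def gini_impurity(num_benign, num_malware):
    # gini = 1 - (sum of squared class proportions);
    # 0.0 means a pure node, 0.5 means a 50/50 mix
    total = float(num_benign + num_malware)
    p_benign, p_malware = num_benign / total, num_malware / total
    return 1.0 - (p_benign ** 2 + p_malware ** 2)

print(gini_impurity(2, 1))  # the leftmost leaf's value = [2. 1.] gives ~0.444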
Listing 8-7 shows the complete code for the decision tree workflow I have described thus far. This code should read easily now that we’ve worked through it piece by piece.
#!/usr/bin/python

# import sklearn modules
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

# initialize the decision tree classifier and vectorizer
classifier = tree.DecisionTreeClassifier()
vectorizer = DictVectorizer(sparse=False)

# declare toy training data
training_examples = [
    {'packed':1,'contains_encrypted':0},
    {'packed':0,'contains_encrypted':0},
    {'packed':1,'contains_encrypted':1},
    {'packed':1,'contains_encrypted':0},
    {'packed':0,'contains_encrypted':1},
    {'packed':1,'contains_encrypted':0},
    {'packed':0,'contains_encrypted':0},
    {'packed':0,'contains_encrypted':0},
]
ground_truth = [1,1,1,1,0,0,0,0]

# initialize the vectorizer with the training data
vectorizer.fit(training_examples)

# transform the training examples to vector form
X = vectorizer.transform(training_examples)
y = ground_truth # call ground truth 'y', by convention

# train the classifier (a.k.a. 'fit' the classifier)
classifier.fit(X,y)

test_example = {'packed':1,'contains_encrypted':0}
test_vector = vectorizer.transform(test_example)
print classifier.predict(test_vector) # prints [1]

# visualize the decision tree
with open("classifier.dot","w") as output_file:
    tree.export_graphviz(
        classifier,
        feature_names=vectorizer.get_feature_names(),
        out_file=output_file
    )

import os
os.system("dot classifier.dot -Tpng -o classifier.png")
Listing 8-7: Complete decision tree workflow sample code
The sample machine learning malware detector we just explored demonstrates how to get started with sklearn’s functionality, but it’s missing some essential features required for a real-world malware detector. Let’s now explore what a real-world malware detector entails.
To build a real-world detector, you need to use industrial-strength features of software binaries as well as write code to extract these features from software binaries. Industrial-strength features are those that reflect the content of binaries in all their complexity, which means we need to use hundreds or thousands of features. By “extracting” features I mean that you have to write code that identifies the presence of these features within binaries. You also need to use thousands of training examples and train a machine learning model at scale. Finally, you need to use sklearn’s more advanced detection approaches because the simple decision tree approaches we just explored don’t provide sufficient detection accuracy.
The sample features I used previously, such as is packed and contains encrypted data, are simple toy examples, and these two features alone will never result in a working malware detector. As I mentioned previously, real-world malware detection systems use hundreds, thousands, or even millions of features. For example, a machine learning–based detector might use millions of character strings that occur in software binaries as features. Or it might use the values of software binary Portable Executable (PE) headers, the functions imported by a given binary, or some combination of all of these. Although we’ll work only with string features in this chapter, let’s take a moment to explore common categories of features used in machine learning–based malware detection, starting with the string features.
The string features of a software binary are all the contiguous strings of printable characters in the file that are at least some minimum length (in this book, this minimum is set to five characters). For example, suppose a binary file contains the following sequences of printable characters:
["A", "The", "PE executable", "Malicious payload"]
In this case, the strings we can use as features would be "PE executable" and "Malicious payload", because only these two strings meet the five-character minimum.
To transform string features into a format that sklearn can understand, we need to put them into a Python dictionary. We do this by using the actual strings as dictionary keys and then setting their values to 1 to indicate that the binary in question contains that string. For example, the previous sample binary would get a feature dictionary of {"PE executable": 1, "Malicious payload": 1}. Of course, most software binaries have hundreds of printable strings in them, not just two, and these strings can contain rich information about what a program does.
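Here’s a minimal sketch of that transformation (the helper name is mine, for illustration):

import re

def string_feature_dict(data, min_length=5):
    # find every run of printable characters at least min_length long
    strings = re.findall("[ -~]{%d,}" % min_length, data)
    # mark each string as present with a value of 1
    return dict((s, 1) for s in strings)

print(string_feature_dict("junk\x00\x01PE executable\x00Malicious payload"))
# {'PE executable': 1, 'Malicious payload': 1}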
In fact, string features work well with machine learning–based detection because they capture so much information about software binaries. If the binary is a packed malware sample, then it’s likely to have few informative strings, which in itself can be a giveaway that the file is malicious. On the other hand, if parts of the file’s resources section are not packed or obfuscated, then those strings reveal much about the file’s behavior. For example, if the binary program in question makes HTTP requests, it’s common to see strings such as "GET %s" in that file’s set of strings.
String features have some limitations, however. For example, they don’t capture anything about the actual logic of a binary program, because they don’t include actual program code. So, although strings can be useful features even on packed binaries, they don’t reveal what packed binaries actually do. As a result, detectors based on string features are not ideal for detecting packed malware.
PE header features are extracted from the PE header metadata that resides in every Windows .exe and .dll file. For more information on the format of these headers, refer to Chapter 1. To extract PE features from static program binaries, you can use the code given in that chapter, and then encode file features in Python dictionary form, where the header field name is the dictionary key and the field value is the value corresponding to each key.
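As a sketch of what this encoding might look like using the pefile module introduced in Chapter 1 (the specific fields and key names here are my own illustrative choices, not a fixed recipe):

import pefile

def get_pe_header_features(path):
    # parse the PE header and pull out a few example fields
    pe = pefile.PE(path)
    features = {
        "timestamp": pe.FILE_HEADER.TimeDateStamp,
        "size_of_image": pe.OPTIONAL_HEADER.SizeOfImage,
        "number_of_sections": pe.FILE_HEADER.NumberOfSections,
    }
    # record each section's on-disk size as its own feature
    for section in pe.sections:
        # section names are padded with null bytes
        name = section.Name.decode("ascii", "ignore").rstrip("\x00")
        features["section:%s:size" % name] = section.SizeOfRawData
    return features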
PE header features complement string features well. For example, whereas string features often do a good job of capturing the function calls and network transmissions made by a program, like the "GET %s" example, PE header features capture information like a program binary’s compile timestamp, the layout of its PE sections, and which of those sections are marked executable and how large they are on disk. They also capture the amount of memory a program allocates upon startup, and many other runtime characteristics of a program binary that string features don’t capture.
Even when you’re dealing with packed binaries, PE header features can still do a decent job of distinguishing packed malware from packed benignware. This is because although we cannot see packed binaries’ code because of obfuscation, we can still see how much space the code takes up on disk and how the binary is laid out on disk or compressed over a series of file sections. These are telling details that can help a machine learning system distinguish malware from benignware. On the downside, PE header features don’t capture the actual instructions a program executes when it is run, or the functions that it calls.
The Import Address Table (IAT), which you learned about in Chapter 1, is also an important source of machine learning features. The IAT contains a list of functions and libraries that a software binary imports from external DLL files. As such, the IAT contains important information about program behavior that you can use to complement the PE header features described in the previous section.
To use the IAT as a source of machine learning features, you need to represent each file as a dictionary of features, where the combined name of the imported library and function is the key and the value is 1, indicating that the file in question contains that specific import. For example, the key "KERNEL32.DLL:LoadLibraryA" combines the DLL KERNEL32.DLL with the function call LoadLibraryA. The feature dictionary resulting from computing IAT features in this way for a sample would look like {"KERNEL32.DLL:LoadLibraryA": 1, ...}, where we assign a 1 to every import observed in the binary.
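Here’s a sketch of how you might compute such a dictionary with pefile (again, an illustration rather than this book’s official extraction code):

import pefile

def get_iat_features(path):
    # map each "DLL:function" import to 1, indicating its presence
    pe = pefile.PE(path)
    features = {}
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        for imp in entry.imports:
            if imp.name is None:
                continue  # imports by ordinal have no name
            # on Python 3, entry.dll and imp.name are bytes; decode them
            # first if you prefer string keys
            features["%s:%s" % (entry.dll, imp.name)] = 1
    return features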
In my experience building malware detectors, I have found that IAT features rarely work well on their own—although these features capture useful high-level information about program behavior, malware often obfuscates the IAT to make itself look like benignware. Even when malware contains no obfuscation, it often imports the same DLL calls that benignware imports, making it hard to distinguish between malware and benignware based on IAT information alone. Finally, when malware is packed (compressed or encrypted, such that the real malware code is visible only after the malware executes and decompresses or decrypts itself), the IAT contains only the imports that the packer uses, not the imports that the malware uses. That said, when you use IAT features in conjunction with other features like PE header features and string features, they can boost system accuracy.
So far you’ve learned about machine learning features that don’t involve any concept of ordering. For example, we discussed string features to check whether or not a binary has a particular string, but not whether a particular string comes before or after another string in the layout of the binary on disk.
But sometimes order matters. For example, we might find that an important malware family imports only commonly used functions, but imports them in a very specific order, such that when we observe the functions in that order, we know we’re seeing that malware family and not benignware. To capture this kind of order information, you can use a machine learning concept called an N-gram.
N-grams sound more exotic than they are: they just involve laying out your features in the sequence in which they occur and then sliding a window of length n over the sequence, treating the sequence of features inside the window at each step as a single, aggregate feature. For example, if we had the sequence ["how", "now", "brown", "cow"] and we wanted to extract N-gram features of length 2 (n = 2) from this sequence, we would have [("how","now"), ("now","brown"), ("brown","cow")] as our features.
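In code, extracting N-grams is just a sliding window over a list; a minimal sketch:

def extract_ngrams(sequence, n=2):
    # slide a window of length n over the sequence, collecting each window
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

print(extract_ngrams(["how", "now", "brown", "cow"], n=2))
# [('how', 'now'), ('now', 'brown'), ('brown', 'cow')]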
In the context of malware detection, some kinds of data are most naturally represented as N-gram features. For example, when you disassemble a binary into its constituent instructions, like ["inc", "dec", "sub", "mov"], it makes sense to then use the N-gram approach to capture these sequences of instructions because representing a sequence of instructions can be useful in detecting particular malware implementations. Alternatively, when you’re executing binaries to examine their dynamic behavior, you can use the N-gram approach to represent binaries’ sequences of API calls or high-level behaviors.
I recommend experimenting with N-gram features in your machine learning–based malware detection systems whenever you’re working with data that occurs in some type of sequence. It often takes some trial and error to determine what number you should set n to, which determines the length of your N-grams. This trial and error involves varying the n value to see which value yields the best accuracy on your test data. Once you find the right number, N-grams can be powerful features for capturing the actual sequential behaviors of program binaries, thereby boosting system accuracy.
Now that you know the strengths and weaknesses of different categories of features, you may be wondering why you can’t use all these features at once to build the best possible detector. There are a few reasons using all possible features is not a good idea.
First, extracting all the features we just explored takes a long time, which compromises how quickly your system can scan files. More importantly, if you feed too many features to machine learning algorithms, you can run into memory issues, and your system can take too long to train. This is why, when building your systems, I recommend trying different features and homing in on those that work well on the kinds of malware you’re trying to detect (and the kinds of benignware you’re trying to avoid generating false positives on).
Unfortunately, even if you do home in on one category of features, like string features, you’ll often have more features than most machine learning algorithms can handle. When using string features, you must have one feature for every unique string that occurs in your training data. For example, if training sample A contains the string "hello world", and training sample B contains the string "hello world!", then you’ll need to treat "hello world" and "hello world!" as two separate features. This means that when you’re dealing with thousands of training samples, you’ll quickly encounter thousands of unique strings, and your system will end up using that many features.
To get around the problem of having too many features, you can use a popular and straightforward solution called the hashing trick, also known as feature hashing. The idea is as follows: suppose you have a million unique string features in your training set, but the machine learning algorithm and hardware you’re using can only deal with 4,000 unique features across the whole training set. You need some way of compressing a million features down to a feature vector that’s 4,000 entries long.
The hashing trick makes these million features fit within a feature space of 4,000 by hashing each feature into one of 4,000 indices. Then you add the value of your original feature to the number at that index in your 4,000-dimensional feature vector. Of course, features often collide with this approach because their values are added together along the same dimension. This might affect system accuracy because the machine learning algorithm you’re using can’t “see” the value of individual features anymore. But in practice, this degradation in accuracy is often very small, and the benefit you get from the compressed representation of your features far outweighs this slight degradation that occurs because of the compression operation.
To make these ideas clearer, I walk you through sample code that implements the hashing trick. Here I show this code to illustrate how the algorithm works; later, we’ll use sklearn’s implementation of this function. Our sample code starts with a function declaration:
def apply_hashing_trick(feature_dict, vector_size=2000):
The apply_hashing_trick() function takes two parameters: the original feature dictionary (feature_dict) and the length (vector_size) of the smaller feature vector we produce by applying the hashing trick.
Next, we create the new feature array using the following code:
new_features = [0 for x in range(vector_size)]
The new_features array stores the feature information after applying the hashing trick. Then we perform the key operations of the hashing trick inside a for loop, like in Listing 8-8.
for key in ➊feature_dict:
    array_index = ➋hash(key) % vector_size
    new_features[array_index] += ➌feature_dict[key]
Listing 8-8: Using a for loop to perform a hash operation
Here, we use a for loop to iterate over every feature in the feature dictionary ➊. To do this, first we hash the keys of the dictionary (in the case of string features, these would correspond to the software binaries’ individual strings) and take each hash modulo vector_size, so the resulting values are bounded between zero and vector_size – 1 ➋. We store the result of this operation in the array_index variable.
Still within the for loop, we increment the value of the new_features array entry at index array_index by the value in our original feature dictionary ➌. In the case of string features, where our feature values are set to 1 to indicate that the software binary has that particular string, we would increment this entry by one. In the case of PE header features, where features have a range of values (for example, corresponding to the amount of memory a PE section will take up), we would add the value of the feature to the entry.
Finally, outside of the for loop, we simply return the new_features array, like this:
return new_features
At this point, sklearn can operate on new_features using just thousands instead of millions of unique features.
Listing 8-9 shows the complete code for the hashing trick, which should now be familiar to you.
def apply_hashing_trick(feature_dict,vector_size=2000):
    # create an array of zeros of length 'vector_size'
    new_features = [0 for x in range(vector_size)]

    # iterate over every feature in the feature dictionary
    for key in feature_dict:
        # get the index into the new feature array
        array_index = hash(key) % vector_size

        # add the value of the feature to the new feature array
        # at the index we got using the hashing trick
        new_features[array_index] += feature_dict[key]

    return new_features
Listing 8-9: Complete code for implementing the hashing trick
As you have seen, the feature hashing trick is easy to implement on your own, and doing so ensures that you understand how it works. However, you can also just use sklearn’s implementation, which is easy to use and more optimized.
To use sklearn’s built-in implementation instead of implementing your own hashing solution, you need to first import sklearn’s FeatureHasher class, like this:
from sklearn.feature_extraction import FeatureHasher
Next, instantiate the FeatureHasher class:
hasher = FeatureHasher(n_features=2000)
Here, you set n_features to the length of the hashed feature vectors that result from applying the hashing trick.
Then, to apply the hashing trick to some feature vectors, you simply run them through the FeatureHasher class’s transform method:
features = [{'how': 1, 'now': 2, 'brown': 4},{'cow': 2, '.': 5}]
hashed_features = hasher.transform(features)
The result is effectively the same as our custom implementation of the feature hashing trick shown in Listing 8-9. The difference is that here we’re simply using sklearn’s implementation, since it’s easier to use a well-maintained machine learning library than our own code. The complete sample code is shown in Listing 8-10.
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=10)
features = [{'how': 1, 'now': 2, 'brown': 4},{'cow': 2, '.': 5}]
hashed_features = hasher.transform(features)
Listing 8-10: Implementing FeatureHasher
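If you’d like to inspect what the hasher produced, you can densify the result, which FeatureHasher returns as a sparse matrix (a usage sketch continuing from Listing 8-10):

# one row per input dictionary, each of length n_features;
# FeatureHasher uses a signed hash, so some entries may be negative
print(hashed_features.todense())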
There are a few things to note about feature hashing before we move on. First, as you may have guessed, feature hashing obfuscates the feature information you pass into a machine learning algorithm because it adds feature values together simply based on the fact that they hash to the same bin. This means that, in general, the fewer bins you use (or the more features you hash into some fixed number of bins), the worse your algorithm will perform. Surprisingly, machine learning algorithms can still work well even when using the hashing trick, and because we simply can’t deal with millions or billions of features on modern hardware, we usually have to use the feature hashing trick in security data science.
Another limitation of the feature hashing trick is that it makes it difficult or impossible to recover the original features you hashed when analyzing the internals of your model. Take the example of decision trees: because we’re hashing arbitrary features into each entry of our feature vectors, we don’t know which of the features we added to a given entry is causing a decision tree algorithm to split on this entry, since any number of features could have caused the decision tree to think splitting on this entry was a good idea. Although this is a significant limitation, security data scientists live with it because of the huge benefits of the feature hashing trick in compressing millions of features down to a manageable number.
Now that we’ve gone over the building blocks required for building a real-world malware detector, let’s explore how to build an end-to-end machine learning malware detector.
From a software requirements perspective, our real-world detector will need to do three things: extract features from software binaries for use in training and detection, train itself to detect malware using training data, and actually perform detection on new software binaries. Let’s walk through the code that does each of these things, which will show you how it all fits together.
You can access the code I use in this section at malware_data_science/ch8/code/complete_detector.py in the code accompanying this book, or at the same location in the virtual machine provided with this book. A one-line shell script, malware_data_science/ch8/code/run_complete_detector.sh, shows how to run the detector from the shell.
To create our detector, the first thing we implement is code to extract features from training binaries (I skip over boilerplate code here and focus on the core functions of the program). Extracting features involves extracting the relevant data from training binaries, storing these features within a Python dictionary, and then, if we think our number of unique features will become prohibitively large, transforming them using sklearn’s implementation of the hashing trick.
For simplicity’s sake, we use only string features and choose to use the hashing trick. Listing 8-11 shows how to do both.
def get_string_features(➊path,➋hasher):
    # extract strings from binary file using regular expressions
    chars = r" -~"
    min_length = 5
    string_regexp = '[%s]{%d,}' % (chars, min_length)
    file_object = open(path)
    data = file_object.read()
    pattern = re.compile(string_regexp)
    strings = pattern.findall(data)

    # store string features in dictionary form
    ➌ string_features = {}
    for string in strings:
        string_features[string] = 1

    # hash the features using the hashing trick
    ➍ hashed_features = hasher.transform([string_features])

    # do some data munging to get the feature array
    hashed_features = hashed_features.todense()
    hashed_features = numpy.asarray(hashed_features)
    hashed_features = hashed_features[0]

    # return hashed string features
    ➎ print "Extracted {0} strings from {1}".format(len(string_features),path)
    return hashed_features
Listing 8-11: Defining the get_string_features function
Here, we declare a single function called get_string_features that takes the path to the target binary ➊ and an instance of sklearn’s feature hashing class ➋ as its arguments. We extract the target file’s strings using a regular expression that parses out all printable strings of minimum length 5. Then we store the features in a Python dictionary ➌ for further processing by setting each string’s value to 1 in the dictionary, simply indicating that that feature is present in the binary.
Next, we hash the features using sklearn’s hashing trick implementation by calling the hasher’s transform method. Notice that we wrap the string_features dictionary in a Python list as we pass it into the hasher instance ➍ because sklearn requires that we pass in a list of dictionaries to be transformed, rather than a single dictionary.
Because we passed in our feature dictionary as a list of dictionaries, features are returned as a list of arrays. Additionally, they are returned in sparse format, a compressed representation that can be useful for processing large matrices, which we won’t discuss in this book. We need to get our data back into a normal numpy vector.
To get the data back into normal format, we call the sparse result’s todense() method, convert it to an array with numpy.asarray(), and then select the first array in the list of hasher results to recover our final feature vector. The last step in the function is simply to return the feature vector hashed_features ➎ to the caller.
Because sklearn does most of the hard work of training machine learning systems, training a detector requires only a small amount of code once we’ve extracted machine learning features from our target binaries.
To train a detector, we first need to extract features from our training examples, and then instantiate the feature hasher and the sklearn machine learning detector we wish to use (in this case, we use a random forest classifier). Then we need to call sklearn’s fit method on the detector to train it on the examples’ binaries. Finally, we save the detector and feature hasher to disk so we can use them when we want to scan files in the future.
Listing 8-12 shows the code for training the detector.
def ➊get_training_data(benign_path,malicious_path,hasher):
    def ➋get_training_paths(directory):
        targets = []
        for path in os.listdir(directory):
            targets.append(os.path.join(directory,path))
        return targets
    ➌ malicious_paths = get_training_paths(malicious_path)
    ➍ benign_paths = get_training_paths(benign_path)
    ➎ X = [get_string_features(path,hasher)
          for path in malicious_paths + benign_paths]
    y = [1 for i in range(len(malicious_paths))] + [0 for i in range(len(benign_paths))]
    return X, y

def ➏train_detector(X,y,hasher):
    classifier = RandomForestClassifier()
    ➐ classifier.fit(X,y)
    ➑ pickle.dump((classifier,hasher),open("saved_detector.pkl","w+"))
Listing 8-12: Programming sklearn to train the detector
Let’s start by declaring the get_training_data() function ➊, which extracts features from training examples we provide. The function has three arguments: a path to a directory containing examples of benign binary programs (benign_path), a path to a directory containing examples of malicious binary programs (malicious_path), and an instance of the sklearn FeatureHasher class used to do feature hashing (hasher).
Next, we declare get_training_paths() ➋, a local helper function that gets us the list of absolute file paths for files occurring in a given directory. In the next two lines, we use get_training_paths to get us the lists of paths that occur in the malicious ➌ and benign ➍ training example directories.
Finally, we extract our features and create our label vector. We do this by calling the get_string_features function described in Listing 8-11 on every training example file path ➎. Notice that the label vector has a 1 for every malicious path and a 0 for every benign path, such that the numbers at the indices in the label vector correspond to the label of the feature vectors at those same indices in the X array. This is the form in which sklearn expects feature and label data, and it allows us to tell the library the label for each feature vector.
Now that we’ve finished extracting features and created our feature vector X and our label vector y, we’re ready to tell sklearn to train our detector using the feature vectors and the label vector.
We do this using the train_detector() function ➏, which takes three arguments: the training example feature vectors (X), the label vector (y), and the instance of sklearn’s feature hasher (hasher). In the function body, we instantiate RandomForestClassifier (imported from sklearn’s ensemble module), the sklearn detector. Then we pass X and y into the detector’s fit method to train it ➐, and then use the Python pickle module ➑ to save the detector and hasher for future use.
Now let’s go over how to use the saved detector we just trained to detect malware in new program binaries. Listing 8-13 shows how to write the scan_file() function to do this.
def scan_file(path):
    if not os.path.exists("saved_detector.pkl"):
        print "Train a detector before scanning files."
        sys.exit(1)
    ➊ with open("saved_detector.pkl") as saved_detector:
        classifier, hasher = pickle.load(saved_detector)
    features = ➋get_string_features(path,hasher)
    result_proba = ➌classifier.predict_proba([features])[:,1]
    # consider the file malicious if the malware probability exceeds 0.5
    ➍ if result_proba > 0.5:
        print "It appears this file is malicious!", result_proba
    else:
        print "It appears this file is benign.", result_proba
Listing 8-13: Running the detector on new binaries
Here, we declare the scan_file() function to scan a file to determine whether it’s malicious or benign. Its only argument is the path to the binary that we are going to scan. The function’s first job is to load the saved detector and hasher from the pickle file to which they were saved ➊.
Next, we extract features from the target file using the function get_string_features ➋ we defined in Listing 8-11.
Finally, we call the detector’s predict_proba method ➌ to decide whether or not the file in question is malicious, given the features extracted. We wrap the features in a list (because predict_proba expects a list of feature vectors) and select the second column of the array it returns, which corresponds to the probability that the file is malicious. If this probability is above 0.5, or 50 percent ➍, we say the file is malicious; otherwise, we tell the user that it’s benign. We can change this decision threshold to something much higher to minimize false positives.
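For example, raising the threshold is a one-line change; here’s a sketch, assuming the same classifier and features as in Listing 8-13:

# flag a file only when we're at least 90 percent confident it's malware,
# trading away some detections for fewer false positives
if classifier.predict_proba([features])[0,1] > 0.9:
    print("It appears this file is malicious!")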
Listing 8-14 shows the code for this small-scale but realistic malware detector in its entirety. I hope that the code reads fluidly to you now that you’ve seen how each individual piece works.
#!/usr/bin/python

import os
import sys
import pickle
import argparse
import re

import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

def get_string_features(path,hasher):
    # extract strings from binary file using regular expressions
    chars = r" -~"
    min_length = 5
    string_regexp = '[%s]{%d,}' % (chars, min_length)
    file_object = open(path)
    data = file_object.read()
    pattern = re.compile(string_regexp)
    strings = pattern.findall(data)

    # store string features in dictionary form
    string_features = {}
    for string in strings:
        string_features[string] = 1

    # hash the features using the hashing trick
    hashed_features = hasher.transform([string_features])

    # do some data munging to get the feature array
    hashed_features = hashed_features.todense()
    hashed_features = numpy.asarray(hashed_features)
    hashed_features = hashed_features[0]

    # return hashed string features
    print "Extracted {0} strings from {1}".format(len(string_features),path)
    return hashed_features

def scan_file(path):
    # scan a file to determine if it is malicious or benign
    if not os.path.exists("saved_detector.pkl"):
        print "Train a detector before scanning files."
        sys.exit(1)
    with open("saved_detector.pkl") as saved_detector:
        classifier, hasher = pickle.load(saved_detector)
    features = get_string_features(path,hasher)
    result_proba = classifier.predict_proba([features])[:,1]
    # consider the file malicious if the malware probability exceeds 0.5
    if result_proba > 0.5:
        print "It appears this file is malicious!", result_proba
    else:
        print "It appears this file is benign.", result_proba

def train_detector(benign_path,malicious_path,hasher):
    # train the detector on the specified training data
    def get_training_paths(directory):
        targets = []
        for path in os.listdir(directory):
            targets.append(os.path.join(directory,path))
        return targets
    malicious_paths = get_training_paths(malicious_path)
    benign_paths = get_training_paths(benign_path)
    X = [get_string_features(path,hasher) for path in malicious_paths + benign_paths]
    y = [1 for i in range(len(malicious_paths))] + [0 for i in range(len(benign_paths))]
    classifier = RandomForestClassifier(n_estimators=64)
    classifier.fit(X,y)
    pickle.dump((classifier,hasher),open("saved_detector.pkl","w+"))

def get_training_data(benign_path,malicious_path,hasher):
    def get_training_paths(directory):
        targets = []
        for path in os.listdir(directory):
            targets.append(os.path.join(directory,path))
        return targets
    malicious_paths = get_training_paths(malicious_path)
    benign_paths = get_training_paths(benign_path)
    X = [get_string_features(path,hasher) for path in malicious_paths + benign_paths]
    y = [1 for i in range(len(malicious_paths))] + [0 for i in range(len(benign_paths))]
    return X, y

parser = argparse.ArgumentParser("get windows object vectors for files")
parser.add_argument("--malware_paths",default=None,help="Path to malware training files")
parser.add_argument("--benignware_paths",default=None,help="Path to benignware training files")
parser.add_argument("--scan_file_path",default=None,help="File to scan")
args = parser.parse_args()

hasher = FeatureHasher(20000)
if args.malware_paths and args.benignware_paths:
    train_detector(args.benignware_paths,args.malware_paths,hasher)
elif args.scan_file_path:
    scan_file(args.scan_file_path)
else:
    print "[*] You did not specify a path to scan," \
        " nor did you specify paths to malicious and benign training files;" \
        " please specify one of these to use the detector.\n"
    parser.print_help()
Listing 8-14: Basic machine learning malware detector code
Writing a machine learning–based malware detector is great, but evaluating and improving its performance is necessary if you’re going to deploy the detector with any confidence in its efficacy. Next, you learn different ways to evaluate the performance of your detector.
Conveniently, sklearn contains code that makes it easy to evaluate detection systems using measurements like ROC curves, which you learned about in Chapter 7. The sklearn library also provides additional evaluation functionality specific to evaluating machine learning systems. For example, you can use sklearn’s functions for performing cross-validation, which is a powerful method for predicting how well your detector will work when you deploy it.
In this section, you learn how to use sklearn to plot ROC curves that show your detector’s accuracy. You also learn about cross-validation and how to implement it with sklearn.
Recall that Receiver Operating Characteristic (ROC) curves measure the changes in a detector’s true positive rate (the percentage of malware that it successfully detects) and false positive rate (the percentage of benignware that it falsely flags as malware) as you adjust its sensitivity.
The higher the sensitivity, the more false positives you will get but the greater your detection rate. The lower the sensitivity, the fewer false positives but also the fewer detections you’ll get. To compute a ROC curve you need a detector that can output a threat score such that the higher its value the more likely it is that a binary is malicious. Conveniently, sklearn’s implementations of decision trees, logistic regression, k-nearest neighbors, random forests, and other machine learning approaches covered in this book all provide the option of outputting a threat score that reflects whether a file is malware or benignware. Let’s explore how we can use ROC curves to determine a detector’s accuracy.
To compute a ROC curve for the machine learning detector we built in Listing 8-14, we need to do two things: first, define an experimental setup, and second, implement the experiment using sklearn’s metrics module. For our basic experimental setup, we’ll split our training examples in half such that we use the first half for training and the second half for computing the ROC curve. This split simulates the problem of detecting zero-day malware. Basically, by splitting the data, we’re telling the program, “show me one set of benignware and malware that I’ll use to learn how to identify malware and benignware, and then show me the other set to test me on how well I learned the concept of malware and benignware.” Because the detector has never seen the malware (or benignware) in the test set, this evaluation setup is a simple way to predict how well the detector will do against truly new malware.
Implementing this split with sklearn is straightforward. First, we add an option to the argument parser class of our detector program to signal that we want to evaluate the detector’s accuracy, like this:
parser.add_argument("--evaluate",default=False,
action="store_true",help="Perform cross-validation")
Then, in the part of the program where we process command line arguments, shown in Listing 8-15, we add another elif clause that handles the case where the user has added --evaluate to the command line arguments.
elif args.malware_paths and args.benignware_paths and args.evaluate:
    ➊ hasher = FeatureHasher()
    X, y = ➋get_training_data(
        args.benignware_paths,args.malware_paths,hasher)
    evaluate(X,y,hasher)

def ➌evaluate(X,y,hasher):
    import random
    from sklearn import metrics
    from matplotlib import pyplot
Listing 8-15: Adding an evaluation option to the detector
Let’s walk through this code in detail. First, we instantiate an sklearn feature hasher ➊, get the training data we require for our evaluation experiment ➋, and then call a function named evaluate ➌, which takes the training data (X, y) and the feature hasher instance (hasher) as its parameters and then imports three modules we need to perform the evaluation. We use the random module to randomly select which training examples to use for training the detector and which to use for testing it. We use the metrics module from sklearn to compute the ROC curve, and we use the pyplot module from matplotlib (the de facto standard Python library for data visualization) to visualize the ROC curve.
Inside the evaluate() function, we randomly shuffle the X and y arrays corresponding to our training data and then split them into equally sized training and test sets, as shown in Listing 8-16, which continues defining the evaluate() function begun in Listing 8-15.
➊ X, y = numpy.array(X), numpy.array(y)
➋ indices = range(len(y))
➌ random.shuffle(indices)
➍ X, y = X[indices], y[indices]
splitpoint = len(X) * 0.5
➎ splitpoint = int(splitpoint)
➏ training_X, test_X = X[:splitpoint], X[splitpoint:]
training_y, test_y = y[:splitpoint], y[splitpoint:]
Listing 8-16: Splitting the data into training and test sets
First, we convert X and y into numpy arrays ➊, and then we create a list of indices corresponding to the number of elements in X and y ➋. Next, we randomly shuffle these indices ➌ and reorder X and y based on this new order ➍. This sets us up to randomly assign samples to either our training set or our test set, ensuring that we don’t split the samples simply by the order in which they occur in our experimental data directory. To complete the random split, we divide the arrays in half by finding the array index that evenly splits the dataset in half, rounding this point to the nearest integer using the int() function ➎, and then actually splitting the X and y arrays into training and test sets ➏.
Now that we have our training and test sets, we can instantiate a random forest classifier and train it on the training data using the following:
classifier = RandomForestClassifier()
classifier.fit(training_X,training_y)
Then we use the trained classifier to get scores for our test examples corresponding to the likelihood that these test examples are malicious:
scores = classifier.predict_proba(test_X)[:,-1]
Here, we call the predict_proba() method on our classifier, which predicts the probability that our test examples are benignware or malware. Then, using numpy indexing magic, we pull out only the probabilities that the samples are malicious, as opposed to benign. Keep in mind that these probabilities are redundant (for example, if the probability an example is malicious is 0.99, then the probability it’s benign is 0.01, since probabilities add up to 1.00), so all we need is the malware probability here.
Now that we’ve computed malware probabilities (which we can also refer to as “scores”) using our detector, it’s time to compute our ROC curve. We do this by first calling the roc_curve function within sklearn’s metrics module, like this:
fpr, tpr, thresholds = metrics.roc_curve(test_y, scores)
The roc_curve function tests a variety of decision thresholds, or score thresholds above which we would deem a software binary to be malicious, and measures what the false positive rate and true positive rate of the detector would be at each of those thresholds.
You can see that the roc_curve function takes two arguments: the label vector for our test examples (test_y) and the scores array, which contains our detector’s judgment about how malicious it thinks each training example is. The function returns three related arrays: fpr, tpr, and thresholds. These arrays are all of equal length, such that the false positive rate, true positive rate, and decision threshold at each index correspond to one another.
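You can also summarize the entire curve as a single number, the area under the curve (AUC), using the same metrics module; a brief sketch:

# AUC of 1.0 is a perfect detector; 0.5 is no better than random guessing
print("Detector AUC: %f" % metrics.auc(fpr, tpr))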
Now we can use matplotlib to visualize the ROC curve we just calculated. We do this by calling the plot method on matplotlib’s pyplot module, as shown here:
pyplot.plot(fpr,tpr,'r-')
pyplot.xlabel("Detector false positive rate")
pyplot.ylabel("Detector true positive rate")
pyplot.title("Detector ROC Curve")
pyplot.show()
We call the xlabel, ylabel, and title methods to label the chart’s axes and title, and then the show method to make the chart window pop up.
The resulting ROC curve is shown in Figure 8-2.
Figure 8-2: Visualizing the detector’s ROC curve
You can see from the plot in Figure 8-2 that our detector performs well for such a basic example. At around a 1 percent false positive rate (10^-2), it can detect about 94 percent of the malware samples in the test set. We’re only training it on a few hundred training examples here; to get better accuracy we’d need to train it on tens of thousands, hundreds of thousands, or even millions of examples (alas, scaling machine learning to this degree is beyond the scope of this book).
Although visualizing the ROC curve is useful, we can actually do better at predicting our detector’s real-world accuracy by performing many experiments on our training data, not just one. Recall that to perform our test, we split our training examples in half, training the detector on the first half and testing it on the second half. This is really an insufficient test of our detector. In the real world, we won’t be measured on our accuracy on this particular set of test examples but rather on our accuracy on new, previously unseen malware. To get a better sense of how we’ll perform once we deploy, we need to run more than just one experiment on one set of test data; we need to perform many experiments on many test sets and get a sense of the overall trend in accuracy.
We can use cross-validation to do this. The basic idea behind cross-validation is to split our training examples into a number of folds (here I use three folds, but you can use more). For example, if you had 300 examples and decided to split them into three folds, the first 100 samples would go in the first fold, the second 100 would go in the second fold, and the third 100 would go in the third fold.
Then we run three tests. In the first test, we train the system on folds 2 and 3 and test the system on fold 1. On the second test, we repeat this process but train the system on folds 1 and 3 and test the system on fold 2. On the third test, as you can probably predict by now, we train the system on folds 1 and 2 and test the system on fold 3. Figure 8-3 illustrates this cross-validation process.
Figure 8-3: A visualization of a sample cross-validation process
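To make the fold bookkeeping concrete, here’s a toy sketch in plain Python (using 300 stand-in examples, as in the description above) showing how each fold serves as the test set exactly once:
examples = list(range(300))  # stand-ins for 300 training examples
folds = [examples[0:100], examples[100:200], examples[200:300]]
for i in range(3):
    test_fold = folds[i]
    # concatenate the other two folds to form the training set
    training_folds = [example for j in range(3) if j != i
                      for example in folds[j]]
    print("Test {0}: train on {1} examples, test on {2}".format(
        i + 1, len(training_folds), len(test_fold)))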
The sklearn library makes implementing cross-validation easy. To do this, let’s rewrite our evaluate function from Listing 8-15 as cv_evaluate.
def cv_evaluate(X,y,hasher):
    import random
    from sklearn import metrics
    from matplotlib import pyplot
    from sklearn.cross_validation import KFold
We start the cv_evaluate() function the same way we started our initial evaluation function, except that here we also import the KFold class from sklearn’s cross_validation module. K-fold cross-validation, which the KFold class implements, is exactly the type of cross-validation I just described and is the most common way of doing it.
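One caveat: in newer versions of sklearn, the cross_validation module has been removed, and KFold now lives in the model_selection module with a slightly different interface (you pass the number of splits rather than the number of examples, and you call split on your data to get the index pairs). If the import above fails for you, a sketch of the equivalent code looks like this:
from sklearn.model_selection import KFold

# n_splits replaces the old (number of examples, number of folds) arguments
kfold = KFold(n_splits=3, shuffle=True)
for train, test in kfold.split(X):
    training_X, training_y = X[train], y[train]
    test_X, test_y = X[test], y[test]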
Next, we convert our training data to numpy arrays so that we can use numpy’s enhanced array indexing on it:
X, y = numpy.array(X), numpy.array(y)
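We do this because numpy arrays support “fancy indexing”: you can pull out an arbitrary subset of elements by passing an array of indices, which plain Python lists don’t allow. A quick illustration:
import numpy
a = numpy.array([10, 20, 30, 40])
print(a[[0, 2]])  # prints [10 30]; a plain list would raise a TypeError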
The following code actually starts the cross-validation process:
fold_counter = 0
for train, test in KFold(len(X),3,➊shuffle=True):
    ➋ training_X, training_y = X[train], y[train]
    test_X, test_y = X[test], y[test]
We first instantiate the KFold class, passing in the number of training examples we have as the first parameter and the number of folds we’d like to use as the second. The third argument, shuffle=True ➊, tells sklearn to randomly sort our training data before dividing it into three folds. The KFold instance is actually an iterator that yields a different training/test split on each iteration. Within the for loop, we use the train and test index arrays to assign the corresponding elements to the training_X and training_y arrays ➋ and to the test_X and test_y arrays.
After preparing the training and test data, we’re ready to instantiate and train the RandomForestClassifier, as you’ve learned to do previously in this chapter:
classifier = RandomForestClassifier()
classifier.fit(training_X,training_y)
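Note that we accept the classifier’s default parameters here. If you want to experiment, RandomForestClassifier exposes a number of tuning knobs; for example, the n_estimators parameter controls how many trees the forest builds (the 100 below is an arbitrary choice for illustration):
classifier = RandomForestClassifier(n_estimators=100)  # build 100 trees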
Finally, we compute a ROC curve for this particular fold and then plot a line that represents this ROC curve:
scores = classifier.predict_proba(test_X)[:,-1]  # probability of the malware class
fpr, tpr, thresholds = metrics.roc_curve(test_y, scores)
pyplot.semilogx(fpr,tpr,label="Fold number {0}".format(fold_counter))
fold_counter += 1
Note that we don’t call the matplotlib show method to display the chart just yet. We do this after all the folds have been evaluated and we’re ready to show all three lines at once. As we did in the previous section, we label our axes and give the plot a title, like this:
pyplot.xlabel("Detector false positive rate")
pyplot.ylabel("Detector true positive rate")
pyplot.title("Detector Cross-Validation ROC Curves")
pyplot.legend()
pyplot.grid()
pyplot.show()
The resulting ROC curve is shown in Figure 8-4.
As you can see, our results were similar on every fold, but there is definitely some variation. Our detection rate (true positive rate) over the three runs averages about 90 percent at a 1 percent false positive rate. This estimate, which takes all three cross-validation experiments into account, is more accurate than the estimate we’d get from a single experiment, whose outcome would depend somewhat randomly on which samples happened to land in the training set and which in the test set. By running more experiments, we get a more robust sense of our solution’s efficacy.
Figure 8-4: Plotting the detector’s ROC curve using cross-validation
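If you’d rather compute that average detection rate programmatically than eyeball it from the plot, one approach is to collect each fold’s fpr and tpr arrays in a list as the loop runs (fold_rocs below is a hypothetical addition to cv_evaluate, not part of the chapter’s code) and interpolate each curve at a 1 percent false positive rate:
# fold_rocs is assumed to hold one (fpr, tpr) pair per fold, appended
# inside the cross-validation loop above
rates = [numpy.interp(0.01, fpr, tpr) for fpr, tpr in fold_rocs]
print("Mean detection rate at a 1% false positive rate: {0:.1%}".format(
    sum(rates) / len(rates)))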
Note that these results are not great because we’re training on a very small amount of data: a few hundred malware and benignware samples. At my day job, where we train large-scale machine learning malware detection systems, we usually train on hundreds of millions of samples. You don’t need hundreds of millions of samples to train your own malware detector, but you’ll want to assemble datasets of at least tens of thousands of samples to start getting really good performance (for example, a 90 percent detection rate at a 0.1 percent false positive rate).
So far, I’ve covered how to use Python and sklearn to extract features from a training dataset of software binaries and then how to train and evaluate a decision tree–based machine learning detector. To improve the system, you can use features other than (or in addition to) printable string features (for example, the PE header, instruction N-gram, or Import Address Table features discussed previously), or you could use a different machine learning algorithm.
To make the detector more accurate, I recommend going beyond sklearn’s RandomForestClassifier (sklearn.ensemble.RandomForestClassifier) and trying other classifiers. Recall from the previous chapter that random forest detectors are also based on decision trees, but instead of building just one decision tree, they build many, randomizing the way each is built. To determine whether a new file is malware or benignware, each of these decision trees makes an individual decision, and we combine the decisions by summing them and dividing by the total number of trees to get the average result.
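As a toy illustration of that averaging step (this is a sketch of the idea, not sklearn’s internal implementation), imagine we already have a list of trained trees, each with a predict method that returns 1 for malware and 0 for benignware:
def forest_score(trees, features):
    # each tree casts a 0/1 vote; the forest's score is the average vote
    votes = [tree.predict(features) for tree in trees]
    return float(sum(votes)) / len(trees)
A score near 1.0 means the trees largely agree that the file is malicious, whereas a score near 0.5 means the forest is uncertain.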
You can also use other algorithms that sklearn provides, such as logistic regression. Using any of these algorithms can be as simple as doing a search and replace in the sample code discussed in this chapter. For example, in this chapter we instantiate and train our random forest classifier as follows:
classifier = RandomForestClassifier()
classifier.fit(training_X,training_y)
But you can simply replace that code with this:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(training_X,training_y)
This replacement yields a logistic regression detector instead of a random forest detector. Because LogisticRegression also implements the predict_proba method, the rest of the evaluation code works unchanged. By computing a new cross-validation–based evaluation of this logistic regression detector and comparing it to the results from Figure 8-4, you can determine which approach works better on your data.
In this chapter, you learned the ins and outs of building machine learning–based malware detectors. Specifically, you learned how to extract features from software binaries for machine learning, how to compress these features using the hashing trick, and how to train machine learning–based malware detectors using these extracted features. You also learned how to plot ROC curves to examine the relationship between a detector’s detection threshold and its true and false positive rates. Finally, you learned about cross-validation, a more advanced evaluation concept, and other possible extensions to enhance the detector used in this chapter.
This concludes this book’s discussion of machine learning–based malware detection using sklearn; we’ll cover another family of machine learning methods, known as deep learning methods or artificial neural networks, in Chapters 10 and 11. You now have the basic knowledge necessary to use machine learning effectively in the context of malware identification. I encourage you to read more about machine learning. Because computer security is in many ways a data analysis problem, machine learning is here to stay in the security industry, and it will continue to be useful not only for detecting malicious binaries but also for detecting malicious behavior in network traffic, system logs, and other contexts.
In the next chapter, we’ll take a deep dive into visualizing malware relationships, which can help us quickly understand the similarities and differences between large numbers of malware samples.