5
SHARED CODE ANALYSIS


Suppose you discovered a new malware sample on your network. How would you begin to analyze it? You could submit it to a multi-engine antivirus scanner such as VirusTotal to learn what malware family it belongs to. However, such results are often ambiguous, because engines frequently label malware with generic terms like “agent” that convey little about its origin or behavior. You could also run the sample through CuckooBox or some other malware sandbox to get a limited report on the malware sample’s callback servers and behaviors.

When these approaches don’t provide enough information, you may need to reverse-engineer the sample. At this stage, shared code analysis can dramatically improve your workflow. By revealing which previously analyzed samples the new malware sample is similar to, and thus revealing the code they share, shared code analysis allows you to reuse your previous analyses on new malware so that you’re not starting from scratch. Understanding where this previously seen malware came from can also help you figure out who may have deployed the malware.

Shared code analysis, also called similarity analysis, is the process by which we compare two malware samples by estimating the percentage of precompilation source code they share. It differs from shared attribute analysis, which compares malware samples based on their external attributes (the desktop icons they use, for example, or the servers they call out to).

In reverse engineering, shared code analysis helps identify samples that can be analyzed together (because they were generated from the same malware toolkit or are different versions of the same malware family), which in turn can help determine whether the same developers could have been responsible for a group of malware samples.

Consider the output shown in Listing 5-1, which comes from a program you’ll build later in this chapter to illustrate the value of malware shared code analysis. It shows previously seen samples that may share code with the new sample as well as comments made on those older samples.


Listing 5-1: The results of basic shared code analysis

Given a new sample, shared code estimation allows us to see, within seconds, which samples it likely shares code with and what we know about those samples. In this example, it reveals that a very similar sample is from a known APT, or advanced persistent threat, thus providing immediate context for this new malware.

We can also visualize sample shared code relationships using network visualization, which you learned about in Chapter 4. For example, Figure 5-1 shows a network of shared code relationships between samples in an advanced persistent threat dataset.

As you can see from the visualization, automated shared code analysis techniques can quickly uncover the existence of malware families that would have taken days or weeks to discover through manual analysis. In this chapter, you’ll learn to use these techniques to identify new malware families, determine a new sample’s code similarity to previously seen malware, and visualize malware relationships to understand patterns of code sharing.


Figure 5-1: An example of the kind of visualization you will learn to create in this chapter, showing shared code relationships between some of the APT1 samples

First, I introduce the test malware samples you’ll be using in this chapter, which are the PLA APT1 samples from Chapter 4 and an assortment of crimeware samples. Then, you learn about mathematical similarity comparison and the concept of the Jaccard index, a set-theoretic method for comparing malware samples in terms of their shared features. Next, I introduce the concept of features to show how you can use them in conjunction with the Jaccard index to approximate the amount of code two malware samples share. You also learn how to evaluate malware features in terms of their usefulness. Finally, we create visualizations of malware code sharing at multiple scales, as shown in Figure 5-1, by leveraging your knowledge of network visualization from Chapter 4.

Preparing Samples for Comparison by Extracting Features

How do we even begin to think about estimating the amount of code two malicious binaries may have shared before they were compiled by attackers? There are many ways one might consider approaching this problem, but in the hundreds of computer science research papers that have been published on the topic, a common theme has emerged: to estimate the amount of shared code between binaries, we group malware samples into “bags of features” before comparing them.

By features I mean any malware attribute we might possibly want to consider in estimating the code similarity between samples. For example, the features we use could be the printable strings we can extract from the binaries. Instead of thinking of the samples as an interconnected system of functions, dynamic library imports, and so on, we think of malware as a bag of independent features for mathematical convenience (for example, a set of strings that have been extracted from the malware).

How Bag of Features Models Work

To understand how a bag of features works, consider a Venn diagram between two malware samples, as shown in Figure 5-2.

Here, sample A and sample B are shown as bags of features (features are represented as ellipses inside the Venn diagram). We can compare them by examining which features are shared between the two samples. Computing the overlap between two sets of features is fast, and can be used to compare malware samples’ similarity based on arbitrary features that we come up with.

For example, when dealing with packed malware, we may want to use features based on malware dynamic run logs since running malware in a sandbox is a way to get malware to unpack itself. In other cases, we may use strings extracted from the static malware binary to perform the comparison.


Figure 5-2: An illustration of the “bag of features” model for malware code sharing analysis

In the case of dynamic malware analysis, we may want to compare samples based not just on what behaviors they share but also on the order in which they express behaviors, or what we call their sequences of behaviors. A common way to incorporate sequence information into malware sample comparisons is to extend the bag of features model to accommodate sequential data using N-grams.

What Are N-Grams?

An N-gram is a subsequence of length N taken from some larger sequence of events. We extract this subsequence by sliding a window over the sequential data: we iterate over the sequence and, at each step, record the subsequence from the event at index i to the event at index i + N – 1, as shown in Figure 5-3.

In Figure 5-3, the sequence of integers (1,2,3,4,5,6,7) is translated into five different subsequences of length 3: (1,2,3), (2,3,4), (3,4,5), (4,5,6), (5,6,7).

Of course, we can do this with any sequential data. For example, using an N-gram word length of 2, the sentence “how now brown cow” yields the following subsequences: “how now”, “now brown”, and “brown cow.” In malware analysis, we would extract N-grams of sequential API calls that a malware sample made. Then we would represent the malware as a bag of features and use N-gram features to compare the malware sample to some other malware sample’s N-grams, thereby incorporating sequence information into the bag of features comparison model.
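To make the sliding-window procedure concrete, here is a minimal sketch of an N-gram extraction helper in Python (the ngrams() function is my own illustration, not one of this chapter’s listings):

def ngrams(sequence, n=3):
    """
    Slide a window of length n over 'sequence' and return the list of
    subsequences as tuples (tuples can later be stored in a set).
    """
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

print(ngrams([1, 2, 3, 4, 5, 6, 7], n=3))
# [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7)]
print(ngrams("how now brown cow".split(), n=2))
# [('how', 'now'), ('now', 'brown'), ('brown', 'cow')]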


Figure 5-3: A visual explanation of how we can extract N-grams from malware’s assembly instructions and dynamic API call sequences, where N = 3

Including sequence information in our comparison of malware samples has advantages and disadvantages. The advantage is that when order matters (for example, when we care that API call A was observed before API call B, which was observed before API call C), N-grams allow us to capture that order. The disadvantage is that when order is superfluous (for example, when malware randomizes the order of API calls A, B, and C on every run), including it can actually make our shared code estimation much worse. Deciding whether to include order information in our malware shared code estimation work depends on what kind of malware we’re working with, and requires that we experiment.

Using the Jaccard Index to Quantify Similarity

Once you’ve represented a malware sample as a bag of features, you’ll need to measure the degree of similarity between that sample’s bag of features and some other sample’s bag of features. To estimate the extent of code sharing between two malware samples, we use a similarity function that should yield a normalized score between 0 (no features in common) and 1 (identical feature sets) and that remains fast to compute over many pairs of samples.

The Jaccard index is a simple function that has these properties. In fact, even though other mathematical approaches to code similarity estimation have been tried in the security research community (for example, cosine distance, L1 distance, Euclidean [L2] distance, and so on), the Jaccard index has emerged as the most widely adopted—and for good reason. It simply and intuitively expresses the degree of overlap between two sets of malware features, giving us the number of unique features common to both sets divided by the total number of unique features across the two sets.

Figure 5-4 illustrates examples of Jaccard index values.


Figure 5-4: A visual illustration of the idea behind the Jaccard index

This illustrates four pairs of malware features extracted from four pairs of malware samples. Each image shows the features shared between the two sets, the features not shared between the two sets, and the resulting Jaccard index for the given pair of malware samples and associated features. You can see that the Jaccard index between the samples is simply the number of features shared between the samples divided by the total number of features drawn in the Venn diagram.
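As a quick worked example, here is the same calculation in Python on two tiny, invented feature sets: three features are shared, and five unique features exist across both sets, so the Jaccard index is 0.6.

sample_a = {"CreateFile", "WriteFile", "RegSetValue", "Connect"}
sample_b = {"CreateFile", "WriteFile", "Connect", "Sleep"}

shared = sample_a & sample_b  # features present in both samples
total = sample_a | sample_b   # unique features present in either sample
print(len(shared) / float(len(total)))  # prints 0.6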

Using Similarity Matrices to Evaluate Malware Shared Code Estimation Methods

Let’s discuss four methods for determining whether two malware samples come from the same family: instruction sequence-based similarity, strings-based similarity, Import Address Table–based similarity, and dynamic API call–based similarity. To compare these four methods, we’ll use a similarity matrix visualization technique. Our goal here will be to compare the relative strengths and weaknesses of each method in terms of its ability to illuminate shared code relationships between samples.

To get started, let’s go over the concept of a similarity matrix. Figure 5-5 compares an imaginary set of four malware samples using a similarity matrix.


Figure 5-5: An illustration of a notional similarity matrix

This matrix allows you to see the similarity relationship between all samples. You can see that some space is wasted in this matrix. For example, we don’t care about the similarities represented in shaded boxes, as these entries just contain comparisons between a given sample and itself. You can also see that the information on either side of the shaded boxes is repeated, so you only need to look at one or the other.
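If you’d like to produce similarity matrices for your own datasets, the following is a minimal sketch of one way to compute and render one, assuming you’ve already extracted one feature set per sample and have matplotlib installed. The plot_similarity_matrix() helper is my own; it is not the tooling used to produce the figures in this chapter.

import numpy as np
import matplotlib.pyplot as plt

def plot_similarity_matrix(feature_sets, labels):
    """
    Compute pairwise Jaccard indices for a list of feature sets and
    render them as a shaded similarity matrix (lighter = more similar).
    """
    n = len(feature_sets)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = feature_sets[i] | feature_sets[j]
            if union:
                matrix[i, j] = len(feature_sets[i] & feature_sets[j]) / float(len(union))
    plt.imshow(matrix, cmap="gray", vmin=0, vmax=1)
    plt.xticks(range(n), labels, rotation=90)
    plt.yticks(range(n), labels)
    plt.show()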

Figure 5-6 gives a real-world example of a malware similarity matrix. Note that due to the large number of malware samples shown in the figure, each similarity value is represented by a shaded pixel. Instead of rendering the names of each sample, we render the family names for each sample along the horizontal and vertical axes. A perfect similarity matrix would look like a chain of white squares running diagonally from the top left to the bottom right, since the rows and columns representing each family are grouped together, and we expect all members of a given family to be similar to one another, but not to samples from other families.


Figure 5-6: A real-world malware similarity matrix computed over the seven malware families

In the results given in Figure 5-6, you can see that some of the family squares are completely white—these are good results, because white pixels within a family square indicate an inferred similarity relationship between samples of the same family. Some are much darker, which means we did not detect strong similarity relationships. Finally, sometimes there are lines of pixels outside the family squares, which are either evidence of related malware families or false positives, meaning that we detected code-sharing between families despite their being inherently different.

Next, we’ll use similarity matrix visualizations like Figure 5-6 to compare the results of four different code-sharing estimation methods, starting with a description of instruction sequence-based similarity analysis.

Instruction Sequence–Based Similarity

The most intuitive way to compare two malware binaries in terms of the amount of code they share is by comparing their sequences of x86 assembly instructions, since samples that share sequences of instructions are likely to have shared actual source code before compilation. This requires disassembling malware samples using, for example, the linear disassembly technique introduced in Chapter 2. Then we can use the N-gram extraction approach I discussed previously to extract sequences of instructions in the order they appear in the .text section of the malware file. Finally, we can use the instruction N-grams to compute Jaccard indices between samples to estimate how much code we think they share.
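Here is a minimal sketch of that feature extraction step, assuming the pefile and capstone libraries are installed and that the sample is a 32-bit PE. The instruction_ngrams() helper is my own illustration, not one of this chapter’s listings, and it uses only instruction mnemonics as features.

import capstone
import pefile

def instruction_ngrams(path, n=5):
    """
    Linearly disassemble the .text section of the PE file at 'path' and
    return the set of n-grams of instruction mnemonics.
    """
    pe = pefile.PE(path)
    disassembler = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_32)
    mnemonics = []
    for section in pe.sections:
        if section.Name.rstrip(b"\x00") == b".text":
            base = pe.OPTIONAL_HEADER.ImageBase + section.VirtualAddress
            mnemonics = [insn.mnemonic for insn in
                         disassembler.disasm(section.get_data(), base)]
    return set(tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1))

You can then pass two samples’ N-gram sets to a Jaccard index function, such as the jaccard() helper defined later in Listing 5-2, to estimate how much code they share.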

The value we use for N during N-gram extraction depends on our analysis goals. The larger N is, the larger our extracted instruction subsequences will be, and thus the harder it will be for malware samples’ sequences to match. Setting N to a large number helps identify only samples that are highly likely to share code with one another. On the other hand, you can make N smaller to look for subtle similarities between samples, or if you suspect that the samples employ instruction reordering to obscure similarity analysis.

In Figure 5-7, N is set to 5, which is an aggressive setting that makes it harder for samples to match.


Figure 5-7: The similarity matrix generated using instruction N-gram features. Using N = 5, we completely miss many families’ similarity relationships but do well on webprefix and pasta.

The results in Figure 5-7 are not very compelling. While the instruction-based similarity analysis correctly identifies similarities between some families, it misses most of the similarity relationships within others (for example, it detects few similarity relationships in dapato, skor, and vbna). It’s important to note, however, that there are few false positives in this analysis, that is, few false inferences of similarity between samples from different families.

As you can see, a limitation of instruction subsequence shared code analysis is that it can miss many code-sharing relationships between samples. This is because malware samples may be packed such that most of their instructions only become visible once we execute the malware samples and let them unpack themselves. Without unpacking our malware samples, the instruction sequence shared code estimation method will likely not work very well.

Even when we unpack our malware samples, the approach can be problematic, because of the noise introduced by the source code compilation process. Indeed, compilers can compile the same source code into radically different sequences of assembly instructions. Take, for example, the following simple function written in C:

int f(void) {
    int a = 1;
    int b = 2;
    return (a*b)+3;
}

You might think that regardless of compiler, the function would reduce to the same sequence of assembly instructions. But in fact, compilation depends heavily not just on what compiler you use but also on the compiler settings. For example, compiling this function using the clang compiler under its default settings yields the following instructions for the body of the function:

movl    $1, -4(%rbp)
movl    $2, -8(%rbp)
movl    -4(%rbp), %eax
imull   -8(%rbp), %eax
addl    $3, %eax

In contrast, compiling the same function with the -O3 flag set, which tells the compiler to optimize the code for speed, yields the following assembly for the same source code:

movl    $5, %eax

The difference results from the fact that in the second example, the compiler precomputed the result of the function at compile time instead of emitting instructions to compute it at runtime, as in the first example. This means that if we compared these functions based on instruction sequences, they wouldn’t appear at all similar, even though in reality they were compiled from exactly the same source code.

Beyond the problem of identical C and C++ code appearing very different in its assembly instructions, there’s an additional problem that arises when we compare binaries based on their assembly code: many malware binaries are now authored in high-level languages like C#. These binaries contain standard boilerplate assembly code that simply interprets the higher-level language’s bytecode, so two binaries written in the same high-level language may share very similar x86 instructions even when the underlying bytecode, and the source code behind it, is very different.

Strings-Based Similarity

We can compute strings-based malware similarity by extracting all contiguous printable sequences of characters in the samples and then computing the Jaccard index between all pairs of malware samples based on their shared string relationships.
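For illustration, here is a minimal, pure-Python sketch of that string extraction step; the chapter’s own listings shell out to the Unix strings utility instead, and the printable_strings() helper and its five-character minimum are my own choices.

import re

STRING_PATTERN = re.compile(br"[\x20-\x7e]{5,}")  # runs of printable ASCII, length 5 or more

def printable_strings(path):
    """
    Return the set of printable ASCII strings found in the file at 'path'.
    """
    with open(path, "rb") as f:
        data = f.read()
    return set(STRING_PATTERN.findall(data))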

This approach gets around the compiler problem because the strings extracted from a binary tend to be format strings defined by the programmer, which compilers as a general rule do not transform, regardless of which compiler or settings the malware authors use. For example, a typical string extracted from a malware binary might read, “Started key logger at %s on %s and time %s.” Regardless of compiler settings, this string will tend to appear unchanged in every binary built from that code base, so its presence in multiple binaries is good evidence that they share source code.

Figure 5-8 shows how well the string-based code-sharing metric identifies the correct code-sharing relationships in the crimeware dataset.


Figure 5-8: The similarity matrix generated using string features

At first glance, this method does far better at identifying the malware families than the instruction-based method, accurately recovering many of the similarity relationships for all seven families. However, unlike the instruction similarity method, it produces a few false positives, incorrectly predicting that xtoober and dapato share some level of code. It’s also worth noting that this method failed to detect some similarities within families, performing particularly poorly on the zango, skor, and dapato families.

Import Address Table–Based Similarity

We can compute what I call “Import Address Table–based similarity” by comparing the DLL imports made by malware binaries. The idea behind this approach is that even if the malware author has reordered instructions, obfuscated the initialized data section of the malware binary, and implemented anti-debugger and anti-VM anti-analysis techniques, they may have left the exact same import declarations in place. The results for the Import Address Table method are shown in Figure 5-9.


Figure 5-9: The similarity matrix generated using Import Address Table features

The figure shows that the Import Address Table method does better than any of the preceding methods at estimating the similarity relationships between the webprefix and xtoober samples and does very well overall, even though it misses many of the skor, dapato, and vbna relationships. It’s also notable that this method gives few false positives on our experimental dataset.
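The import features themselves are straightforward to extract. Here is a minimal sketch using the pefile library; the import_features() helper is my own illustration, and samples without a parseable import table simply yield an empty feature set.

import pefile

def import_features(path):
    """
    Return the set of DLL/function import pairs declared by the PE file
    at 'path', for use as Import Address Table-based features.
    """
    pe = pefile.PE(path)
    features = set()
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.lower()
        for imp in entry.imports:
            if imp.name is not None:  # skip imports by ordinal
                features.add(dll + b"!" + imp.name)
    return features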

Dynamic API Call–Based Similarity

The final comparison method I introduce in this chapter is dynamic malware similarity. The advantage of comparing dynamic sequences is that even if malware samples are extremely obfuscated or packed, they will tend to perform similar sequences of actions within a sandboxed virtual machine as long as they’re derived from the same code or borrow code from one another. To implement this approach, you’ll need to run malware samples in a sandbox and record the API calls they make, extract N-grams of API calls from the dynamic logs, and finally compare the samples by taking the Jaccard index between their bags of N-grams.
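Here is a minimal sketch of the feature extraction step, assuming a Cuckoo-style JSON report in which each traced process carries a 'calls' list whose entries store the API name under an 'api' key; if your sandbox emits a different format, only the parsing lines need to change. The dynamic_ngrams() helper is my own illustration, not one of this chapter’s listings.

import json

def dynamic_ngrams(report_path, n=3):
    """
    Parse a sandbox report and return the set of n-grams of API call
    names, in the order the calls were recorded.
    """
    with open(report_path) as f:
        report = json.load(f)
    calls = []
    for process in report.get("behavior", {}).get("processes", []):
        calls.extend(call["api"] for call in process.get("calls", []))
    return set(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))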

Figure 5-10 shows that the dynamic N-gram similarity approach does about as well as the import and string methods in most cases.


Figure 5-10: The similarity matrix generated using dynamic API call N-gram features

The imperfect results here show that this method is no panacea. Simply running malware in a sandbox is not sufficient to trigger many of its behaviors. Variations of a command line malware tool, for example, may or may not enable an important code module, and therefore execute different sequences of behavior, even though they may share most of their code.

Another problem is that some samples detect that they’re running in a sandbox and then promptly exit execution, leaving us with little information to make comparisons. In summary, like the other similarity approaches I’ve outlined, dynamic API call sequence similarity isn’t perfect, but it can provide impressive insight into similarities between samples.

Building a Similarity Graph

Now that you understand the concepts behind methods for identifying malware code sharing, let’s build a simple system that performs this analysis over a malware dataset.

First, we need to estimate the amount of code that samples share by extracting the features we would like to use. These could be any of the features described previously, such as Import Address Table imports, strings, N-grams of instructions, or N-grams of dynamic behavior. Here, we’ll use printable string features because they perform well and are simple to extract and understand.

Once we’ve extracted the string features, we need to iterate over every pair of malware samples, comparing their features using the Jaccard index. Then, we need to build a code-sharing graph. To do this, we first need to decide on a threshold above which we consider two samples to share a significant amount of code—a standard value I use in my research is 0.8. If the Jaccard index for a given pair of malware samples is above that value, we create a link between them for visualization. The final step is to study the graph to see which samples are connected by shared code relationships.

Listings 5-2 through 5-6 contain our sample program. Because the listing is long, I break it into pieces and explain each piece as I go. Listing 5-2 imports the libraries we’ll use, and declares the jaccard() function, which computes the Jaccard index between two samples’ sets of features.

#!/usr/bin/python

import argparse
import os
import networkx
from networkx.drawing.nx_pydot import write_dot
import itertools

def jaccard(set1, set2):
    """
    Compute the Jaccard index between two sets by taking
    their intersection, union and then dividing the number
    of elements in the intersection by the number of elements
    in their union.
    """
    intersection = set1.intersection(set2)
    intersection_length = float(len(intersection))
    union = set1.union(set2)
    union_length = float(len(union))
    return intersection_length / union_length

Listing 5-2: The imports and a helper function to compute the Jaccard index between two samples

Next, in Listing 5-3, we declare two additional utility functions: getstrings(), which finds the set of printable string sequences within the malware files we’ll be analyzing, and pecheck(), which ensures that target files are indeed Windows PE files. We’ll use these functions later when we’re performing feature extraction on the target malware binaries.

def getstrings(fullpath):
    """
    Extract strings from the binary indicated by the 'fullpath'
    parameter, and then return the set of unique strings in
    the binary.
    """
    strings = os.popen("strings '{0}'".format(fullpath)).read()
    strings = set(strings.split("\n"))
    return strings

def pecheck(fullpath):
    """
    Do a cursory sanity check to make sure 'fullpath' is
    a Windows PE executable (PE executables start with the
    two bytes 'MZ')
    """
    return open(fullpath).read(2) == "MZ"

Listing 5-3: Declaring the functions we’ll use in feature extraction

Next, in Listing 5-4, we parse our user’s command line arguments. These arguments include the target directory in which the malware we’ll be analyzing exists, the output .dot file to which we’ll write the shared code network we build, and the Jaccard index threshold, which determines how high the Jaccard index must be between two samples for the program to decide that they share a common code base with one another.

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Identify similarities between malware samples and build similarity graph"
    )

    parser.add_argument(
        "target_directory",
        help="Directory containing malware"
    )

    parser.add_argument(
        "output_dot_file",
        help="Where to save the output graph DOT file"
    )

    parser.add_argument(
        "--jaccard_index_threshold", "-j", dest="threshold", type=float,
        default=0.8, help="Threshold above which to create an 'edge' between samples"
    )

    args = parser.parse_args()

Listing 5-4: Parsing the user’s command line arguments

Next, in Listing 5-5, we use the helper functions we declared earlier to do the main work of the program: finding PE binaries in the target directory, extracting features from them, and initializing a network that we’ll use to express similarity relationships between the binaries.

malware_paths = []  # where we'll store the malware file paths
malware_features = dict()  # where we'll store the malware strings
graph = networkx.Graph()  # the similarity graph

for root, dirs, paths in os.walk(args.target_directory):
    # walk the target directory tree and store all of the file paths
    for path in paths:
        full_path = os.path.join(root, path)
        malware_paths.append(full_path)

# filter out any paths that aren't PE files
malware_paths = filter(pecheck, malware_paths)

# get and store the strings for all of the malware PE files
for path in malware_paths:
    features = getstrings(path)
    print "Extracted {0} features from {1} ...".format(len(features), path)
    malware_features[path] = features

    # add each malware file to the graph
    graph.add_node(path, label=os.path.split(path)[-1][:10])

Listing 5-5: Extracting features from PE files in the target directory and initializing the shared code network

After extracting features from our target samples, we need to iterate over every pair of malware samples, comparing their features using the Jaccard index. We do this in Listing 5-6. We also build a code-sharing graph where samples are linked together if their Jaccard index is above some user-defined threshold. The threshold I’ve found works best in my research is 0.8.

# iterate through all pairs of malware
for malware1, malware2 in itertools.combinations(malware_paths, 2):

    # compute the jaccard index for the current pair
    jaccard_index = jaccard(malware_features[malware1], malware_features[malware2])

    # if the jaccard index is above the threshold, add an edge
    if jaccard_index > args.threshold:
        print malware1, malware2, jaccard_index
        graph.add_edge(malware1, malware2, penwidth=1+(jaccard_index-args.threshold)*10)

# write the graph to disk so we can visualize it
write_dot(graph, args.output_dot_file)

Listing 5-6: Creating a code-sharing graph in Python

The code in Listings 5-2 through 5-6 produces the graph shown in Figure 5-11 when applied to the APT1 malware samples. To render the graph, use the fdp Graphviz tool (discussed in Chapter 4) by entering the command fdp -Tpng network.dot -o network.png.


Figure 5-11: The complete string-based similarity graph for the APT1 samples

The amazing thing about this output is that within a few minutes, we reproduced much of the manual, painstaking work that the original APT1 analysts performed in producing their report, identifying many of the malware families used by these nation state–level attackers.

We know that our method has performed accurately relative to the manual reverse engineering work that these analysts performed, because the names on the nodes are the names given to them by the Mandiant analysts. You can see this in the way samples with similar names group together in the network visualization in Figure 5-11, such as the “STARSYPOUN” samples in the central circle. Because the malware in our network visualization automatically groups together in a way that aligns with these family names, our method seems to “agree” with the Mandiant malware analysts. You can extend the code in Listings 5-2 through 5-6 and apply it to your own malware for similar intelligence.

Scaling Similarity Comparisons

Although the code in Listings 5-2 through 5-6 works well for small malware datasets, it doesn’t work well for a large number of malware samples, because the number of pairwise comparisons grows quadratically with the number of samples in the dataset. Specifically, the following equation gives the number of Jaccard index computations necessary to compute a similarity matrix over a dataset of size n:

(n² − n) / 2

For example, let’s return to the similarity matrix in Figure 5-5 to see how many Jaccard indices we would need to compare the four samples. At first glance, you might say 16 (4²), because that’s how many cells are in the matrix. However, because the bottom triangle of the matrix contains duplicates of the top triangle of the matrix, we don’t need to compute these twice. This means that we can subtract 6 from our total number of computations. Furthermore, we don’t need to compare malware samples to themselves, so we can eliminate the diagonal from the matrix, allowing us to subtract four more computations.

The number of computations necessary is as follows:

16 − 6 − 4 = 6

This seems manageable, until our dataset grows to 10,000 malware samples, for example, which would require 49,995,000 computations. A dataset that has 50,000 samples would require 1,249,975,000 Jaccard index computations!

To scale malware similarity comparisons, we need to use randomized comparison approximation algorithms. The basic idea is to allow for some error in our computation of comparisons in exchange for a reduction in computation time. For our purposes, an approximate comparison approach known as minhash serves this purpose beautifully. The minhash method lets us approximate the Jaccard index while avoiding comparisons between samples whose similarity falls below some predefined threshold, so that we can analyze shared code relationships between millions of samples.

Before you read about why minhash works, note that this is a tricky algorithm that can take some time to understand. If you decide to skip the “Minhash in Depth” section, just read the “Minhash in a Nutshell” section and use the code provided, and you should have no problems scaling your code sharing analysis.

Minhash in a Nutshell

Minhash takes a malware sample’s features and hashes them with k hash functions. For each hash function, we retain only the minimum value of the hashes computed over all the features, so that the set of malware features is reduced to a fixed size array of k integers, which we call the minhashes. To compute the approximate Jaccard index between two samples based on their minhash arrays, you now just need to check how many of the k minhashes match, and divide that by k.

Magically, the number that falls out of these computations is a close approximation of the true Jaccard index between any two samples. The benefit of using minhash instead of a literal computation of the Jaccard index is that it’s much faster to compute.
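As a minimal sketch of the comparison step just described (the full minhash computation itself appears later in Listing 5-9), estimating the Jaccard index from two length-k minhash arrays is simply a matter of counting matching positions:

import numpy as np

def approx_jaccard(minhashes_a, minhashes_b):
    """
    Estimate the Jaccard index between two samples as the fraction of
    positions at which their minhash arrays agree.
    """
    minhashes_a = np.asarray(minhashes_a)
    minhashes_b = np.asarray(minhashes_b)
    return (minhashes_a == minhashes_b).sum() / float(len(minhashes_a))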

In fact, we can even use minhash to cleverly index malware in a database such that we only need to compute comparisons between malware samples that are likely to be similar because at least one of their hashes matched, thereby dramatically speeding up computation of similarities within malware datasets.

Minhash in Depth

Let’s now discuss the math behind minhash in depth. Figure 5-12 shows the sets of features (represented by the shaded circles) for two malware samples, how they are hashed and then sorted based on their hashes, and how they’re finally compared based on the value of the first element of each list.


Figure 5-12: An illustration of the idea behind minhash

The probability that the first elements will match is equal to the Jaccard index between the samples. Why this works is beyond the scope of this book, but this serendipitous fact is what lets us approximate the Jaccard index using hashes.

Of course, just performing this hashing, sorting, and first-element-checking operation doesn’t tell us that much if we only do it once—the hashes either match or they don’t, and we can’t guess the underlying Jaccard index very accurately based on that one match. To get a better estimate of this underlying value, we have to use k hash functions and repeat this operation k times, and then estimate the Jaccard index by dividing the number of times these first elements matched by k. Our expected error in estimating the Jaccard index is defined as the following:

1 / √k

So the more times we perform this procedure, the more certain we’ll be (I tend to set k to 256 so that the estimate is off by about 6 percent, on average).

Suppose we compute a minhash array for every malware sample in a malware dataset containing one million samples. How do we then use the minhashes to speed up the search for malware families in the dataset? We could iterate over every pair of malware samples in the dataset and compare their minhash arrays, which would lead to 499,999,500,000 comparisons. Even though it’s faster to compare minhash arrays than to compute the Jaccard index, this is still way too many comparisons to make on modern hardware. We need some way of exploiting the minhashes to optimize the comparison process even more.

The standard approach to this problem is to use sketching combined with database indexing, which creates a system in which we compare only samples that we already know are highly likely to be similar. We make a sketch by hashing multiple minhashes together.

When we get a new sample, we check whether the database contains any sketches that match the new sample’s sketches. If so, the new sample is compared with the matching samples using their minhash arrays to approximate the Jaccard index between the new sample and the older, similar samples. This avoids having to compare the new sample to all samples in the database, comparing it instead only to those samples that are highly likely to have high Jaccard indices with it.

Building a Persistent Malware Similarity Search System

By now you’ve learned the pros and cons of using a variety of malware feature types to estimate shared code relationships between malware samples. You’ve also learned about the Jaccard index, similarity matrices, and the way in which minhash makes computing similarities between malware samples in even very large datasets tractable. With all this knowledge in hand, you understand all of the fundamental concepts necessary to build a scalable malware shared code search system.

Listings 5-7 through 5-12 show an example of a simple system in which I index malware samples based on their string features. In your own work, you should feel confident in modifying this system to use other malware features, or extending it to support more visualization features. Because the listing is long, I’ve broken it up and we’ll cover each subsection in turn.

To begin, Listing 5-7 imports the Python packages required for our program.

#!/usr/bin/python

import argparse
import os
import murmur
import shelve
import numpy as np
from listings_5_2_to_5_6 import *

NUM_MINHASHES = 256
SKETCH_RATIO = 8

Listing 5-7: Importing Python modules and declaring minhash-related constants

Here, I import packages like murmur and shelve, along with the code from Listings 5-2 through 5-6. murmur is a hashing library that we use to compute the minhash algorithm I just discussed. We use shelve, a simple database module included in the Python standard library, to store information about samples and their minhashes, which we use to compute similarities. We import listings_5_2_to_5_6.py to reuse its feature extraction and similarity functions.

We also declare two constants in Listing 5-7: NUM_MINHASHES and SKETCH_RATIO. These correspond to the number of minhashes and the ratio of minhashes to sketches we compute for each sample. Recall that the more minhashes and sketches we use, the more accurate our similarity computations. For example, 256 minhashes and a ratio of 8:1 (32 sketches) is enough to yield acceptable accuracy at a low computational cost.

Listing 5-8 implements database functionality that we use to initialize, access, and delete the shelve database we use to store malware sample information.

def wipe_database():
    """
    This program uses the Python standard library 'shelve' database to persist
    information, storing the database in the file 'samples.db' in the same
    directory as the actual Python script. 'wipe_database' deletes this file,
    effectively resetting the system.
    """
    dbpath = "/".join(__file__.split('/')[:-1] + ['samples.db'])
    os.system("rm -f {0}".format(dbpath))

def get_database():
    """
    Helper function to retrieve the 'shelve' database, which is a simple
    key-value store.
    """
    dbpath = "/".join(__file__.split('/')[:-1] + ['samples.db'])
    return shelve.open(dbpath, protocol=2, writeback=True)

Listing 5-8: Database helper functions

We define wipe_database() to delete our program’s database in case we want to wipe out stored sample information and start over. Then we define get_database() to open our database, creating it if it doesn’t yet exist, and return a database object that allows us to store and retrieve data about our malware samples.

Listing 5-9 implements a core piece of the code for our shared code analysis: minhash.

def minhash(features):
    """
    This is where the minhash magic happens, computing both the minhashes of
    a sample's features and the sketches of those minhashes. The number of
    minhashes and sketches computed is controlled by the NUM_MINHASHES and
    SKETCH_RATIO global variables declared at the top of the script.
    """
    minhashes = []
    sketches = []
    for i in range(NUM_MINHASHES):
        minhashes.append(
            min([murmur.string_hash(repr(feature), i) for feature in features])
        )
    for i in xrange(0, NUM_MINHASHES, SKETCH_RATIO):
        sketch = murmur.string_hash(repr(minhashes[i:i+SKETCH_RATIO]))
        sketches.append(sketch)
    return np.array(minhashes), sketches

Listing 5-9: Obtaining minhashes and sketches for a sample

We loop NUM_MINHASHES times, appending one minhash value per iteration. Each minhash value is computed by hashing all the features and then taking the minimum hash value. To perform this computation, we use the murmur package’s string_hash() function to hash the features, and then we take the minimum value of the list of hashes by calling Python’s min() function.

The second argument of string_hash() is a seed value, which causes the hash function to map to different hashes depending on the seed’s value. Because each minhash value requires a unique hash function so that all of our 256 minhash values aren’t identical, we seed the string_hash() function with our loop counter i, which causes the features to map to different hashes on each iteration.

Then, we loop over the minhashes we’ve computed and use them to compute sketches. Recall that sketches are hashes of multiple minhashes, which we use for database indexing of our malware samples so that we can quickly retrieve samples that are likely to be similar to one another by querying the database. In the second loop of Listing 5-9, we iterate over our sample’s minhashes with step size SKETCH_RATIO, using the murmur package’s string_hash() function to hash each chunk of minhashes together into a sketch.

Listing 5-10 uses get_database() from Listing 5-8, the getstrings() function from the listings_5_2_to_5_6 module we imported, and the minhash() function from Listing 5-9 to create a function that indexes samples into our system’s database.

def store_sample(path):
    """
    Function that stores a sample and its minhashes and sketches in the
    'shelve' database
    """
    db = get_database()
    features = getstrings(path)
    minhashes, sketches = minhash(features)
    for sketch in sketches:
        sketch = str(sketch)
        if not sketch in db:
            db[sketch] = set([path])
        else:
            obj = db[sketch]
            obj.add(path)
            db[sketch] = obj
        db[path] = {'minhashes': minhashes, 'comments': []}
        db.sync()

    print "Extracted {0} features from {1} ...".format(len(features), path)

Listing 5-10: Storing a sample’s minhashes in the shelve database by using its sketches as keys

We call get_database(), getstrings(), and minhash(), and then iterate over our sample’s sketches. Next, to index our samples in the database, we use a technique known as inverted indexing, which allows us to store samples based on their sketch values instead of an ID. More specifically, for each of a sample’s 32 sketch values, we look up that sketch’s record in the database and append our sample’s ID to the list of samples associated with that sketch. Here, we use a sample’s filesystem path as its ID.

You can see how this is implemented in the code: we loop over the sketches we’ve computed for a sample, create a record for the sketch if it doesn’t already exist (associating our sample with the sketch while we’re at it), and finally add the sample path to the sketch’s set of associated sample paths if the sketch’s record already exists.

Listing 5-11 shows the declaration of two important functions: comment_sample() and search_sample().

def comment_sample(path):
    """
    Function that allows a user to comment on a sample.  The comment the
    user provides shows up whenever this sample is seen in a list of similar
    samples to some new samples, allowing the user to reuse their
    knowledge about their malware database.
    """
    db = get_database()
    comment = raw_input("Enter your comment:")
    if not path in db:
        store_sample(path)
    comments = db[path]['comments']
    comments.append(comment)
    db[path]['comments'] = comments
    db.sync()
    print "Stored comment:", comment

def search_sample(path):
    """
    Function searches for samples similar to the sample provided by the
    'path' argument, listing their comments, filenames, and similarity values
    """
    db = get_database()
    features = getstrings(path)
    minhashes, sketches = minhash(features)
    neighbors = []

    for sketch in sketches:
        sketch = str(sketch)

        if not sketch in db:
            continue

        for neighbor_path in db[sketch]:
            neighbor_minhashes = db[neighbor_path]['minhashes']
            similarity = (neighbor_minhashes == minhashes).sum() / float(NUM_MINHASHES)
            neighbors.append((neighbor_path, similarity))

    neighbors = list(set(neighbors))
    neighbors.sort(key=lambda entry: entry[1], reverse=True)
    print ""
    print "Sample name".ljust(64), "Shared code estimate"
    for neighbor, similarity in neighbors:
        short_neighbor = neighbor.split("/")[-1]
        comments = db[neighbor]['comments']
        print str("[*] " + short_neighbor).ljust(64), similarity
        for comment in comments:
            print "\t[comment]", comment

Listing 5-11: Declaring functions that allow users to comment on samples and search for samples similar to a query sample

As expected, comment_sample() adds a user-defined comment record to a sample’s database record. This functionality is useful, because it allows users of the program to include insights gained from reverse-engineering a sample in the database such that when they see a new sample similar to samples they have comments for, they can leverage those comments to more rapidly understand the origins and purpose of the new sample.

Next, search_sample() leverages minhash to find samples similar to a query sample. To do this, first we extract string features, minhashes, and sketches from the query sample. Then, we iterate over the sample’s sketches, looking up samples stored in the database that also have that sketch. For each sample that shares a sketch with the query sample, we compute its approximate Jaccard index using its minhashes. Finally, we report the most similar samples to the query sample to the user, along with any comments associated with these samples that have been stored in the database.

Listing 5-12 concludes our program’s code by implementing the argument-parsing part of our program.

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="""
Simple code-sharing search system which allows you to build up
a database of malware samples (indexed by file paths) and
then search for similar samples given some new sample
"""
    )

    parser.add_argument(
        "-l", "--load", dest="load", default=None,
        help="Path to malware directory or file to store in database"
    )

    parser.add_argument(
        "-s", "--search", dest="search", default=None,
        help="Individual malware file to perform similarity search on"
    )

    parser.add_argument(
        "-c", "--comment", dest="comment", default=None,
        help="Comment on a malware sample path"
    )

    parser.add_argument(
        "-w", "--wipe", action="store_true", default=False,
        help="Wipe sample database"
    )

    args = parser.parse_args()
    if args.load:
        malware_paths = []  # where we'll store the malware file paths
        malware_features = dict()  # where we'll store the malware strings
        for root, dirs, paths in os.walk(args.load):
            # walk the target directory tree and store all of the file paths
            for path in paths:
                full_path = os.path.join(root,path)
                malware_paths.append(full_path)

        # filter out any paths that aren't PE files
        malware_paths = filter(pecheck, malware_paths)

        # get and store the strings for all of the malware PE files
        for path in malware_paths:
            store_sample(path)

    if args.search:
        search_sample(args.search)

    if args.comment:
        comment_sample(args.comment)
    if args.wipe:
        wipe_database()

Listing 5-12: Performing similarity database updates and queries based on user command line arguments

Here, we allow users to load malware samples into the database so that these samples will be compared with new malware samples when users search for similar samples in the database. Next, we allow users to search for samples similar to the sample the user has passed in, printing the results to the terminal. We also allow the user to comment on samples already in the database. Finally, we allow the user to wipe the existing database.

Running the Similarity Search System

Once you’ve implemented this code, you can run the similarity search system, which consists of four simple operations:

Load Loading the samples into the system stores them in the system database for future code-sharing searches. You can load samples individually or specify a directory, which the system will search recursively for PE files, loading them into the database. You can load samples into the database with the following command run in this chapter’s code directory:

python listings_5_7_to_5_12.py -l <path to directory or individual malware sample>

Comment Commenting on a sample is useful because it allows you to store knowledge about that sample. Also, when you see new samples similar to this sample, a similarity search over those samples will reveal the comments you made on the older, similar sample, thus speeding up your workflow. You can comment on a malware sample with the following command:

python listings_5_7_to_5_12.py -c <path to malware sample>

Search Given a single malware sample, searching identifies all similar samples in the database and prints them in descending order of similarity. Also, any comments you might have made on those samples are printed. You can search for malware samples similar to a given sample using the following command:

python listings_5_7_to_5_12.py -s <path to malware sample>

Wipe Wiping the database simply clears all records from the system database, which you can do with the following command:

python listings_5_7_to_5_12.py -w

Listing 5-13 shows what it looks like when we load the APT1 samples into the system.

mds@mds:~/malware_data_science/ch5/code$ python listings_5_7_to_5_12.py -l ../
data
Extracted 240 attributes from ../data/APT1_MALWARE_FAMILIES/WEBC2-YAHOO/WEBC2-
YAHOO_sample/WEBC2-YAHOO_sample_A8F259BB36E00D124963CFA9B86F502E ...
Extracted 272 attributes from ../data/APT1_MALWARE_FAMILIES/WEBC2-YAHOO/WEBC2-
YAHOO_sample/WEBC2-YAHOO_sample_0149B7BD7218AAB4E257D28469FDDB0D ...
Extracted 236 attributes from ../data/APT1_MALWARE_FAMILIES/WEBC2-YAHOO/WEBC2-
YAHOO_sample/WEBC2-YAHOO_sample_CC3A9A7B026BFE0E55FF219FD6AA7D94 ...
Extracted 272 attributes from ../data/APT1_MALWARE_FAMILIES/WEBC2-YAHOO/WEBC2-
YAHOO_sample/WEBC2-YAHOO_sample_1415EB8519D13328091CC5C76A624E3D ...
Extracted 236 attributes from ../data/APT1_MALWARE_FAMILIES/WEBC2-YAHOO/WEBC2-
YAHOO_sample/WEBC2-YAHOO_sample_7A670D13D4D014169C4080328B8FEB86 ...
Extracted 243 attributes from ../data/APT1_MALWARE_FAMILIES/WEBC2-YAHOO/WEBC2-
YAHOO_sample/WEBC2-YAHOO_sample_37DDD3D72EAD03C7518F5D47650C8572 ...
--snip--

Listing 5-13: Sample output from loading data into the similarity search system implemented in this chapter

And Listing 5-14 shows what it looks like when we perform a similarity search.

mds@mds:~/malware_data_science/ch5/code$ python listings_5_7_to_5_12.py -s \
../data/APT1_MALWARE_FAMILIES/GREENCAT/GREENCAT_sample/GREENCAT_sample_AB20\
8F0B517BA9850F1551C9555B5313
Sample name                                                      Shared code estimate
[*] GREENCAT_sample_5AEAA53340A281074FCB539967438E3F             1.0
[*] GREENCAT_sample_1F92FF8711716CA795FBD81C477E45F5             1.0
[*] GREENCAT_sample_3E69945E5865CCC861F69B24BC1166B6             1.0
[*] GREENCAT_sample_AB208F0B517BA9850F1551C9555B5313             1.0
[*] GREENCAT_sample_3E6ED3EE47BCE9946E2541332CB34C69             0.99609375
[*] GREENCAT_sample_C044715C2626AB515F6C85A21C47C7DD             0.6796875
[*] GREENCAT_sample_871CC547FEB9DBEC0285321068E392B8             0.62109375
[*] GREENCAT_sample_57E79F7DF13C0CB01910D0C688FCD296             0.62109375

Listing 5-14: Sample output from the similarity search system implemented in this chapter

Note that our system correctly determines that the query sample (a “greencat” sample) shares code with other greencat samples. If we didn’t have the luxury of already knowing these samples were members of the greencat family, our system would have just saved us a ton of reverse engineering work.

This similarity search system is only a small example of what would be implemented in a production similarity search system. But you should have no problem using what you’ve learned so far to add visualization capabilities to the system or to extend it to support multiple similarity search methods.

Summary

In this chapter, you learned how to identify shared code relationships between malware samples, compute code sharing similarity over thousands of malware samples to identify new malware families, determine a new malware sample’s code similarity to thousands of previously seen malware samples, and visualize malware relationships to understand patterns of code sharing.

You should now feel comfortable adding shared code analysis to your malware analysis toolbox, which will allow you to gain fast intelligence over large volumes of malware and accelerate your malware analysis workflow.

In Chapters 6, 7, and 8, you’ll learn to build machine learning systems for detecting malware. Combining these detection techniques with what you’ve already learned will help you catch advanced malware that other tools miss, as well as analyze its relationships to other known malware to gain clues about who deployed the malware and what their goals are.