Using ssdeep in Python – ssdeep_python.py

This script was tested with both Python 2.7.15 and 3.7.1, and requires version 3.3 of the third-party ssdeep library.

As you may have noticed, the prior implementation is almost prohibitively slow. In situations like this, it's best to hand the work to a compiled language, such as C, that can perform the operation much faster. Luckily for us, spamsum was originally written in C and then further expanded by the ssdeep project, also in C. One of the expansions the ssdeep project provides is a set of Python bindings. These bindings let us keep our familiar Python function calls while offloading the heavy calculations to compiled C code. Our next script uses the ssdeep library in a Python module to produce the same signatures and handle comparison operations.

In this second example of fuzzy hashing, we're going to implement a similar script using the ssdeep Python library. This allows us to leverage the ssdeep tool and the spamsum algorithm, which have been widely used and accepted in the fields of digital forensics and information security. In most scenarios, this will be the preferred method for fuzzy hashing, as it's more efficient with resources and produces more accurate results. The tool is widely supported in the community, and many ssdeep signatures are available online. For example, the VirusShare and VirusTotal websites host ssdeep hashes. This public information can be used to check whether executable files on a host machine match, or are similar to, known malicious files, without the need to download the malicious files themselves.
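As an illustration of how published signatures might be put to use, the sketch below scores one signature against a collection of known ones and keeps those at or above a threshold. Everything here is an assumption made for illustration: the function name, the threshold of 50, and the shape of the known_signatures dictionary are hypothetical, and the comparison function is passed in so the sketch doesn't depend on any particular library. With the ssdeep bindings installed, ssdeep.compare could be supplied as that function.

```python
def find_similar(signature, known_signatures, compare, threshold=50):
    """Return (label, score) pairs for known signatures scoring >= threshold.

    signature        -- fuzzy hash signature of the local file
    known_signatures -- dict mapping a label to a published signature
    compare          -- callable returning a 0-100 score, e.g. ssdeep.compare
    threshold        -- hypothetical cut-off; tune it for your own data
    """
    matches = []
    for label, known_sig in known_signatures.items():
        score = compare(signature, known_sig)
        if score >= threshold:
            matches.append((label, score))
    # Report the strongest matches first.
    matches.sort(key=lambda item: item[1], reverse=True)
    return matches
```

Passing the comparison function as an argument keeps the lookup logic separate from the hashing library, which also makes it easy to exercise with a stand-in function.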

One weakness of ssdeep is that it doesn't provide information beyond the matching percentage and can't compare signatures generated with significantly different block sizes. This can be an issue, as ssdeep automatically selects the block size based on the size of the input file. This process allows ssdeep to run more efficiently and to scale much better than our script; however, it doesn't provide a way to specify a block size manually. We could take our prior script and hardcode our block size, though that introduces other (previously discussed) issues.
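The block size limitation is easier to see from the signature itself. An ssdeep signature has the form block_size:hash1:hash2, where the second hash is computed at twice the block size, so two signatures can only be scored when their block sizes are equal or differ by a factor of two. The following is a minimal sketch of that check in pure Python; the function names are hypothetical and aren't part of the ssdeep library.

```python
def block_size(signature):
    """Return the integer block size that prefixes an ssdeep signature."""
    # The block size is everything before the first colon.
    size, _, _ = signature.partition(":")
    return int(size)


def comparable(sig_a, sig_b):
    """Return True if the block sizes are equal or one is double the other."""
    a, b = block_size(sig_a), block_size(sig_b)
    return a == b or a == 2 * b or b == 2 * a
```

With made-up signatures, comparable("96:abc:def", "192:ghi:jkl") is True, while comparable("96:abc:def", "768:ghi:jkl") is False, which is the case where ssdeep reports no match regardless of the file content.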

This script starts the same as the other, with the addition of a new import for the ssdeep library. To install this library, run pip install ssdeep==3.3 or, if that fails, run BUILD_LIB=1 pip install ssdeep==3.3 as per the documentation at https://pypi.python.org/pypi/ssdeep. This library wasn't built by the developer of ssdeep, but by another member of the community who created the bindings Python needs to communicate with the C-based library. Once installed, it can be imported as seen on line 7:

001 """Example script that uses the ssdeep python bindings."""
002 import argparse
003 import logging
004 import os
005 import sys
006
007 import ssdeep
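For orientation, the main calls this library exposes can be exercised in a few lines. The following is a minimal sketch rather than the script's own code: ssdeep.hash_from_file() and ssdeep.compare() are provided by the Python bindings, while the helper name similarity() is hypothetical. The import is wrapped in a try block only so that the snippet can still be loaded on a machine where the library isn't installed.

```python
"""Minimal sketch of the ssdeep Python bindings (not the chapter's script)."""
try:
    import ssdeep  # third-party: pip install ssdeep
except ImportError:  # keep the sketch importable without the library
    ssdeep = None


def similarity(known_path, comparison_path):
    """Return the ssdeep match score (0-100) between two files.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if ssdeep is None:
        raise RuntimeError("The ssdeep library is not installed")
    # Each call reads a file and returns a signature string of the
    # form 'block_size:hash1:hash2'.
    known_sig = ssdeep.hash_from_file(known_path)
    comparison_sig = ssdeep.hash_from_file(comparison_path)
    # compare() returns an integer from 0 (no match) to 100.
    return ssdeep.compare(known_sig, comparison_sig)
```

The hashing and comparison both run inside the compiled library, so the Python side is reduced to passing paths in and reading a score back.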

This iteration has a similar structure to our previous one, though we hand off all of our calculations to the ssdeep library. Though we may be missing our hashing and comparison functions, we're still using our main and output functions in a very similar manner:

047 def main():
...
104 def output(): 

Our program flow also remains similar to the prior iteration, minus the internal hashing function we developed there. As seen in the flow diagram, we still call the output() function from the main() function:

Our argument parsing and logging configurations are nearly identical to the prior script. The major difference is that we've introduced one new file path argument and renamed the argument that accepted files or folders. On line 134, we once more create the argparse object to handle our two positional arguments and our two optional flags for output format and logging. The remainder of this code block is consistent with the prior script, with the exception of renaming our log files:

134 if __name__ == '__main__':
135     parser = argparse.ArgumentParser(
136         description=__description__,
137         epilog='Built by {}. Version {}'.format(
138             ", ".join(__authors__), __date__),
139         formatter_class=argparse.ArgumentDefaultsHelpFormatter
140     )
141     parser.add_argument('KNOWN',
142         help='Path to known file to use to compare')
143     parser.add_argument('COMPARISON',
144         help='Path to file or directory to compare to known. '
145         'Will recurse through all sub directories')