Fuzzy Hashing

Hashing is one of the most common processes run in DFIR. It allows us to summarize a file's content as a representative, repeatable signature. We generally compute file and content hashes using algorithms such as MD5, SHA1, and SHA256. These algorithms are valuable for integrity validation because a change to even one byte of a file's content will completely alter the resulting hash value. Hashes are also commonly used to build whitelists that exclude known or irrelevant content, or alert lists that quickly identify known files of interest. In some cases, though, we need to identify near matches, something that MD5, SHA1, and SHA256 can't handle on their own.
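To make the one-byte sensitivity concrete, the following is a minimal sketch using Python's standard hashlib module. The sample inputs and the helper name hex_digests are illustrative assumptions, not code from later in this chapter.

```python
from __future__ import print_function  # keeps print() consistent on Python 2.7
import hashlib

# Two illustrative inputs that differ by a single byte ("dog" vs. "cog").
original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"


def hex_digests(data):
    """Return the MD5, SHA1, and SHA256 hex digests of a bytes value."""
    return {
        name: hashlib.new(name, data).hexdigest()
        for name in ("md5", "sha1", "sha256")
    }


original_hashes = hex_digests(original)
modified_hashes = hex_digests(modified)

for name in ("md5", "sha1", "sha256"):
    print(name)
    print("  original:", original_hashes[name])
    print("  modified:", modified_hashes[name])
```

Comparing the printed pairs shows that the digests share no obvious relationship even though the inputs are nearly identical, which is why these algorithms cannot tell us that two files are similar.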

One of the most common utilities that assists with similarity analysis is ssdeep, developed by Jesse Kornblum. This tool is an implementation of the spamsum algorithm, developed by Dr. Andrew Tridgell, which generates a signature of base64 characters representing file content. These signatures can be used, independently of the file's content, to help determine the confidence that two files are similar. Comparing signatures is less computationally intensive than comparing the files themselves, and the signatures are relatively short, so they can be shared or stored easily.

In this chapter, we'll do the following:

Hash data using MD5, SHA1, and SHA256 algorithms with Python
Discuss how to hash streams of data, files, and directories of files
Explore how the spamsum algorithm works and implement a version in Python
Leverage the compiled ssdeep library via Python bindings for increased performance and features
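As a brief preview of hashing files rather than in-memory values, here is a minimal sketch that reads a file in fixed-size chunks so that large evidence files do not need to fit in memory. The function name hash_file, its parameters, and the 1 MB chunk size are assumptions made for illustration. The temporary file exists only to make the example self-contained.

```python
from __future__ import print_function  # keeps print() consistent on Python 2.7
import hashlib
import os
import tempfile


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file's content chunk by chunk and return the hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demonstration on a throwaway file (hypothetical sample data).
sample_data = b"sample evidence content\n" * 1000
fd, sample_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(sample_data)
    # A tiny chunk size forces multiple read() calls for the demonstration.
    file_digest = hash_file(sample_path, "sha256", chunk_size=4096)
    print(file_digest)
finally:
    os.remove(sample_path)
```

Feeding the hasher incrementally produces the same digest as hashing the whole content at once, so the chunk size affects only memory use and speed, not the result.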

The code for this chapter was developed and tested using Python 2.7.15 and Python 3.7.1.