Additional challenges

You've created a script that implements the spamsum algorithm to generate ssdeep-compatible hashes! With that in place, there are a few additional challenges to pursue.

First, we're providing six sample files, found in the previously mentioned test_data/ directory. These files let you confirm that you're generating the same hashes as those shown previously and allow you to perform some additional testing. The file_1, file_2, and file_3 files are our originals, whereas the instances with an appended a are modified versions of those originals. The accompanying README.md file describes the alterations we performed to each one.
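As a quick way to check your output, a short loop like the one below prints a fuzzy hash for every file in the directory. This is a minimal sketch that assumes the python-ssdeep bindings (pip install ssdeep) are installed and that test_data/ sits in the current working directory; you could just as easily substitute a call to your own script.

```python
from pathlib import Path

import ssdeep  # python-ssdeep bindings; an assumption, not part of the original script

# Hash every sample file in test_data/ so the output can be compared
# against the hashes shown previously
for sample in sorted(Path("test_data").iterdir()):
    if sample.is_file():
        print(f"{ssdeep.hash_from_file(str(sample))}  {sample.name}")
```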

We encourage you to perform additional testing to learn how ssdeep responds to different types of alterations. Feel free to further alter the original files and share your findings with the community!
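One rough way to experiment, again assuming the python-ssdeep bindings, is to copy an original, overwrite a few bytes, and compare the fuzzy hashes of the two versions. The file path and the offsets altered below are purely illustrative.

```python
from pathlib import Path

import ssdeep  # assumes the python-ssdeep bindings are installed

original = Path("test_data/file_1")  # illustrative path
data = bytearray(original.read_bytes())
data[100:110] = b"X" * 10            # overwrite ten bytes as a sample alteration

original_hash = ssdeep.hash_from_file(str(original))
altered_hash = ssdeep.hash(bytes(data))

# compare() returns a similarity score from 0 (no match) to 100 (identical)
print(original_hash)
print(altered_hash)
print("Similarity:", ssdeep.compare(original_hash, altered_hash))
```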

Another challenge is to study the ssdeep or spamsum source code to learn how it handles the comparison component, with the goal of adding that capability to the first script.
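If you take on that challenge, the library's compare() function can serve as a reference to validate your own implementation against. The my_compare() function in this sketch is a hypothetical placeholder for the comparison routine you would write after studying the ssdeep or spamsum source; it is not part of any existing library.

```python
import ssdeep  # reference scores come from the python-ssdeep bindings (an assumption)


def my_compare(hash_a: str, hash_b: str) -> int:
    """Placeholder for your own spamsum-style comparison routine."""
    raise NotImplementedError("Implement after studying the ssdeep/spamsum source")


hash_a = ssdeep.hash(b"The quick brown fox jumps over the lazy dog " * 50)
hash_b = ssdeep.hash(b"The quick brown fox jumps over the lazy cat " * 50)

reference_score = ssdeep.compare(hash_a, hash_b)
print("Reference score:", reference_score)

# Once my_compare() is written, its output can be checked against the library:
# assert my_compare(hash_a, hash_b) == reference_score
```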

We can also explore developing code that extracts the content of, for example, Word documents and generates ssdeep hashes of the document's content instead of the binary file. This can be applied to other file types and doesn't have to be limited to text content. For example, if we discover that an executable is packed, we may also want to generate a fuzzy hash of the unpacked byte content.
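A minimal sketch of that idea for .docx files, which are ZIP archives containing XML, might look like the following. It reads the main document XML, strips the markup with a rough regular expression, and hashes the remaining text. The file name is illustrative, and the python-ssdeep bindings are again assumed; a more robust implementation might use python-docx or lxml to extract the text.

```python
import re
import zipfile

import ssdeep  # assumes the python-ssdeep bindings are installed


def hash_docx_text(path: str) -> str:
    """Fuzzy-hash the textual content of a .docx rather than the raw file bytes."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")  # main document body
    # Crude markup removal; good enough for a quick experiment
    text = re.sub(rb"<[^>]+>", b" ", xml_bytes)
    return ssdeep.hash(text)


print(hash_docx_text("sample.docx"))  # illustrative file name
```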

Lastly, there are other similarity analysis utilities out there. To name one, the sdhash utility takes a different approach to identifying similarities between two files. We recommend you spend some time with this utility, running it against both your own files and the provided test data to see how it performs with different types of modifications and alterations. More information on sdhash is available at http://roussev.net/sdhash/sdhash.html.