Revisiting the main() function

This main() function closely resembles the one in the prior script, with a few extra lines to support the functionality we've added. It starts the same way, by checking that the output type is a valid format, as shown on lines 56 through 62. We then add another conditional on line 63 that prints the CSV header row, since this output is more complicated than in the last iteration:

047 def main(known_file, comparison, output_type):
048     """
049     The main function handles the main operations of the script
050     :param known_file: path to known file
051     :param comparison: path to look for similar files
052     :param output_type: type of output to provide
053     :return: None
054     """
055 
056     # Check output formats
057     if output_type not in OUTPUT_OPTS:
058         logger.error(
059             "Unsupported output format '{}' selected. Please "
060             "use one of {}".format(
061                 output_type, ", ".join(OUTPUT_OPTS)))
062         sys.exit(2)
063     elif output_type == 'csv':
064         # Special handling for CSV headers
065         print('"similarity","known_file","known_hash",'
066               '"comp_file","comp_hash"')
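As an aside, the same header row could be produced with Python's built-in csv module rather than a hand-quoted string, which keeps the quoting consistent if the data rows are later written the same way. The following is only a minimal sketch: the CSV_FIELDS list and the csv_header() helper are hypothetical names introduced for illustration and aren't part of the script:

```python
import csv
import io

# Hypothetical constant: column names copied from the header printed
# on lines 65 and 66 of the script.
CSV_FIELDS = ["similarity", "known_file", "known_hash",
              "comp_file", "comp_hash"]


def csv_header(fields=CSV_FIELDS):
    """Return the header row with every field wrapped in double quotes."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(fields)
    # Drop the trailing newline so print() can add its own
    return buf.getvalue().rstrip("\n")


print(csv_header())
```

The printed line matches the header emitted by the script, so either approach could be used interchangeably.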

Now that we've handled the output format validation, let's pivot to our files for comparison. To start, we'll get the absolute path for both our known file and comparison path, for consistency with our prior script. Then, on line 73, we check that our known file exists. If it does, we calculate its ssdeep hash on line 78. This calculation is handled entirely by ssdeep; all we need to do is provide a valid file path to the hash_from_file() method. This method returns a string containing our ssdeep hash, the same product as the fuzz_file() function in our prior script. The big difference here is speed, thanks to the efficient C code running inside the ssdeep module:

068     # Check provided file paths
069     known_file = os.path.abspath(known_file)
070     comparison = os.path.abspath(comparison)
071
072     # Generate ssdeep signature for known file
073     if not os.path.exists(known_file):
074         logger.error("Error - path {} not found".format(
075             known_file))
076         sys.exit(1)
077
078     known_hash = ssdeep.hash_from_file(known_file)

Now that we have our known hash value, we can evaluate our comparison path. If this path is a directory, as shown on line 81, we'll walk through the folder and its subfolders looking for files to process. On line 86, we generate a hash of the comparison file just as we did for the known file. The next line introduces the compare() method, which accepts two hashes for evaluation. This method returns an integer from 0 to 100, inclusive, representing the confidence that the two files have similar content. We then take all of our parts, including the filenames, hashes, and resulting similarity, and provide them to our output function along with our formatting specification. This logic continues until we've recursively processed all of our files:

080     # Generate and test ssdeep signature for comparison file(s)
081     if os.path.isdir(comparison):
082         # Process files in folders
083         for root, _, files in os.walk(comparison):
084             for f in files:
085                 file_entry = os.path.join(root, f)
086                 comp_hash = ssdeep.hash_from_file(file_entry)
087                 comp_val = ssdeep.compare(known_hash, comp_hash)
088                 output(known_file, known_hash,
089                        file_entry, comp_hash,
090                        comp_val, output_type)
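One thing to keep in mind with the loop above is that a single unreadable file could interrupt the whole walk if the hashing call raises an exception. The following is a minimal sketch of a more defensive version, not the script's actual code. The compare_tree() helper is a hypothetical name, and the hashing and comparison functions are passed in as arguments so the example runs without ssdeep installed. It also assumes that a failed read surfaces as an OSError:

```python
import os


def compare_tree(known_hash, folder, hash_func, compare_func):
    """Yield (path, hash, score) for each readable file under folder.

    hash_func takes a file path and returns a signature string;
    compare_func takes two signatures and returns a similarity score.
    With the real library, these would be ssdeep.hash_from_file and
    ssdeep.compare.
    """
    for root, _, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            try:
                comp_hash = hash_func(path)
            except OSError:
                # Assumption: unreadable files raise OSError; skip them
                # and keep walking rather than aborting the run.
                continue
            yield path, comp_hash, compare_func(known_hash, comp_hash)
```

Each yielded tuple carries the same values the script hands to output(), so the caller could loop over compare_tree() and pass the results along unchanged.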

Our next conditional handles the same operations, but for a single file. As you can see, it uses the same hash_from_file() and compare() functions as the directory operation. Once all of our values are assigned, we pass them to our output() function in the same manner. Our final conditional handles the case where the comparison path is neither a directory nor a file, notifying the user and exiting:

092     elif os.path.isfile(comparison):
093         # Process a single file
094         comp_hash = ssdeep.hash_from_file(comparison)
095         comp_val = ssdeep.compare(known_hash, comp_hash)
096         output(known_file, known_hash, comparison, comp_hash,
097                comp_val, output_type)
098     else:
099         logger.error("Error - path {} not found".format(
100             comparison))
101         sys.exit(1)
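
Because the directory and single-file branches repeat the same hash, compare, and output steps, one option is to separate the question of which files to process from what to do with each one. The following is a minimal sketch under that assumption. The iter_comparison_files() helper is a hypothetical name, and it raises FileNotFoundError where the script logs an error and exits:

```python
import os


def iter_comparison_files(comparison):
    """Yield every file path that the comparison argument refers to."""
    comparison = os.path.abspath(comparison)
    if os.path.isdir(comparison):
        # Walk the folder and its subfolders
        for root, _, files in os.walk(comparison):
            for name in files:
                yield os.path.join(root, name)
    elif os.path.isfile(comparison):
        # A single file is simply a one-item sequence
        yield comparison
    else:
        # Neither a directory nor a file
        raise FileNotFoundError(
            "path {} not found".format(comparison))
```

With a helper like this, main() would need only one loop that hashes, compares, and outputs each yielded path. Note that, as a generator, it raises the error only once iteration begins.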