This main() function closely resembles the one in the prior script, with a few additional lines of code for the functionality we've added. It starts the same way, checking that the output type is a valid format, as shown on lines 56 through 62. A new conditional on line 63 then prints the CSV header row, since this output is more complicated than the last iteration:
047 def main(known_file, comparison, output_type):
048     """
049     The main function handles the main operations of the script
050     :param known_file: path to known file
051     :param comparison: path to look for similar files
052     :param output_type: type of output to provide
053     :return: None
054     """
055
056     # Check output formats
057     if output_type not in OUTPUT_OPTS:
058         logger.error(
059             "Unsupported output format '{}' selected. Please "
060             "use one of {}".format(
061                 output_type, ", ".join(OUTPUT_OPTS)))
062         sys.exit(2)
063     elif output_type == 'csv':
064         # Special handling for CSV headers
065         print('"similarity","known_file","known_hash",'
066               '"comp_file","comp_hash"')
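As a rough illustration of the same validation step, the sketch below builds the header with Python's csv module rather than a hand-quoted string. The OUTPUT_OPTS values and the check_output_type() name are assumptions made for this example, not taken from the script. Unlike main(), it raises an exception instead of calling sys.exit() so that it can be exercised on its own:

```python
import csv
import io
import sys

# Assumed values for illustration; the script defines its own OUTPUT_OPTS.
OUTPUT_OPTS = ['txt', 'json', 'csv']
CSV_HEADER = ['similarity', 'known_file', 'known_hash',
              'comp_file', 'comp_hash']


def check_output_type(output_type, stream=sys.stdout):
    """Validate the output type and write the CSV header if needed."""
    if output_type not in OUTPUT_OPTS:
        raise ValueError(
            "Unsupported output format '{}' selected. Please "
            "use one of {}".format(output_type, ", ".join(OUTPUT_OPTS)))
    if output_type == 'csv':
        # QUOTE_ALL wraps every field in double quotes, matching the
        # hand-written header string printed by main()
        csv.writer(stream, quoting=csv.QUOTE_ALL).writerow(CSV_HEADER)


# Capture the header in memory instead of printing it
buffer = io.StringIO()
check_output_type('csv', stream=buffer)
```

Letting the csv module handle the quoting keeps the header consistent with any rows that are later written through the same writer.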
Now that we've handled the output format validation, let's pivot to our files for comparison. To start, we get the absolute path for both our known file and comparison path, consistent with our prior script. Then, on line 73, we check that our known file exists. If it does, we calculate the ssdeep hash on line 78. This calculation is handled entirely by ssdeep; all we need to do is provide a valid file path to the hash_from_file() method. This method returns a string containing our ssdeep hash, the same product as the fuzz_file() function in our prior script. The big difference here is the speed gained from the efficient C code running in the ssdeep module:
068     # Check provided file paths
069     known_file = os.path.abspath(known_file)
070     comparison = os.path.abspath(comparison)
071
072     # Generate ssdeep signature for known file
073     if not os.path.exists(known_file):
074         logger.error("Error - path {} not found".format(
075             known_file))
076         sys.exit(1)
077
078     known_hash = ssdeep.hash_from_file(known_file)
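The existence check and hashing step can also be wrapped in a small helper. This is only a sketch: hash_known_file() and sha256_of_file() are hypothetical names, and hashlib's SHA-256 stands in for the hashing call so that the example runs without the ssdeep module installed. In the script, ssdeep.hash_from_file would be passed as hash_func. The sketch checks os.path.isfile() rather than os.path.exists(), on the assumption that a directory should be rejected as a known file:

```python
import hashlib
import os
import tempfile


def hash_known_file(path, hash_func):
    """Return hash_func(path), or raise if path is not a regular file."""
    path = os.path.abspath(path)
    if not os.path.isfile(path):
        # Report the path that actually failed the check
        raise FileNotFoundError("path {} not found".format(path))
    return hash_func(path)


def sha256_of_file(path):
    """Stand-in hasher; the script would pass ssdeep.hash_from_file."""
    with open(path, 'rb') as open_file:
        return hashlib.sha256(open_file.read()).hexdigest()


# Hash a throwaway file to show the helper in use
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = os.path.join(tmp_dir, 'known.txt')
    with open(sample, 'wb') as open_file:
        open_file.write(b'known content')
    known_hash = hash_known_file(sample, sha256_of_file)
```

Passing the hasher in as an argument keeps the path handling separate from the hashing library, which makes the check easy to test in isolation.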
Now that we have our known hash value, we can evaluate our comparison path. If this path is a directory, as shown on line 81, we walk through the folder and its subfolders looking for files to process. On line 86, we generate a hash of each comparison file, as we did for the known file. The next line introduces the compare() method, which takes two hashes for evaluation. This method returns an integer from 0 to 100, inclusive, representing the confidence that the two files have similar content. We then take all of our parts, including the filenames, hashes, and resulting similarity, and provide them to our output function along with our formatting specification. This logic continues until we've recursively processed all of our files:
080     # Generate and test ssdeep signature for comparison file(s)
081     if os.path.isdir(comparison):
082         # Process files in folders
083         for root, _, files in os.walk(comparison):
084             for f in files:
085                 file_entry = os.path.join(root, f)
086                 comp_hash = ssdeep.hash_from_file(file_entry)
087                 comp_val = ssdeep.compare(known_hash, comp_hash)
088                 output(known_file, known_hash,
089                        file_entry, comp_hash,
090                        comp_val, output_type)
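The directory walk above stops at the first file that cannot be read. One possible variation, sketched below, skips unreadable files and keeps going. The compare_tree(), read_text(), and exact_match() names are hypothetical, and the two stand-in functions only mimic the shape of the ssdeep calls (a path in and a string out, then two strings in and a score from 0 to 100 out); they do not perform fuzzy hashing:

```python
import os


def compare_tree(known_hash, folder, hash_func, compare_func):
    """Yield (file_path, comp_hash, score) for each readable file."""
    for root, _, files in os.walk(folder):
        for name in files:
            file_entry = os.path.join(root, name)
            try:
                comp_hash = hash_func(file_entry)
            except OSError as err:
                # Unreadable file: report it and move on to the next one
                print("Skipping {}: {}".format(file_entry, err))
                continue
            yield file_entry, comp_hash, compare_func(known_hash, comp_hash)


def read_text(path):
    """Stand-in for a hashing call: returns the file's text content."""
    with open(path) as open_file:
        return open_file.read()


def exact_match(first, second):
    """Stand-in for a comparison call: 100 if equal, otherwise 0."""
    return 100 if first == second else 0
```

In the script, ssdeep.hash_from_file and ssdeep.compare would take the place of the two stand-ins, and each yielded tuple carries the values that output() expects for a comparison file.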
Our next conditional handles the same operations, but for a single file. As you can see, it uses the same hash_from_file() and compare() functions as the directory operation. Once all of our values are assigned, we pass them in the same manner to our output() function. Our final conditional handles the case where the comparison path is neither a directory nor a file, notifying the user and exiting:
092     elif os.path.isfile(comparison):
093         # Process a single file
094         comp_hash = ssdeep.hash_from_file(comparison)
095         comp_val = ssdeep.compare(known_hash, comp_hash)
096         output(known_file, known_hash, comparison, comp_hash,
097                comp_val, output_type)
098     else:
099         logger.error("Error - path {} not found".format(
100             comparison))
101         sys.exit(1)
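Because the directory and single-file branches repeat the same hash, compare, and output steps, one alternative is to resolve the comparison path into a stream of file paths first and then run a single loop over it. The sketch below shows only the path resolution. The iter_targets() name is hypothetical, and it raises an exception where main() logs an error and exits:

```python
import os


def iter_targets(comparison):
    """Yield every file path that the comparison argument refers to."""
    comparison = os.path.abspath(comparison)
    if os.path.isdir(comparison):
        # Directory: yield each file in the folder and its subfolders
        for root, _, files in os.walk(comparison):
            for name in files:
                yield os.path.join(root, name)
    elif os.path.isfile(comparison):
        # Single file: yield the path itself
        yield comparison
    else:
        raise FileNotFoundError(
            "path {} not found".format(comparison))
```

With this shape, the loop body always works with one path variable, so the single-file case cannot pick up a name that is only assigned inside the directory loop. Since iter_targets() is a generator, the exception for a missing path is raised when iteration begins rather than when the function is called.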