This script was tested with both Python versions 2.7.15 and 3.7.1 and doesn't leverage any third-party libraries.
While we'll get to the internals of the fuzzy hashing algorithm, let's start our script as we have the others. We begin with our imports, all standard libraries that we've used before as shown in the following. We also define a set of constants on lines 36 through 47. Lines 37 and 38 define our signature alphabet, in this case all of the base64 characters. The next set of constants are used in the spamsum algorithm to generate the hash. CONTEXT_WINDOW defines the amount of the file we'll read for our rolling hash. FNV_PRIME is used to calculate the hash while HASH_INIT sets a starting value for our hash. We then have SIGNATURE_LEN, which defines how long our fuzzy hash signature should be. Lastly, the OUTPUT_OPTS list is used with our argument parsing to show supported output formats—more on that later:
001 """Spamsum hash generator."""
002 import argparse
003 import logging
004 import json
005 import os
006 import sys
007
008 """ The original spamsum algorithm carries the following license:
009 Copyright (C) 2002 Andrew Tridgell <tridge@samba.org>
010
011 This program is free software; you can redistribute it and/or
012 modify it under the terms of the GNU General Public License
013 as published by the Free Software Foundation; either version 2
014 of the License, or (at your option) any later version.
015
016 This program is distributed in the hope that it will be useful,
017 but WITHOUT ANY WARRANTY; without even the implied warranty of
018 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
019 GNU General Public License for more details.
020
021 You should have received a copy of the GNU General Public License
022 along with this program; if not, write to the Free Software
023 Foundation, Inc.,
024 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
025
026 CHANGELOG:
027 Implemented in Python as shown below by Chapin Bryce &
028 Preston Miller
029 """
030
031 __authors__ = ["Chapin Bryce", "Preston Miller"]
032 __date__ = 20181027
033 __description__ = '''Generate file signatures using
034 the spamsum algorithm.'''
035
036 # Base64 Alphabet
037 ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
038 ALPHABET += 'abcdefghijklmnopqrstuvwxyz0123456789+/'
039
040 # Constants for use with signature calculation
041 CONTEXT_WINDOW = 7
042 FNV_PRIME = 0x01000193
043 HASH_INIT = 0x28021967
044 SIGNATURE_LEN = 64
045
046 # Argument handling constants
047 OUTPUT_OPTS = ['txt', 'json', 'csv']
048 logger = logging.getLogger(__file__)
This script has three functions: main(), fuzz_file(), and output(). The main() function acts as our primary controller, handling the processing of directories versus single files and calling the output() function to display the result of the hashing. The fuzz_file() function accepts a file path and generates a spamsum hash value. The output() function then takes the hash and filename and displays the values in the specified format:
051 def main(file_path, output_type):
...
087 def fuzz_file(file_path):
...
188 def output(sigval, filename, output_type='txt'):
The structure of our script is fairly straightforward, as emphasized by the following diagram. As illustrated by the dashed line, the fuzz_file() function is the only function that returns a value. This is true as our output() function displays content on the console instead of returning it to main():
Finally, our script ends with argument handling and log initiation. For command-line arguments, we're accepting a path to a file or folder to process and the format of our output. Our output will be written to the console, with current options for text, CSV, and JSON output types. Our logging parameter is standard and looks very similar to our other implementations, with the notable difference that we're writing the log messages to sys.stderr instead so that the user can still interact with the output generated by sys.stdout:
204 if __name__ == '__main__':
205 parser = argparse.ArgumentParser(
206 description=__description__,
207 epilog='Built by {}. Version {}'.format(
208 ", ".join(__authors__), __date__),
209 formatter_class=argparse.ArgumentDefaultsHelpFormatter
210 )
211 parser.add_argument('PATH',
212 help='Path to file or folder to generate hashes for. '
213 'Will run recursively.')
214 parser.add_argument('-o', '--output-type',
215 help='Format of output.', choices=OUTPUT_OPTS,
216 default="txt")
217 parser.add_argument('-l', help='specify log file path',
218 default="./")
219
220 args = parser.parse_args()
221
222 if args.l:
223 if not os.path.exists(args.l):
224 os.makedirs(args.l) # create log directory path
225 log_path = os.path.join(args.l, 'fuzzy_hasher.log')
226 else:
227 log_path = 'fuzzy_hasher.log'
228
229 logger.setLevel(logging.DEBUG)
230 msg_fmt = logging.Formatter("%(asctime)-15s %(funcName)-20s"
231 "%(levelname)-8s %(message)s")
232 strhndl = logging.StreamHandler(sys.stderr) # Set to stderr
233 strhndl.setFormatter(fmt=msg_fmt)
234 fhndl = logging.FileHandler(log_path, mode='a')
235 fhndl.setFormatter(fmt=msg_fmt)
236 logger.addHandler(strhndl)
237 logger.addHandler(fhndl)
238
239 logger.info('Starting Fuzzy Hasher v. {}'.format(__date__))
240 logger.debug('System ' + sys.platform)
241 logger.debug('Version ' + sys.version.replace("\n", " "))
242
243 logger.info('Script Starting')
244 main(args.PATH, args.output_type)
245 logger.info('Script Completed')
With this framework, let's explore how our main() function is implemented.