Moving back into our fuzz_file() function, we evaluate the provided file to see whether it has any content, and if so, open the file. We store this file size for later use:
122 fsize = os.stat(file_path).st_size
123 if fsize == 0:
124 logger.warning("File is 0-bytes. Skipping...")
125 return ""
126 open_file = open(file_path, 'rb')
We now start with our first factor in the hashing algorithm, the reset point. This value is noted as the first value in a signature, as it's used to determine what hashes can be compared. To calculate this number, we start with 3, as selected in the spamsum algorithm as a minimum reset point. We then double the reset point, as shown on line 130, until it's larger than the filesize / 64:
129 reset_point = 3
130 while reset_point * 64 < fsize:
131 reset_point *= 2
Once we have our initial reset point, we read our file into memory as bytearray since we want to read each character as a byte that we can interpret. We then set up our while loop, which we'll use to adjust the reset_point size if we need to—more on that later on:
134 complete_file = bytearray(open_file.read())
135 done = False
136 while not done:
Once within our while loop, we'll initiate our hashing objects. The first object is rolling_hash, a dictionary with five keys. The r1, r2, and r3 keys are used to compute the hash; the rn key tracks the position of the cursor in the file; the rw key holds a list the size of the CONTEXT_WINDOW constant. This is the dictionary that's referenced heavily in our update_rolling_hash() function. It may be helpful to re-read that section now that you've seen what the rolling_hash dictionary looks like.
Following this dictionary, we have trad_hash1 and trad_hash2 initialized with the HASH_INIT constant. Lastly, we initialize the two signatures, sig1 and sig2. The variable trad_hash1 is used to populate the sig1 value, and similarly, trad_hash2 is used to populate the sig2 value. We'll show how we calculate these traditional hashes and update our signatures shortly:
138 rolling_hash = {
139 'r1': 0,
140 'r2': 0,
141 'r3': 0,
142 'rn': 0,
143 'rw': [0 for _ in range(CONTEXT_WINDOW)]
144 }
145 trad_hash1 = HASH_INIT
146 trad_hash2 = HASH_INIT
147 sig1 = ""
148 sig2 = ""
Once we've initialized our hash values, we can start iterating through the file as seen on line 151. On line 153, we calculate the rolling hash using the latest byte from the file and the rolling_hash dictionary. Remember that dictionaries can be passed into a function and updated, and can retain their updated values outside of the function without needing to be returned. This allows a simpler interface with our rolling hash function. This function simply returns the calculated rolling hash, which is in the form of an integer as previously discussed. This rolling hash allows us to hash a moving (or rolling) window of data through a byte stream and is used to identify when in our file we should add a character to our signature:
151 for new_byte in complete_file:
152 # Calculate our rolling hash
153 rh = update_rolling_hash(new_byte, rolling_hash)
After calculating the rolling hash value, we need to update our traditional hashes. These hashes use the Fowler–Noll–Vo (FNV) hash, where we multiply the former value of the hash against the fixed prime, defined as one of our constants, before being XOR'd (^ as previously discussed) against the new byte of data. Unlike the rolling hash, these hash values continue to increment with each new byte and grow in size until we reach one of our boundaries:
156 trad_hash1 = (trad_hash1 * FNV_PRIME) ^ new_byte
157 trad_hash2 = (trad_hash2 * FNV_PRIME) ^ new_byte
These boundaries are evaluated by two conditionals, one for each of our hash/signature pairs. Lines 161 through 164 are functionally equivalent to lines 165 through 168, with the exception of which traditional hash and signature is in use. For simplicity, we'll walk through the first.
On lines 161 and 162 (due to line wrapping), we have our first conditional statement, which evaluates whether the product of our rolling hash modulo reset_point, is equal to reset_point - 1. We also ensure that our overall signature length is less than the maximum signature length minus 1. If these conditions are met, we've reached a boundary and will convert our traditional hash into a character of our signature, as shown on line 163. After adding a character to our signature, we then reset our traditional hash back to the initial value, meaning the next block of data will have a hash value starting from the same point as the prior block.
As mentioned, this is repeated for the second signature, with the notable exception that the second signature is modifying reset_point (by multiplying it by two) and the maximum signature length (by dividing it by two). This second reset point was added to address the desire for the spamsum signature to be short—64 characters by default. This means that the primary signature may be cut off and the tail of the file may represent one character of the signature. To combat this, spamsum added the second signature to generate a value that represents more, if not all, of the file. This second signature effectively has a reset_point twice as large as the first signature:
159 # Check if our rolling hash reaches a reset point
160 # If so, update sig and reset trad_hash
161 if (rh % reset_point == reset_point - 1
162 and len(sig1) < SIGNATURE_LEN - 1):
163 sig1 += ALPHABET[trad_hash1 % 64]
164 trad_hash1 = HASH_INIT
165 if (rh % (reset_point * 2) == (reset_point * 2) - 1
166 and len(sig2) < (SIGNATURE_LEN / 2) - 1):
167 sig2 += ALPHABET[trad_hash2 % 64]
168 trad_hash2 = HASH_INIT
This is the end of our for loop; this logic will repeat until we've reached the end of the file, though the signatures will only grow to 63 and 31 characters in length, respectively. After our for loop exists, we evaluate whether we should start the while loop (beginning on line 136) over again. We would want to do this if our first signature was less than 32 characters and our reset_point wasn't the default value of 3. If we have too short a signature, we halve our reset_point value and re-run our entire calculation again. This means that we need every efficiency possible within this while loop, as we could be re-processing content over and over:
170 # If sig1 is too short, change block size and recalculate
171 if len(sig1) < SIGNATURE_LEN / 2 and reset_point > 3:
172 reset_point = reset_point // 2
173 logger.debug("Shortening block size to {}".format(
174 reset_point))
175 else:
176 done = True
If our signature length is greater than 32 characters, we exit our while loop and generate the last character for our signature. If the product of our rolling hash isn't equal to zero, we add the last character to each signature, as shown on lines 180 and 181:
178 # Add any values from the tail to our hash
179 if rh != 0:
180 sig1 += ALPHABET[trad_hash1 % 64]
181 sig2 += ALPHABET[trad_hash2 % 64]
182
183 # Close the file and return our new signature
184 open_file.close()
185 return "{}:{}:{}".format(reset_point, sig1, sig2)
At this point, we can close the file and return our full spamsum/ssdeep signature. This signature has three, hopefully recognizable, parts:
- Our reset_point value
- The primary signature
- The secondary signature