Generating our rolling hash

The following code block is our rolling hash function. Now, it may seem odd to have a function within a function, though this design has a few advantages. First, it's useful for organization. This rolling hash code block is only used by our fuzz_file() function and, by nesting it inside this function, we can inform the next person who reads our code that this is the case. Secondly, by placing this function within fuzz_file(), we can assure anyone who imports our code as a module doesn't misuse the rolling hash function. And while there are multiple other efficiencies and management reasons for selecting this design, we wanted to incorporate this feature into this script to introduce you to the concept. As you see in our other scripts, this isn't always used for specialized functions but is a tool that you can employ in your scripts to refine their design.

This nested function takes two arguments, shortened to nb for new_byte and rh for our rolling hash tracking dictionary. In our prior example, to calculate the rolling hash, we added the ASCII values of the entire window together. In this function, we'll perform a series of calculations to help us generate a rolling hash of a larger 7-byte window:

095     def update_rolling_hash(nb, rh):
096 """
097 Update the rolling hash value with the new byte
098 :param nb (int): new_byte as read from file
099 :param rh (dict): rolling hash tracking dictionary
100 :return: computed hash value to compare to reset_point
101 """

The rh rolling hash tracking dictionary is used to keep an eye on the moving parts within this rolling hash. There are three numbers that are stored as r1, r2, and r3. These numbers face additional calculations, as shown in the following code block, and the sum of the three are returned as the integer representing the rolling hash for that frame of the file.

The other two elements tracked by the dictionary are rn and rw. The rn key holds the offset the rolling hash is at within the file and is used to determine what character in the window is replaced by the nb, new_byte, value. This window, as you may have guessed, is stored in rw. Unlike our prior example where each character in the window was shifted left for each calculation of the rolling hash, this implementation just replaces the oldest character in the array. This improves efficiency as it results in one operation instead of eight:

102         # Calculate R2
103 rh['r2'] -= rh['r1']
104 rh['r2'] += (CONTEXT_WINDOW * nb)
105
106 # Calculate R1
107 rh['r1'] += nb
108 rh['r1'] -= rh['rw'][rh['rn'] % CONTEXT_WINDOW]
109
110 # Update RW and RN
111 rh['rw'][rh['rn'] % CONTEXT_WINDOW] = nb
112 rh['rn'] += 1
113
114 # Calculate R3
115 rh['r3'] = (rh['r3'] << 5) & 0xFFFFFFFF
116 rh['r3'] = rh['r3'] ^ nb
117
118 # Return the sum of R1 + R2 + R3
119 return rh['r1'] + rh['r2'] + rh['r3']

This logic is computationally the same as that used by ssdeep and spamsum. To start, we compute the r2 value by subtracting r1 and adding the product of CONTEXT_WINDOW and new_byte. We then update the r1 value by adding new_byte and subtracting the oldest byte within the window. This means that r1 stores the sum of the entire window, similarly to our entire rolling hash algorithm in the earlier example.

On line 111, we start updating our window, replacing the oldest byte with our new_byte character. After this, we increment the rn value so that it accurately tracks the offset within the file.

Finally, we calculate our r3 value, which uses some operations we haven't introduced at this point. The << operator is a bitwise operator that shifts our value to the left, in this case by five places. This is effectively the same as us multiplying our value by 2**5. The second new bitwise operator on line 115 is &,which in Python is a bitwise AND statement. This operator evaluates each bit for the values on either side of the operation, position by position, and if they're both equal to 1, they're enabled in that position of the output; otherwise, they're disabled. As a note, two 0 values in the same position do not result in 1 when using a bitwise AND statement. The third new bitwise operator is on line 116 and is ^, or the exclusive OR operator, also called an XOR operation. This works mostly as the opposite of our bitwise AND statement, where if the bits between the two values, position by position, are different, 1 is returned for that position and if they're the same, 0 is returned. 

More information on bitwise operations in Python is available at https://wiki.python.org/moin/BitwiseOperators.

With our bitwise operations out of the way, we return the sum of r1, r2, and r3 for further use in our fuzzy hash calculation.