Hashing large files – hashing

Our first script in this chapter is short and to the point; it allows us to hash a provided file's content with a specified cryptographic algorithm. This code will likely be more useful as a feature within a larger script, such as our file listing utility, but we'll walk through a standalone example to show how to hash files in a memory-efficient manner.

To begin, we need only two imports, argparse and hashlib. With these two built-in libraries, we can generate hashes, as shown in the prior example. On line 33, we list the supported hash algorithms. This list should contain only names that exist as constructor functions within hashlib, since we'll look up (for example) md5 from the list and call it as hashlib.md5(). The second constant, defined on line 34, is BUFFER_SIZE, which controls how much of the file we read at a time. This value should be small enough, 1 MB in this instance, to limit the memory required per read, yet large enough to keep down the number of reads we perform on the file. The best value depends on the system you run the script on, so you may consider exposing it as an argument instead of a constant:

001 """Sample script to hash large files efficiently."""
002 import argparse
003 import hashlib
...
033 HASH_LIBS = ['md5', 'sha1', 'sha256', 'sha512']
034 BUFFER_SIZE = 1024**2
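
As a quick sanity check on the list described above, the following sketch confirms that every name in HASH_LIBS is something hashlib can actually provide. It relies on hashlib.algorithms_guaranteed, the set of algorithm names the standard library promises on every platform; the missing variable is introduced here purely for illustration.

```python
import hashlib

HASH_LIBS = ['md5', 'sha1', 'sha256', 'sha512']

# Names in our list that hashlib does not guarantee on every platform.
missing = [name for name in HASH_LIBS
           if name not in hashlib.algorithms_guaranteed]

# Each name should also be reachable as an attribute, e.g. hashlib.md5.
has_constructor = all(hasattr(hashlib, name) for name in HASH_LIBS)

print(missing)
print(has_constructor)
```

An empty list means every entry can safely be looked up by name later in the script.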

Next, we define our arguments. This is very brief as we're only accepting a filename and an optional algorithm specification:

036 parser = argparse.ArgumentParser()
037 parser.add_argument("FILE", help="File to hash")
038 parser.add_argument("-a", "--algorithm",
039     help="Hash algorithm to use", choices=HASH_LIBS,
040     default="sha512")
041 args = parser.parse_args()
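
The earlier suggestion of exposing the buffer size as an argument could be sketched as follows. The -b/--buffer-size option, the DEFAULT_BUFFER_SIZE name, the build_parser() helper, and the evidence.dd filename are all assumptions made for this illustration; they are not part of the script above.

```python
import argparse

DEFAULT_BUFFER_SIZE = 1024 ** 2  # 1 MB, the size described in the text


def build_parser():
    """Build a parser with a hypothetical --buffer-size option."""
    parser = argparse.ArgumentParser()
    parser.add_argument("FILE", help="File to hash")
    parser.add_argument(
        "-b", "--buffer-size", type=int, default=DEFAULT_BUFFER_SIZE,
        help="Bytes to read per chunk (default: %(default)s)")
    return parser


# Passing a list to parse_args() lets us exercise the parser without a shell.
args = build_parser().parse_args(["evidence.dd", "--buffer-size", "4194304"])
print(args.buffer_size)
```

With type=int, argparse converts the value for us and rejects non-numeric input before any file is opened.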

Once we know the specified arguments, we'll translate the selected algorithm from an argument into a function we can call. To do this, we use the built-in getattr() function, as shown on line 43. It lets us retrieve attributes, such as functions and properties, from an object by name (here, a function from the hashlib module, as shown in the following code). We end the line with () because we want to call the selected algorithm's constructor and create a hash object, alg, that we can use to generate the hash. This one-liner is equivalent to calling alg = hashlib.md5() (for example), but it's driven by the argument value:

043 alg = getattr(hashlib, args.algorithm)()
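
To make the equivalence concrete, this small sketch builds the same hash object three ways and compares the digests. The name variable stands in for args.algorithm; hashlib.new() is a standard-library alternative that accepts the algorithm name as a string and is shown here only for comparison, not as the method used by the script.

```python
import hashlib

name = 'sha256'  # stand-in for args.algorithm

via_getattr = getattr(hashlib, name)()  # the approach used on line 43
direct = hashlib.sha256()               # the hard-coded equivalent
via_new = hashlib.new(name)             # alternative lookup by name

for hash_obj in (via_getattr, direct, via_new):
    hash_obj.update(b"example data")

same = via_getattr.hexdigest() == direct.hexdigest() == via_new.hexdigest()
print(same)  # True
```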

On line 45, we open the file in binary read mode, and on line 47 we read the first BUFFER_SIZE bytes into our buffer_data variable. We then enter a while loop, where we update our hash object on line 49 before reading the next buffer of data on line 50. Conveniently, read() returns whatever data remains in input_file when fewer than BUFFER_SIZE bytes are left, and it returns an empty bytes object at the end of the file. Because an empty bytes object is falsy, the loop ends once we reach the end of the file, and the with context closes the file for us on exit. Lastly, on line 52, we print the .hexdigest() of the hash we calculated:

045 with open(args.FILE, 'rb') as input_file:
046 
047     buffer_data = input_file.read(BUFFER_SIZE)
048     while buffer_data:
049         alg.update(buffer_data)
050         buffer_data = input_file.read(BUFFER_SIZE)
051 
052 print(alg.hexdigest())
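
The following sketch illustrates the two behaviors the loop depends on: read() returning a short final chunk followed by an empty bytes object, and a chunked hash matching a hash computed in one pass. The hash_file() helper, its parameters, and the temporary test file are assumptions introduced for this demonstration; the loop body itself mirrors the listing above.

```python
import hashlib
import io
import os
import tempfile

# read(n) returns at most n bytes: a short final chunk, then b"" at the end.
stream = io.BytesIO(b"abcde")
reads = [stream.read(4), stream.read(4), stream.read(4)]
print(reads)  # [b'abcd', b'e', b'']


def hash_file(path, algorithm="sha512", buffer_size=1024 ** 2):
    """Hypothetical helper wrapping the read loop from the listing."""
    alg = getattr(hashlib, algorithm)()
    with open(path, "rb") as input_file:
        buffer_data = input_file.read(buffer_size)
        while buffer_data:  # b"" is falsy, so the loop stops at end of file
            alg.update(buffer_data)
            buffer_data = input_file.read(buffer_size)
    return alg.hexdigest()


# Write random bytes whose length is not a multiple of the buffer size.
payload = os.urandom(3 * 1024 + 17)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

try:
    chunked = hash_file(tmp_path, "sha256", buffer_size=1024)
    one_shot = hashlib.sha256(payload).hexdigest()
    print(chunked == one_shot)  # True
finally:
    os.remove(tmp_path)
```

Because update() can be called repeatedly, feeding the data in pieces gives the same digest as hashing it all at once, which is what keeps memory use bounded by the buffer size.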