© Seth James Nielson, Christopher K. Monson 2019
S. J. Nielson, C. K. MonsonPractical Cryptography in Pythonhttps://doi.org/10.1007/978-1-4842-4900-0_2

2. Hashing

Seth James Nielson1  and Christopher K. Monson2
(1)
Austin, TX, USA
(2)
Hampstead, MD, USA
 
Hashing is a cornerstone of cryptographic security. It involves the concept of a one-way function or fingerprint. Hash functions only work well when a couple of things are true about them:
  • They produce repeatable, unique values for every input.

  • The output value provides no clues about the input that produced it.

Some hashing functions are better at satisfying these requirements than others, and we’ll talk about some good ones (SHA-256) and some not-so-good ones (MD5, SHA-1) to demonstrate both how they work and why choosing a good one is so terribly important.

Hash Liberally with hashlib

WARNING: MD5 Is No Good

We are going to use an algorithm called MD5 for about the first half of the chapter. MD5 is deprecated and should not be used for any security-sensitive operations, or really any operations at all, except when you have to interact with legacy systems.

This discussion is for introducing the concept of hashing and for providing historical context. MD5 is nice for that because it produces short hashes, has a rich history, and gives us something to break.

When we last left our two favorite spies from East Antarctica, Alice and Bob were working out some codes using simple substitution ciphers. Even though the cipher was very weak, it provided a rudimentary form of message confidentiality.

It did nothing, however, for message integrity. If you haven’t already guessed, message confidentiality means that nobody but the authorized parties can read the message. Message integrity means that no unauthorized parties can change the message without the change being noticed.

It is important to understand the distinction. Even with modern ciphers, just because a message can’t be read doesn’t mean it can’t be altered, even in ways that make sense after decryption.

Also, when Alice and Bob go through customs at the WA border, sometimes their laptops are inspected. It would be nice to know that none of the files have been tampered with during that process.

Fortunately for Alice and Bob, their new technology officer introduces them to something called a “message digest” to “fingerprint” their files and message transmissions. He explains that they can combine their messages’ contents with message digests, then using the two together, they can tell whether part of any message was altered. That sounds like just the thing!

Since they don’t know anything about digests, it’s time for some training. Let’s follow along with their instructor in our own Python interpreter, starting with Listing 2-1.
>>> import hashlib
>>> md5hasher = hashlib.md5()
>>> md5hasher.hexdigest()
'd41d8cd98f00b204e9800998ecf8427e'
Listing 2-1

Intro to hashlib

Importing a library called hashlib seems straightforward enough, but what is md5?

The instructor explains that the “MD” in MD5 stands for “message digest.” We’ll get into some interesting details in just a moment, but for now, a digest like MD5 converts a document of any length (even an empty document) into a large number that takes up a fixed amount of space. It should have at least these features:
  • The same document always produces the same digest.

  • The digest “feels” random: if you have a digest, it gives you no clues about the document.

In this way, a digest is like a fingerprint and is sometimes called one: it is a small amount of data that stands in for the document’s identity; every document we might ever care about should have a completely unique digest.

Human fingerprints are similar in other ways. If you have a person at hand, it’s easy to produce a (relatively) consistent and unique fingerprint; but if the only thing you have is a fingerprint, it’s not so easy to find out whose it is. Digests work the same way: given a document, it’s easy to calculate its digest; but given only a digest, it’s very hard to find out what document produced it. Very hard. The harder, the better, in fact.

The MD5 digest creates a number that always occupies 16 bytes of memory. In our example interpreter session, we asked it to produce a digest for the empty document, which is why we didn’t add any data to the md5hasher before asking it to produce a digest for us. The use of hexdigest is shown to demonstrate a more human-readable format for the number, where each of the 16 bytes in the digest is shown as a two-character hexadecimal value.

The instructor, anxious to move on, asks Alice and Bob to hash each of their names (expressed as bytes). To the interpreter, and Listing 2-2!
>>> md5hasher = hashlib.md5(b'alice')
>>> md5hasher.hexdigest()
'6384e2b2184bcbf58eccf10ca7a6563c'
>>> md5hasher = hashlib.md5(b'bob')
>>> md5hasher.hexdigest()
'9f9d51bc70ef21ca5c14f307980a29d8'
Listing 2-2

Hash Names

For short strings like these, it’s not uncommon to combine operations, like Listing 2-3.
>>> hashlib.md5(b'alice').hexdigest()
'6384e2b2184bcbf58eccf10ca7a6563c'
>>> hashlib.md5(b'bob').hexdigest()
'9f9d51bc70ef21ca5c14f307980a29d8'
Listing 2-3

Combine Operations

“So, Alice, Bob, what did you learn from this?” the instructor asks. When neither one answers, she suggests that they experiment some more. Let’s follow along.

Python differentiates between Unicode strings and raw byte strings. A full explanation of the differences is beyond the scope of this book, but for almost all cryptographic purposes, you must use bytes. Otherwise you can end up with some very nasty surprises when the interpreter attempts (or refuses) to convert Unicode strings into bytes for you. We forced our string literals to be bytes using the b'' string syntax. In other examples where user input requires us to start with Unicode strings, we will encode those to bytes ensuring that it is safe to do so.

Exercise 2.1. Welcome To MD5

Compute more digests. Try computing the MD5 sum of the following inputs:
  • b'alice' (again)

  • b'bob' (again)

  • b'balice'

  • b'cob'

  • b'a'

  • b'aa'

  • b'aaaaaaaaaa' (ten copies of the letter “a”)

  • b'a'*100000 (100,000 copies of the letter “a”)

What did you learn about MD5 sums from Exercise 2.1? We will talk about these further in the chapter, but let’s jump back to our intrepid Antarcticans.

“Okay, Alice and Bob,” the instructor says. “A couple of things. These digest objects don’t require the entire input all at once. It can be inserted a chunk at a time using the update method ,” as shown in Listing 2-4.
>>> md5hasher = hashlib.md5()
>>> md5hasher.update(b'a')
>>> md5hasher.update(b'l')
>>> md5hasher.update(b'i')
>>> md5hasher.update(b'c')
>>> md5hasher.update(b'e')
Listing 2-4

Hash Update

The instructor asks Alice and Bob, “What do you think the output of the md5hasher.hexdigest() instruction will be?” Try it out and see if you got it right!

“Great,” the instructor says when they’ve finished. “Your introductory training is almost over. Just one more exercise!”

Exercise 2.2. Google Knows!

Do a quick Google search using the following hashes (enter the hashes literally into the Google search bar):
  1. 1.

    5f4dcc3b5aa765d61d8327deb882cf99

     
  2. 2.

    d41d8cd98f00b204e9800998ecf8427e

     
  3. 3.

    6384e2b2184bcbf58eccf10ca7a6563c

     

Making a Hash of Education

Within the realm of computer security, the terms “hashing” or “hash function” always refer to cryptographic hash functions, unless otherwise stated. There are some very useful non-cryptographic hash functions as well. In fact, you were taught a very simple one in grade school: computing whether a number is even or odd. Let’s see how this simple, familiar function illustrates principles that apply to all hash functions.

Hash functions are fundamentally trying to map an enormous (even infinite) number of things onto a (relatively) small set of things. When using MD5, for example, no matter how big our document is, we end up with a 16-byte number. In discrete algebra terms, this means that the domain of a hash function is much larger than its range. Given a very, very large number of documents, chances are that many of them will produce the same hash.

Hash functions are therefore lossy. We lose information going from our source document to a digest or hash. This is actually critical to their function, because without a loss of information, there would be a way to go backward from the hash to the document. We really don’t want that, and we’ll see why soon.

Thus, computing whether a number is even or odd fits this description quite well. No matter how large or interesting the (integer) number, we can squash it down into a single bit of space: 1 for odd, 0 for even. That’s a hash! Given any number of any size, we can efficiently produce its “oddness” value, but given its oddness, we would be hard-pressed to figure out which number produced it. We can create a very, very large number of possible inputs, but we can’t know which one specifically was used to make that answer.

An “even or odd” bit is sometimes called a “parity” bit and has often been used as a rudimentary error detection code.

The even/odd hash example illustrates this principle of “squashing” an input down to a fixed size value. This value is consistent, meaning you won’t get a different value out if you put the same number into it twice. It compresses large inputs into a fixed-size space (just one bit!), and it is lossy: you can’t tell me which number was used as input by examining only the output.

All hash functions, including non-cryptographic hash functions, have the fundamental qualities of consistency, compression, and lossiness and have all kinds of important applications in computer science. These qualities alone, however, are not enough for a hash function to be cryptographic or secure: for those, a hash function needs a few more properties [11, Chap. 9]:
  • Preimage resistance

  • Second-preimage resistance

  • Collision resistance

We’ll talk about each of these important qualities in turn.

Preimage Resistance

Informally, a preimage is the set of inputs for a hash function that produce a specific output. If we were to apply that to our parity example from earlier, the preimage for an odd parity bit is the (infinite!) set of all odd integers. Similarly, the preimage for an even parity bit is the set of all even integers.

What does this mean for a cryptographic hash? Earlier, we computed that the MD5 hash value 6384e2b2184bcbf58eccf10ca7a6563c could be generated by the input b'alice'. Thus, the preimage of
  • MD5(x) = 6384e2b2184bcbf58eccf10ca7a6563c

contains the element x == b'alice'.

This is important, so let’s state it in more precise terms (using integers in our domain and range—remember, a document is ordered bits and is therefore just a big integer):

Preimage: A preimage for a hash function H and a hash value k is the set of values of x for which H(x) = k.

For cryptographic hash functions, this concept of the preimage is important. If I give you a digest value, there might (should) be infinitely many input numbers that could be used to produce it. Those numbers are the preimage for that digest. Remember, every document is just a large integer number from the computer’s point of view. It’s all just bytes, and we’re just performing a mathematical operation on them. The preimage is therefore just an infinite set of integer numbers.1

The idea of preimage resistance is basically this: if you hand me a digest and I don’t already know how you got it, I can’t even find one element in the preimage for it without doing a ridiculous amount of work. Ideally I would have to do an impossible amount of work.

It’s already hard to (in general) find the entire preimage; it’s way too big. What we’re really interested in is making it tough to find any element in the preimage unless you already happen to know one. That’s where lossiness comes in: the digest should give us no information whatsoever about the document that produced it. Without any information to guide us, the best we can do is random guessing or trying everything until we accidentally land on one that produces the right digest. That is preimage resistance.

The process of attempting to find an element in the preimage for a given output is called inverting the hash: trying to run it backward to get an input for a given output. Preimage resistance means that finding any inverse is hard .

This is why the even/odd function is a potentially useful hash function, but not a secure hash function. If I give you an even/odd value, you can readily come up with something that matches. I say “even,” you say “2,” for example. That’s not very preimage-resistant, because you just told me an input that produces the given output, and you didn’t have to think very hard to do it. In fact, you can describe the entire preimage without much trouble: “all even integers.” For cryptographic hash functions, if I tell you MD5(x) = ca8a0fb205782051bd49f02eae17c9ee, you (ideally) can’t tell me what x is unless you can find someone who already knows and is willing to tell you. MD5 is hard to invert.

Now, you could just try random (or ordered) documents to see if any of them produce ca8a0fb205782051bd49f02eae17c9ee, and you might get (very!) lucky. That approach is one kind of a brute-force attack because you just have to pick your way through every single straw in the haystack to find the needle you are looking for. All you can do is commit to looking at an awful lot of straw, relying on raw stamina to carry you through.

Because consistency is a property of hashes, if you already happen to have an input that maps to a given output, or you can find it by searching Google, for example, then that particular output is trivially inverted. The ASCII text “alice” always maps to “6384e2b2184bcbf58eccf10ca7a6563c” when run through MD5, no matter what, so if you happen to know that those two things go together, you can easily find “alice” from the digest. For that particular output, MD5 is trivially inverted. That doesn’t mean MD5 is not preimage-resistant, though: to break that you would need to find an easy way to always find an input given an output without knowing one beforehand.

That leaves us with brute force again. How long would it take you to “guess” an element of the preimage for an MD5 hash using a brute-force technique (either random guessing or sequential searching)? To answer this, we first need to look at how many possible hash values there are. We know that MD5 always produces a 16-byte digest, and we can use that to figure out how hard it should ideally be to invert MD5. To do so, we’ll need some understanding of binary (base-2), decimal (base-10), and hexadecimal (base-16) positive integers (plus 0, but we usually just say “non-negative”).

If you already understand those pretty well, feel free to skip to the next section.

Byte into Some Non-negative Integers

Most computers use binary to represent everything. The binary number system is represented in base 2. One nice way to be introduced to it is through counting. Here we have the familiar base-10 (decimal) number on the left and the corresponding binary number on the right:
0     0
1     1
2    10
3    11
4   100
5   101
6   110
7   111
8  1000
9  1001

How does counting work in this system? We start with 0, which is nice and familiar. Adding 1 gives us 1, which is expected. So far, so good. But then, since we’re in base 2, we run out of digits when we try to do it again! Just like there is no single digit representing the number “10” in our decimal system, there is no single digit representing “2” in binary!

What do we do when we run out of digits in base 10? We use place values. The very number “10” shows this: there is “1 ten” and “0 ones” in that number. It’s the number that comes after “9.”

Binary is similar. When we move up one number from “1,” we run out of digits, so we put a “1” in the “twos column” and start over at “0” in the “ones column.”

What might seem remarkable is the fact that you can represent every non-negative integer this way, just like you can with decimal. The base value (“base-2,” “base-10,” “base-16,” etc.) tells you how many digits you have to work with and therefore what the place values mean. Here are a few place values in different number systems. Note that people get a little bit careless with these things, using decimal to talk about them, but really the number system is arbitrary. When it comes to that, there are ten kinds of people in the world: those who understand binary and those who don’t.2

That’s a big problem when teaching number systems: what does “10” mean without knowing what base we’re operating in? The default is to assume it means “ten” unless the base is explicitly stated, or is really obvious, like with hexadecimal, where we see “a”–“f” along with the more common decimal digits. We’ll do that here too: if you don’t see a base or you can’t easily tell what it is, you are looking at decimal.

 

Place 3

Place 2

Place 1

Place 0

Binary

8

4

2

1

Decimal

1000

100

10

1

Hexadecimal

4096

256

16

1

Or, put another way:
 

Place 3

Place 2

Place 1

Place 0

Binary

23

22

21

20

Decimal

103

102

101

100

Hexadecimal

163

162

161

160

All of these number systems work in the same way: place value is determined by adding one to an exponent on the base.

In decimal, therefore, the number 237 really means 2 ⋅ 102 + 3 ⋅ 101 + 7 ⋅ 100 = 200 + 30 + 7.

The same number in hex (we’ll use xh to mean “x in hexadecimal”) is edh, which means $$ {e}_h\cdot {10}_h^1+{d}_h\cdot {10}_h^0 $$. But what does that mean? Well, eh = 14d in decimal, and dh = 13d. Since 10h has a 1 in the sixteens column, we get (in decimal) 14 ⋅ 16 + 13 = 237.

Why do we care about hexadecimal in the first place, other than its relative compactness? Hexadecimal (or “hex”) is useful because its place values are multiples of 2 (they’re multiples of 24, to be exact), so it lines up nicely with binary. Consider the following table with hex on the left and binary on the right:
0     0
1     1
2    10
3    11
4   100
5   101
6   110
7   111
8  1000
9  1001
A  1010
B  1011
C  1100
D  1101
E  1110
F  1111
We ran out of digits in hex at exactly the same time that we needed to go from four columns to five in binary! That’s really helpful, because it means we can trivially convert back and forth between a computer’s native and sprawling binary numbers to the much more human-friendly and compact hex numbers. People even get good enough at this that they can just translate them on sight. Here’s an example with binary on top and hex underneath:
101 1100 1010 0011 0111
5   c    a    3    7

No matter how big a binary number gets, you can take every four bits and write them as a single hexadecimal digit.

The point of walking through this review of binary is to emphasize once again that every sequence of bits in a computer is a number. What if those bits are a document? That’s a number. What about if they represent an image? That’s just a big number.

The “meaning” of those bits is not in the computer, it’s in our minds.

We may display the bits a certain way, but we humans choose to do that based on what we think they mean. The computer has no idea what they really mean. They’re just numbers. Can we store the meaning itself somehow? Well, sure, but that would force us to encode the meaning as a number, because numbers are all that computers understand. Even their instructions are just numbers.

Philosophize much, do we? It’s actually a pretty important thing to understand if you really want to know how computers work, and we definitely need people in the world who do. Data and code are just big numbers, and computers basically just fetch, store, and do arithmetic on them.

How Hard a Hash!

With that little side trip, we can now answer what we wanted to answer in the first place: how hard would it be, in general, to invert MD5 using brute force? We can take a stab at this by looking at the size of its output. MD5 outputs a value in 16 bytes, which is 16 ⋅ 8 = 128 bits. With n bits we can express 2n individual values, so MD5 can output a lot of different digests. This many, in fact (in decimal):3

340,282,366,920,938,463,463,374,607,431,768,211,456.

Even if you checked 1 million values per second (and were guaranteed that nothing you checked produced an output you had seen before), it would still take you about 1026 years (100 million billion billion!) to find a suitable input by brute force. By comparison, our sun is only expected to continue sustaining earth life for at most another 5 billion years; your computer would need to keep running for many, many times that long.

If you have a cryptographic algorithm whose only means of being broken is brute force, you have a good algorithm. The trouble is, you don’t necessarily know that it’s good. But this gives us an upper bound on how long it would take to find an input that produces a particular hash in MD5. At least it won’t take longer!

Second-Preimage and Collision Resistance

Once preimage resistance is understood, the other two properties are relatively easy to grasp. We wandered a bit into brute force and binary near the end of that last section, so let’s quickly review:
  • Preimage resistance means it is really hard to find a document that produces a particular digest, unless you already know one.

Second-Preimage Resistance

Second-preimage resistance means that if you already have one document that produces a particular digest, it’s still really hard to find a different document that produces that same digest.

In other words, just because you know that
  • MD5(alice) = 384e2b2184bcbf58eccf10ca7a6563c,

it doesn’t mean you can find another value to put into $$ MD5\left(\cdot \right) $$ that will give you the same value. You would have to resort to brute force again.

To tie it back to its name, if you have one member of the preimage already, it is not any easier to find a second member of the preimage: there is no exploitable pattern in the preimage .

Collision Resistance

Collision resistance is a bit more subtle than either of the preimage characteristics we just mentioned. Collision resistance means that it’s hard to find any two inputs that produce the same output: not a specific output, just the same output.

A classic way of describing this is by using birthdays.4 Suppose you are in a room full of people and you want to find two of them whose birthday is February 3. How likely is that? Not necessarily very likely, if you really picked it at random.

But now let’s say you want to do something else. You want to know whether any two people have the same birthday. You don’t care what day of the year it falls on, you just want to know whether anyone’s birthday overlaps with anyone else’s. How likely is that? It turns out that, in general, it is far, far more likely. After all, we just removed the constraint of a particular day, and now all we want is a collision on any day.

That’s the basic idea behind collision resistance. When a hash algorithm is resistant to collision, it is resistant to being able to purposefully create or pick any two inputs that produce the same digest, without deciding what that digest should be beforehand .

MD5 appears to be fairly collision-resistant. One property that helps with this is the fact that small changes in input can produce very large changes in output. Consider Exercise 2.1, where you produced hashes for very similar values like “a” and “aa,” or “bob” and “cob.” The digests resulting from performing MD5 on these values were not just different, they were wildly different :
bob: 9f9d51bc70ef21ca5c14f307980a29d8
cob: 386685f06beecb9f35db2e22da429ec9
There is no discernible pattern that would tie one to the other. This is due to a property shared by many hashes and cryptographic ciphers called the avalanche property : a change to the input, no matter how small, creates a large and unpredictable change in the output. Ideally, 50% of the output bits should be altered for small input changes [11, Chap. 7]. Did we achieve that with “bob” and “cob”? Let’s take a look at the digests in binary using some Python to aid our exploration (note that our bit string is quite long, so it is broken over two lines in Listing 2-5).
>>> hexstring = hashlib.md5(b'bob').hexdigest()
>>> hexstring
'9f9d51bc70ef21ca5c14f307980a29d8'
>>> binstring = bin(int(hexstring, 16))
>>> print("{}\n{}".format(binstring[2:66], binstring[66:]))
1001111110011101010100011011110001110000111011110010000111001010
0101110000010100111100110000011110011000000010100010100111011000
Listing 2-5

Avalanche

The following illustration visualizes the changes in bits when given inputs b'bob' and b'cob',
MD5(bob):
   9   f   9   d   5   1   b   c   7   0   e   f   2   1   c   a
1001111110011101010100011011110001110000111011110010000111001010
   5   c   1   4   f   3   0   7   9   8   0   a   2   9   d   8
0101110000010100111100110000011110011000000010100010100111011000
MD5(cob):
   3   8   6   6   8   5   f   0   6   b   e   e   c   b   9   f
0011100001100110100001011111000001101011111011101100101110011111
   3   5   d   b   2   e   2   2   d   a   4   2   9   e   c   9
0011010111011011001011100010001011011010010000101001111011001001
Changed Bits:
X_X__XXXXXXXX_XXXX_X_X___X__XX_____XX_XX_______XXXX_X_X__X_X_X_X
_XX_X__XXX__XXXXXX_XXX_X__X__X_X_X____X__X__X___X_XX_XXX___X___X

In this example, the difference between the hash of “bob” and “cob” impacted 64 bits out of 128. Not bad! Avalanche is an important property, and we will see it again with ciphers in Chapter 3.

Exercise 2.3. Observing Avalanche

Compare the bit changes between a wide range of input values.

Avalanche helps collision resistance , because it is hard to produce a document, then come up with predictable changes that will still cause it to produce the same digest. If a small change in a document produces an unpredictable and large change in the digest, then creating collisions on purpose is likely to be a difficult problem, pushing us toward brute force again to solve it.

Remember the birthday analogy from earlier? Finding collisions is not as hard as finding a value in the preimage. Preimage resistance for an n-bit digest means an attacker would expect to compromise your hash after trying 2n, where it would only take 2(n/2) attempts to find a collision. That’s not half as many tries, that’s a number of tries with half as many zeros in it. The difference is astounding. Concretely for MD5, finding a document for a given digest should take about 2128 attempts, where finding two documents that collide should only take 264 attempts.

As it happens, MD5’s collision resistance is actually nowhere near that good. It has been “broken,” meaning that techniques have been discovered for finding collisions in far fewer than the expected 264 attempts. In short, the problem can be solved by something other than brute force and in less than an hour [17]. Keep that in mind, and we’ll come back to it later on .

Digestible Hash

By this point, you should know enough to create a Python program that computes the MD5 digest5 of a file. This is a common use for hashes and a good exercise to work through. Remember, you must use Python bytes, not Python Unicode strings, as inputs. If you try to open a Python file with the default mode, it may open it as a text file and read the data as strings, doing implicit decoding. You should, instead, open the file in “rb” mode so that all reads produce raw bytes. For a text file, you might be tempted to read the data as a string and then use the string’s encode method to convert to bytes, but depending on configuration, this encoding may not be what you expect and lead to nasty surprises.

Exercise 2.4. MD5 of a File

Write a python program that computes the MD5 sum of the data in a file. You don’t need to worry about any of the file’s metadata, such as last modified time or even the file’s name, only its contents.

You should check your work for Exercise 2.4. If you’re using an Ubuntu Linux system , the md5sum utility is already installed. Run this utility from the command line with a file as input and see if it produces the same hex digest as your utility.

Speaking of Ubuntu, this is a perfect example of using hashes for file integrity. Visit the web site for Ubuntu releases. At the time of this writing, the web site is https://releases.ubuntu.com . If you take a look at the “Bionic Beaver” distribution, for example, you will find that there are number of files available for download. In particular, there are two ISOs, but they are available directly or through other downloading technologies such as BitTorrent.

There’s also a file called MD5SUMS. Take a look. For this distribution, the contents of this file should be as follows:
f430da8fa59d2f5f4262518e3c177246 *ubuntu-18.04.1-desktop-amd64.iso
9b15b331455c0f7cb5dac53bbe050f61 *ubuntu-18.04.1-live-server-amd64.iso

Once downloaded, you can verify that the data is uncorrupted by running an MD5 sum on the ISO.

How is the MD5 hash value helpful? It will not protect you from somebody that has compromised the Ubuntu web site. If they upload a fake Ubuntu to the web server, they can upload a fake MD5 sum as well.

The MD5 sum does, however, make it easier for you to obtain the Ubuntu ISO from other sources and know that it’s authentic. For example, suppose that you were about to download the ISO file directly from the Ubuntu web site when a coworker stops by and says that you can use the already-downloaded one they have on a USB drive. You can download the (relatively small) file of MD5 sums from Ubuntu’s official site and check them against the (much larger) files on the drive before trusting them.

Looking in the Ubuntu directory, you will also see a file called SHA1SUMS and SHA256SUMS. What are these?

So far, we’ve only talked about MD5 as a way of teaching some of the principles of hashing. MD5 was a standard approach to cryptographic hashing for a long time too, but it has been broken: people have discovered methods much faster than brute force for inducing collisions, so it is being phased out in favor of other hash functions.

Interestingly enough, being “broken” often means “someone can solve a problem in an order of magnitude less time than brute force.” For example, that might mean that preimage values can be found in 2127 tries on average instead of 2128. That’s still hard, just not as hard as it should be. When looking at articles indicating that something has been broken, it’s important to find out exactly what that means. Does it mean one of the fundamental properties no longer holds? Does it mean it holds, but isn’t as hard to get around? What if it’s more than one property? These things matter.

With MD5, researchers have found a way to “break” preimage resistance [12]. They showed that they could find a preimage for an MD5 hash faster than 2128 tries. How much faster? Well, their algorithm takes slightly longer than 2123 tries or, in decimal, 10,633,823,966,279,326,983,230,456,482,242,756,608 tries. This attack is considered theoretical because it is still not useful in practice: 2123 is still huge.

On the other hand, MD5 has been shown to be very, very broken where collision resistance is concerned. It is reasonably easy to create two inputs that produce the same MD5 output. This has been shown to enable a practical attack for getting a false certificate for use in TLS, which is used for all kinds of secure Internet communication. We won’t get into the details here, because we haven’t talked about certificates yet, but we’ll revisit this when we get to TLS at the end of the book.

On the other hand, collision resistance is not the same as second-preimage resistance. Remember that second-preimage resistance prevents you for finding a second member of the preimage for an output when you already have the first. Even though MD5’s collision resistance is broken, its preimage resistance is not. Returning to our Ubuntu distribution example, if you’re getting your distribution from an intermediary, they are not able to create an alternate distribution with the same MD5 digest.

The Ubuntu organization, however, could exploit the MD5 collision resistance weakness to create two separate distributions that have the same MD5 sum. Perhaps, in conjunction with a government, they could sell one distribution with all kinds of tracking software to another government hostile to the first. The MD5 sum could not be used to ensure that the same ISO was being distributed to all parties.

Also, once a cryptographic algorithm is broken in one way, there is increased suspicion that it is broken in other ways as well. So even though nobody has demonstrated a practical attack on MD5’s preimage or second-preimage resistance, many cryptographers worry that such vulnerabilities exist.

To reiterate the warning at the beginning of the chapter, DO NOT USE MD5. It has been deprecated for over 10 years (decimal), and some of its security flaws have been known for two decades.

The SHA-1 hash is another algorithm that was widely viewed as the replacement for MD5. SHA-1’s collision resistance has also recently been broken, however, as researchers have shown that it is relatively easy to create two inputs that hash to the same output [13]. So, as with MD5, DO NOT USE SHA-1, either.

At the time of this writing, best practice is to use SHA-256. Fortunately, this means very little to you if you are using hashlib: just change the hasher, as shown in Listing 2-6.
>>> import hashlib
>>> hashlib.md5(b'alice').hexdigest()
'6384e2b2184bcbf58eccf10ca7a6563c'
>>> hashlib.sha1(b'alice').hexdigest()
'522b276a356bdf39013dfabea2cd43e141ecc9e8'
>>> hashlib.sha256(b'alice').hexdigest()
'2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac4f93af71db186d6e90'
Listing 2-6

Change to SHA-256

You should notice that these different hash algorithms have different lengths. MD5, of course, outputs 16 bytes (128 bits). If it isn’t obvious, SHA-1’s output is 20 bytes (160 bits). And, more simply, SHA-256’s output is 32 bytes (256 bits).

If you thought it would take a long time to invert MD5 (find a preimage for a given output), take a look at SHA-1. Because the output is 160 bits, it will take 2160 tries, or

1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976

tries to find a preimage. And SHA-256 requires 2256 tries, or

115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936.

Good luck with that!

Pass Hashwords...Er...Hash Passwords

Another common use for hash functions is password storage. When you create an account on a web site, for example, they almost never store your password. Typically, they store a hash of the password. That way, if the web site is ever compromised and the password file is stolen, the attacker cannot recover anyone’s password.

What does this mean? When you send your password (over a secure channel, via HTTPS), the server doesn’t need to store it to check it. When you registered, your password was hashed, and the hash was stored. We’ll call that H(password). When you go to log in later, you send a password that we’ll call the “proposal”: you’re proposing that this is your true password, and the server needs to verify that.

So, you attempt to log in by sending your proposed password over a secure connection, and the server now has two things for you: it can look up H(password) from your username, and it has the proposal that you just submitted. All it has to do is check that H(proposal) = H(password) and let you through if they are the same.

What if you don’t trust the service to not actually store your password? That can be a valid worry, particularly since we’ve seen so many sites with stolen passwords in recent years. Why not use the power of JavaScript to hash your password in the browser, then send that to the server? Then the server never even sees your password in memory, let alone in its database !

There are a few big problems with this:
  • The code that hashes the password in your browser came from that server in the first place, so you still have to trust the service.

  • If you don’t have a secure channel for your password, then someone can read it in transit. If you do have a secure channel, then you might as well just send the password. You have to trust the service already.

  • If you send a hash successfully, that just became your password. Yes, you can generate it from some other easy-to-remember thing, but now you have to protect that hash value as well. The server has to hash your hash anyway, so that an attacker that makes off with the database can’t just log in using what’s stored there.

In short, if you were to use a hash as your password, the right way is to generate that hash from your password and the site name of interest using a tool separate from your browser, and then use the result as your password. This is essentially the same thing as just creating a brand new password and remembering it in a secure place, like a password manager.

Just do that instead. Then the server will never see a password that you use somewhere else, because you created a brand new random one for it.

Far better than trying to solve security with a hash is to use multiple forms of authentication that are proven to make it harder to steal your identity online, typically involving a hardware token attached to your computer.

Most common forms of two-factor authentication don’t help and actually make things worse. Secret questions are one of those. It’s usually easy to get answers to them, and if it isn’t, they’re just one more hard thing to remember, unless written down. Plus, now you have several things that can be used as a password for the site, which means attackers have more opportunities to get in by guesswork. SMS has been shown to be critically weak and easy to hijack as well, so having a code sent via SMS to your phone is no good.

Properly deployed challenge-response hardware tokens don’t have any of these problems. They are something you possess instead of just another thing you know, and they can’t be guessed or spoofed by people listening in on the connection or pretending to be some other site’s login form. They can’t be given over the phone accidentally, and they can’t be forged.

Ultimately you need two or more factors for authentication anyway for true security. “Fixing passwords” is the wrong place to look for a complete solution.

If server-side hashing is correctly used and if an attacker steals the password file, they will see something like this. From looking at it, can you tell smithj’s password?
...
smithj,5f4dcc3b5aa765d61d8327deb882cf99
...

Look closely. Have you seen that hash value before?

The eagle-eyed reader will remember that hash value from the beginning of the chapter exercises. You were asked to look for that value online. What did you find?

That hash value is the MD5 hash of “password,” and yes, that password is still used far too often. But the deeper problem here is that hash values are deterministic: the same input always hashes to the same output. If an attacker has seen the MD5 sum of “password” once, he will be able to look for that same digest in every password file stolen. How can we fix this?

First, let’s not assume that we can make people stop using dumb passwords.

Let’s assume that they’re going to and we need to fix it anyway. We’ll start with the digest itself.

Recall that MD5 is not broken (in practice) with respect to preimage resistance or second-preimage resistance. So there is no practical attack, at present, for inverting this hash value to the password. Nevertheless, MD5 is broken and should not be used! So let’s take a look at a new password file.
...
smithj,5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
...
Any idea what smithj’s password is now? Yes, it is still “password” but now it’s hashed under SHA-1. That’s better, right? Oh yeah, SHA-1 is broken and should not be used! Let’s try one more time!
...
smithj,5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
...

There! Finally! We’re using a hash algorithm without known vulnerabilities. That’s better, but the problem with deterministic hashing is still a problem. If the attacker knows that this hash maps to the SHA-256 of “password,” smithj is still compromised.

This is where the idea of a “salt” comes into play. A salt is a publicly known value that is mixed in with the user’s password before hashing. By mixing in a salt value, the user’s password will not be immediately discernible as it is now.

This salt has to be chosen correctly. It needs to be unique, and it needs to be sufficiently long. One way of doing this is to use os.urandom and base64.b64encode to generate a strong, random6 salt:
>>> import hashlib
>>> hashlib.md5(b'alice').hexdigest()
'6384e2b2184bcbf58eccf10ca7a6563c'
>>> hashlib.sha1(b'alice').hexdigest()
'522b276a356bdf39013dfabea2cd43e141ecc9e8'
>>> hashlib.sha256(b'alice').hexdigest()
'2bd806c97f0e00af1a1fc3328fa763a9269723c8db8fac4f93af71db186d6e90'

Obviously, your salt output will be different from that shown in the code listing and will be different every time you call it.

Once you have the salt, you store it, then mix the password and the salt with concatenation. For example, prepend the password with the salt before hashing. Now, if the attacker gets your password file, it will be impossible to “recognize” the password from any kind of pre-computed table.

They can still try hashing the salt plus “password” to see if anything matches, though. Guesswork is always a strategy, and it’s a particularly good one for most people’s password choices.

It should be easy to see that the same salt has to be used every time for checking a user’s password. But should the same salt be used for multiple users? Could you generate the salt once for an entire web site and just reuse it?

The answer is a very strong “No!” Can you think of why? What will be the impact if two users are using the same salt? At the very least, it means that it is instantly easy to recognize if two users are sharing the same password. Thus, best practice is to store the user name and salt along with the password hash.

If our friend smithj has the terribly chosen password, “password”, at least it will be stored correctly on our system:
...
smithj,cei6LtJVQYSM+n6Cty0O2w==,
    bd51dac1e2fca8456069f38fcce933f1ff30a656320877b596a14a0e05db9567
...

We have now walked through the basics of password storage, but there are better algorithms. They are built on the same principle but do additional steps to make it even harder for an attacker to invert the password. One highly recommended algorithms for password storage is called scrypt by Colin Percival and described in RFC 7914 [16]. Other popular ones are the newer bcrypt7 ( https://pypi.org/project/bcrypt/ ) and the algorithm considered by some to be its successor: Argon2 ( https://pypi.org/project/argon2/ ).

Fortunately, using scrypt is easy using the cryptography module you set up in Chapter 1. Listing 2-7 is an example derived from the cryptography module’s online documentation. The listing derives the key (hash) to be stored on the file system.
 1   import os
 2   from cryptography.hazmat.primitives.kdf.scrypt import Scrypt
 3   from cryptography.hazmat.backends import default_backend
 4
 5   salt = os.urandom(16)
 6
 7   kdf = Scrypt(salt=salt, length=32,
 8                   n=2**14, r=8, p=1,
 9                   backend=default_backend())
10
11   key = kdf.derive (b"my great password")
Listing 2-7

Scrypt Generate

Both the key and the salt must be stored to disk. The scrypt parameters must be fixed or must also be stored. We will walk through these parameters in a moment, but first, verification is depicted in Listing 2-8 (it is presumed that salt and key are restored from disk).
1   kdf = Scrypt(salt =salt, length =32,
2                n=2**14, r=8, p=1,
3                backend=default_backend())
4   kdf.verify(b"my great password", key)
5   print("Success! (Exception if mismatch)")
Listing 2-8

Scrypt Verify

Pick Perfect Parameters

With regard to the scrypt parameters , let’s talk about backend first. The cryptography module is primarily a wrapper around a lower-level engine. For example, the module can make use of OpenSSL as such an engine. This makes the system faster (because computations aren’t being done in Python) and more secure (because it relies on a robust, well-tested library). Throughout this book, we will always rely on default_backend().

The other parameters are specific to scrypt. The length parameter is how long the key will be once the process is finished. In these examples, the password is processed into an output of 32 bytes. The parameters r, n, and p are tuning parameters that impact how long it will take to compute and how much memory is required. To better protect your password, you want the process to take longer and require more memory, preventing attackers from compromising large chunks of a database at once (every compromise should take a long time).

Fortunately for you, recommended parameters are available. The r parameter should be 8, and the p parameter should be 1. The n parameter can vary based on whether you are doing something like a web site that needs to give a relatively prompt response or something more securely stored that does not need quick responsiveness. Either way, it must be a power of 2. For the interactive logins, 214 is recommended. For the more sensitive files, a number as high as 220 is better.

This is actually an excellent segue into a more general discussion about parameters. A lot of the security in cryptography depends on how parameters are set. Unless you are a cryptography expert, know the exact details of the algorithm, and understand why they are what they are, it may be difficult to choose them properly. It is important that you familiarize yourself with what the parameters mean, at least at a high level, and how they should be used in different contexts. Refer to trusted sources, such as https://cryptodoneright.org , for advice and recommendations. Keep an eye on these sources too. What is presumed to be secure can change as new attacks and computational resources are unveiled.

Cracking Weak Passwords

Let’s take a look at how attackers try to crack passwords. Unfortunately for smithj, choosing such a bad password means that he will most likely be compromised if the password file gets stolen, since attackers will try common words (including words in other stolen databases) against all the hashes anyway. But even less sophisticated methods would probably figure out the password as well.

In this section, we are going to practice cracking weak passwords using the least sophisticated method of all: brute force. This exercise is meant to reinforce why good passwords are so important.

The scenario is this: an attacker has a password file with usernames, salts, and password hashes. What can they do? Well, they could just try all lowercase letter combinations up to a certain length, starting, for example, with “a,” “b,” “c,” and so on.

To make these exercises a little bit easier to start, Listing 2-9 shows some simple code for generating all possible combinations of an alphabet set up to a maximum length.
1   def generate(alphabet, max_len):
2       if max_len <= 0: return
3       for c in alphabet:
4           yield c
5       for c in alphabet:
6           for next in generate(alphabet, max_len-1):
7               yield c + next
Listing 2-9

Alphabet Permutations

Calling generate('ab', 2) will generate 'a', 'b', 'aa', 'ab', 'ba', 'bb'. Using helpful sets in the built-in string module, such as
  • string.ascii_lowercase

  • string.ascii_uppercase

  • string.ascii_letters

makes the following exercises fairly easy. Recall that hashing algorithms require bytes as inputs, so don’t forget to do an encode operation before passing these generated strings to the hashing function, like this:
string.ascii_letters.encode('utf-8').

ASCII letters encode correctly to bytes, so this will not lead to incorrect hashing or unexpected behaviors.

Exercise 2.5. The Power Of One

Write a program that does the following ten times (so, ten full loops with the time computed):
  • Randomly select a single, lowercase letter. This is the “preimage seed.”

  • Use MD5 to compute the hash of this initial letter. This is the “test hash.”

  • In a loop, iterate through all possible lowercase one-letter inputs.
    • Hash each letter in the same way as before, and compare against the test hash.

    • When you find a match, stop.

  • Compute the amount of time it took to find a match.

How long, on average, did it take to find a match for a random preimage seed?

Exercise 2.6. The Power of One, But Bigger!

Repeat the previous exercise, but use an increasingly bigger input alphabet set. Try the test with both lowercase and uppercase letters. Then try it with lowercase letters, uppercase letters, and numbers. Finally, try all printable characters (string.printable).
  • How many total symbols are in each input set?

  • How much longer does each run take?

Exercise 2.7. Password Length’s Effects on Attack Time

Repeat the previous exercise, but this time for two-symbol inputs. Then try it with three and four symbols at a time. How much longer does it take to invert the randomly chosen input?

You will notice that increasing the length of the password and increasing the size of the alphabet both increase the time it takes to invert the hash. Let’s look at the math.

When using just lowercase letters, how many possible one-symbol inputs are there? Rather trivially, there are 26 lowercase letters in ASCII, so 26 one-symbol inputs. At worst, it will take 26 hash computations to invert a one-letter password. But, if we have both lowercase and uppercase letters, this increases the number of hashes needed to 52. Adding digits increases this to 62. There are 100 characters in string.printable. That’s nearly a fourfold increase of the worst-case number of hashes required to do brute-force inversion.

What about when we increase the size to two input symbols? How many two-symbol passwords are there using just lowercase letters? If you can have 26 characters for the first symbol and 26 characters for the second symbol, then there are 26 ∗ 26 = 676 total combinations. That’s quite a jump!

Now look what happens if you use two symbols drawn from the 52 uppercase and lowercase letters. The math works out to be 52 ∗ 52 = 2704! Doubling the size of the input set quadrupled the complexity for two-symbol inputs! If we throw in digits, the worst-case computation is 3844 hashes, and for all printable ASCII characters, it is around 10,000 hashes.

Do the math for three, four, and five symbols, and you can easily see why longer passwords matter. Hackers with GPU-enabled rigs are able to invert anything smaller than six characters, and most passwords under eight, so at a minimum, passwords should be that long. And for the reasons demonstrated here, choosing from all printable letters greatly increases the complexity.

Exercise 2.8. More Hash, More Time

Choosing a complex-to-invert password is the responsibility of the user, but the systems storing the passwords can also slow down attackers by using a more complicated hashing function. Repeat any of the preceding exercises that use MD5, but now use SHA-1 and SHA-256 instead. Record how much longer it takes to get through the brute-force operations. Finally, try out brute force using scrypt. You might not get very far!

One final note. Just because a password is big doesn’t mean it is secure. Attackers will also use large dictionaries to look for known words and phrases, even with various common number or symbol substitutions. A password such as “chocolatecake” is pretty long, but still easily broken. Randomly chosen letters or words are still the best bet. The key is that they are “random,” meaning you would never find them in any real writing or common transformations on real writing. Typically, choosing passphrases that are composed of common utterances reduces a successful attack to seconds instead of years.

Proof of Work

Another area where hashing is used extensively is so-called “proof-of-work” schemes in blockchain technologies. To introduce this, we need to do a very quick overview of how blockchains work.

The basic idea of a blockchain is a “distributed ledger.” The system is a ledger because it records information related to transactions between participants. It can also store additional information, but the primary operations are transactions. It is a distributed ledger because its contents are stored across the set of participants and not in any central location.

The problem is that there is no central location to enforce the correctness of the system. How does the ledger not get corrupted (intentionally or otherwise) by the users? Note that we won’t go into the ledger in detail here, but we do want to talk about the blocks that a ledger is composed of.

Every transaction must be stored in a block. There’s nothing special about a “block”; it’s just a collection of data. Each transaction within the block must be digitally signed by the transactor (we will discuss signatures more in Chapter 5, but for now, simply accept that it means nobody can create a transaction for somebody else without their private key). The overall block structure is protected by a hash. Blocks are copied to the entire set of participants; should anyone try to “lie” about the contents of a block, the data wouldn’t verify correctly and their information would be rejected.

How does a new block get created, and how does it get the protective hash? For this part of the discussion, we will use the Bitcoin network blockchain to walk through these concepts. The designer (or designers, the source is actually unknown) of Bitcoin, who goes by “Satoshi Nakamoto,” wanted to control how quickly new blocks could be created and also wanted the system to incentivize participation. The solution was to award bitcoins to the “miner” that produced a new block while making the production of the new block very difficult.

Basically, at any given time, various parties known as miners are searching for the next block in the blockchain. Any user of blockchain can request a transaction. They broadcast their desired transaction throughout the blockchain network and miners will pick them up. The miners take some set of requested transactions (there is a limited number per block) and create a candidate block. This candidate block has all the right pieces of information. It has the transactions, the metadata, and so forth. But it isn’t the next block in the blockchain until the miner can solve a cryptographic puzzle.

That puzzle is to find a special kind of SHA-256 hash value, specifically a value smaller than a certain threshold. As we discussed earlier, finding an input that produces one particular output would take a really, really long time, but finding any output less than a certain value takes a lot less time. Making that threshold smaller reduces the number of valid hashes, requiring more work to find a suitable value, and that’s how Bitcoin adjusts the difficulty to account for faster hardware or larger computational pools as time goes on. Ultimately, it takes about 10 minutes for the entire Bitcoin network to find a suitable hash. If it takes less than that on average over a period of weeks, the maximum allowed hash value is decreased. Figure 2-1 shows two different example blocks, one with a suitable nonce (a random value that miners are trying to find to produce an acceptable hash) and one without, where the maximum allowed hash value is 2236–1 (20 leading zeros required). For Bitcoin, the very easiest that a problem is allowed to be is determined by a maximum value of 2224–1, which would take our little program an average of 212 times longer than before. That translates to 11.3 hours, and the difficulty is much harder than that today.
../images/472260_1_En_2_Chapter/472260_1_En_2_Fig1_HTML.png
Figure 2-1

Two block hashes with the same content but different nonce values. A nonce that produces a hash with 20 leading binary zeros (5 leading zeros in hexadecimal) is valid. Requiring 20 leading zeros is the same as requiring that the hash number be less than 2 ∗  ∗ 236.

Our program definitely won’t be beating the network’s 10-minute average expectation anytime soon.

Saying that the first few bits must be zero is the same, by the way, as saying the hash value number (the hash is just a number, just like any other string of bits) should be less than some threshold that happens to be a power of 2. Since good hash functions (like SHA-256) produce essentially random hash values, the more structure you impose on the hash, the longer it takes to find one that fits. You can get some intuition for this by thinking about the number of zeros as defining the size of the search space: if you must have a single leading zero, then it’s basically a coin flip; it should only take two tries on average to find a suitable hash that starts with a zero bit. If, on the other hand, you need to find a hash with 8 leading zeros, that’s a harder problem: 256 different numbers can be represented in 8 bits, so on average it will take 256 attempts to find a suitable value.

That’s why this strategy is called “Proof of Work”: if you found a suitable hash under the threshold, you had to have done some work (or you broke the hash function, which is deemed to be extremely unlikely, but potentially awesome for you).

This raises an interesting question: how do each of the network participants decide how hard the problem should be? It isn’t like there is a central authority telling everyone that the difficulty just went from 11 to 12, for example. That would defeat the whole purpose of the network. The “authority” in the network is tacit agreement between participants to use the same algorithms for determining these things. When there is someone on the network doing things differently, their blocks are simply rejected by everyone else and they thus have no incentive to do it wrong. Majority rules.

In the specific case of hashing difficulty, each participant knows the standard algorithm for computing what the number of leading zeros should be and uses that to do mining (or to reject bad proposals from wayward participants looking to compute an easy hash).

You might be asking, however, how a different value of hash is computed when the input data really doesn’t change. That’s a great question, since hashes are deterministic: they always produce the same output given the same input (they wouldn’t be very useful, otherwise!). The answer is that they change one little piece of the input, called the “nonce.” It’s just a number, and it isn’t part of the actual block data: its sole purpose is to enable the proof-of-work concept. When searching for a suitable hash, the participant tries hashing the block with different values for the nonce, typically searching randomly or merely adding 1 to the last value at each attempt. Eventually a suitable hash value is found and the block is sent to all other participants for validation.

Every participant then verifies the block by performing the hash for themselves, checking the leading zeros against their algorithm, and making sure that their answer matches the submitted hash value. If it’s good, they accept it and the chain lengthens.

Exercise 2.9. Proof of Work

Write a program that feeds a counter into SHA-256, taking the output hash and converting it to an integer (we did this earlier before converting to binary). Have the program repeat until it finds a hash that is less than a target number. The target number should start out pretty big, like 2255. To make this more like blockchain, include some arbitrary bytes to be combined with the counter.

Time to Rehash

We have covered a lot of information about what hashes are and how to use them, including why you should never use MD5 unless you are teaching people that it is broken, and how to use them for more secure password storage and even crypto currency. Hashing is a powerful and important part of cryptography, and we will be seeing it again and again as we move forward.

Now that we have learned a bit about how to digest a document into a safely representative value, it’s time to back up a bit and revisit encryption.