Instead of just storing values, you might need to store a collection of (key, value) pairs to associate values with specific keys. This is known as a symbol table data type, which lets you find the associated value given just its key. Hashing offers an efficient alternative to manually searching through a collection from start to finish just to find a (key, value) pair. It outperforms the search algorithms I covered earlier. A symbol table can be efficient even while allowing for keys (and their values) to be removed. You give up the ability to retrieve all keys in a specific order, for example, in ascending order, but the resulting symbol table provides optimal performance for retrieving or storing the value associated with any individual key.
Let’s say you want to write a function print_month(month, year)
to print
a calendar for any month and year. For example, print_month('February',
2024)
would output the following:
   February 2024
Su Mo Tu We Th Fr Sa
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29
What information would you need? You need the weekday for the first day of
that month in that year (above it is Thursday), and you
need to know that February has 28 days (or 29 in a leap year, such as
2024). You might use a fixed array, month_length
, with values that
record the length of each month in days for the year:
month_length = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
January, the first month, has 31 days, so month_length[0] = 31. February is the next month, with 28 days, so the next value in the list is 28. The final value in month_length is 31, since December, the last month of the year, has 31 days.
Given what I have presented in this book so far, you might choose to store
a key_array
of the same length and search it for the month to determine
the corresponding value in month_length
; the following code prints February
has 28 days
:
key_array = ['January', 'February', 'March', 'April', 'May', 'June',
             'July', 'August', 'September', 'October', 'November', 'December']

idx = key_array.index('February')
print('February has', month_length[idx], 'days')
While the preceding code snippet will work, in the worst case, the key you are
looking for is either the last one in the list (December) or an invalid
month name, which means you have to inspect all values in the array. This
means the time to locate the associated value for a key is directly
proportional to the number of keys stored. If there were hundreds of
thousands of (key, value) pairs, this approach would rapidly become so
inefficient as to be unusable. Given this starting point, you should try to
implement print_month()
; for the record, Listing 3-1 contains
my implementation, which uses the datetime
and calendar
Python modules.
from datetime import date
import calendar

def print_month(month, year):
    idx = key_array.index(month)
    day = 1
    wd = date(year, idx + 1, day).weekday()
    wd = (wd + 1) % 7
    end = month_length[idx]
    if calendar.isleap(year) and idx == 1:
        end += 1

    print('{} {}'.format(month, year).center(20))
    print('Su Mo Tu We Th Fr Sa')
    print('   ' * wd, end='')
    while day <= end:
        print('{:2d} '.format(day), end='')
        wd = (wd + 1) % 7
        day += 1
        if wd == 0:
            print()
    print()
Find index to use for month_length
, an integer from 0 to 11.
Returns weekday for first day of given month, using 0 for Monday. Note date()
uses idx + 1
since its month argument must be an integer from 1 to 12.
Adjust to have Sunday be the weekday with a value of 0, instead of Monday.
Determine length of the month corresponding to input parameter.
In a leap year, February (index 1 when counting from 0) has 29 days.
Add spaces in first week to start day 1 at right indentation.
Advance day and weekday for next day.
Insert line break before Sunday to start a new line.
For the task of finding the associated value for a given key in a
collection of N (key, value) pairs, I judge efficiency by counting the
number of times any array index is accessed. A function searching for a
string in key_array
could inspect up to N array index positions, so that
function’s performance is O(N).
Python provides a list
data structure instead of an array. While lists can dynamically grow in size, in this chapter, I continue to use the term array since I do not take advantage of this capability.
Python has a built-in dict
type (an abbreviation of dictionary) that
associates values with keys. In the following, days_in_month
is a dict
that
associates an integer value (representing the length of that month) with
a string key containing the capitalized English name:
days_in_month = {
    'January'   : 31, 'February' : 28, 'March'    : 31, 'April'   : 30,
    'May'       : 31, 'June'     : 30, 'July'     : 31, 'August'  : 31,
    'September' : 30, 'October'  : 31, 'November' : 30, 'December': 31
}
The following code prints April has 30 days
:
print('April has', days_in_month['April'], 'days')
The dict
type can locate a key with an average performance of O(1
),
independent of the number of (key, value) pairs it stores. This is an
amazing accomplishment, like a magician pulling a rabbit from a hat! In
Chapter 8, I provide more details about dict
. For now, let’s see how it
works. The following code provides some mathematical intuition as to how
this happens. The key idea is to turn a string into a number.
Try this: consider the letter 'a'
to be the value 0, 'b'
to be 1,
and so on through 'z'
= 25. You can then consider 'June'
to be a number
in base 26, which represents the base 10 value j × 17,576 + u × 676 + n × 26 + e = 172,046.2 This computation can also be written as 26 × (26 × (26 × j + u) + n) + e, which reflects the structure of the base26() function shown in Listing 3-2.
def base26(w):
    val = 0
    for ch in w.lower():
        next_digit = ord(ch) - ord('a')
        val = 26 * val + next_digit
    return val
base26()
uses the ord()
function to convert a single character (such as
'a'
) into its integer ASCII representation.3 ASCII codes are ordered alphabetically, so ord('a')
= 97, ord('e')
= 101 and
ord('z')
= 122. To find the value associated with 'e'
, simply compute
ord('e') – ord('a')
to determine that 'e'
represents the value 4.
When computing base26()
values for a string, the resulting numbers
quickly grow large: 'June'
computes to the value 172,046, while
'January'
computes to 2,786,534,658.
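Here is a quick check you can run yourself, assuming the base26() function from Listing 3-2 is available:

```python
# Sanity check of the values quoted above, using base26() from Listing 3-2.
print(base26('June'))     # prints 172046
print(base26('January'))  # prints 2786534658
```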
How can these numbers be cut down to a more manageable size? You might know
about the modulo operator %
provided by most programming
languages. This operator returns the integer remainder when dividing a
large integer (i.e., the base26(month)
computation) by a much smaller
integer.
You have likely used modulo in real life without knowing it. For example, if it is currently 11:00 a.m., what time of day will it be in 50 hours? You know in 24 hours it will once again be 11:00 a.m. (on the next day), and after 48 hours it will again be 11:00 a.m. (on the second day). This leaves only two hours to account for, which tells you that in 50 hours it will be 1:00 p.m. Mathematically speaking, 50 % 24 = 2. In other words, you cannot evenly divide 50 by 24 since you will have 2 left over. When N and M are positive integers, N % M is guaranteed to be an integer from 0 to M – 1.
With some experimentation, I discovered that base26(m)
% 34 computes
a different integer for each of the twelve English month names; for
example, base26('August')
computes to 9,258,983, and you can confirm
that 9,258,983 % 34 = 1. If you create a single array containing 34
values, as shown in Figure 3-1, then independently of the number of (key, value) pairs, you can determine the number of days in a month by computing its associated index into day_array
. This means that August has 31 days; a similar computation for February shows it has 28 days.
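The following sketch builds such an array; it assumes base26() from Listing 3-2 and the key_array and month_length lists defined earlier:

```python
# A sketch of the idea behind Figure 3-1. Positions in day_array that are
# not used by any month hold -1; each month's length is stored at the index
# computed by base26(name) % 34.
day_array = [-1] * 34
for name, length in zip(key_array, month_length):
    day_array[base26(name) % 34] = length

print(day_array[base26('August') % 34])    # prints 31
print(day_array[base26('February') % 34])  # prints 28
```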
Take a moment to reflect on what I have accomplished. Instead of iteratively searching for a key in an array, I can now perform a simple computation on the key itself to compute an index position that contains its associated value. The time it takes to perform this computation is independent of the number of keys stored. This is a major step forward!
Figure 3-1 shows day_array, where each unused position holds the value –1.

But this was a lot of work. I had to (a) craft a special formula to compute unique index positions, and (b) create an array containing 34 integers, only 12 of which are relevant (meaning that more than half of the values are wasted). In this example, N equals 12, and the amount of dedicated storage, M, equals 34.
Given any string s
, if the value in day_array
at index position base26(s)
% 34 is –1, you know s
is an invalid month. This is a good feature. You might be tempted to confirm that a given string, s
, is a valid month whenever day_array[base26(s)%34]
> 0, but this would be a mistake. For example, the string 'abbreviated'
computes to the same index position as 'March'
, which might mistakenly tell you that 'abbreviated'
is a valid month! This is a bad feature, but I now show how to overcome this issue.
base26()
is an example of a hash function that maps keys of arbitrary
size to a fixed-size hash code value, such as a 32-bit or 64-bit
integer. A 32-bit integer is a value from –2,147,483,648 to
2,147,483,647, while a 64-bit integer is a value from
–9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. As you can
see, a hash code can be negative.
Mathematicians have been studying hashing for decades, developing
computations to convert structured data into fixed-size integers.
Programmers can take advantage of their contributions since most
programming languages have built-in support for computing a hash code for
arbitrary data. Python provides such a hash()
function for immutable objects.
The only necessary property of a hash function is that two objects that are equal to each other must compute to the same hash code; it cannot simply be a random number. When hashing an immutable object (such as a string), the computed hash code is usually stored to reduce overall computation.
Hash functions do not have to compute a unique hash for each key; this is
too great a computational challenge (but see the section on perfect hashing
at the end of this chapter). Instead, the expression hash(key) % M
uses
the modulo operation to compute a value guaranteed to be an integer
from 0 to M – 1.
Table 3-1 lists the 64-bit hash()
value for a few string
keys and the corresponding hash code modulo expressions. Mathematically,
the probability that two keys have the exact same hash()
value is
vanishingly small.4 There are two potential hash code collisions; both smell and rose have a hash code of 6, while both name and would have a hash code of 10. You should expect that two different strings could have
the exact same hash code value.
key | hash(key) | hash(key) % 15
---|---|---
a | |
rose | |
by | |
any | |
other | |
name | |
would | |
smell | |
as | |
sweet | |
If I use hash(key) % M
to compute the hash code for key
, then M must be at least as large as the number of expected keys to make room to store all associated values.5
Currently, Java and Python 2 compute a predictable hash code for strings. In Python 3, the default hash()
code values for strings are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python as a cybersecurity measure. Specifically, if
a hacker can generate keys that produce specific hash code values (thus violating the uniform distribution of keys), the performance of the hashtable defined in this chapter degrades to O(N), leading to denial-of-service attacks.6
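To see hash codes like those in Table 3-1 for yourself, you can run something like the following; because of the salting just described, the actual values will differ every time you start Python:

```python
# Compute hash codes like those in Table 3-1. Python 3 salts string hashes,
# so the values change from one Python process to the next.
for w in 'a rose by any other name would smell as sweet'.split():
    print('{:10s} {:22d} {:3d}'.format(w, hash(w), hash(w) % 15))
```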
The following Entry
structure stores a (key, value) pair:
class Entry:
    def __init__(self, k, v):
        self.key = k
        self.value = v
Listing 3-3 defines a Hashtable
class that constructs a table
array capable of
storing up to M Entry
objects. Each of these M index positions in the
array is called a bucket. For our first attempt, either a bucket is
empty or it contains a single Entry
object.
class Hashtable:
    def __init__(self, M=10):
        self.table = [None] * M
        self.M = M

    def get(self, k):
        hc = hash(k) % self.M
        return self.table[hc].value if self.table[hc] else None

    def put(self, k, v):
        hc = hash(k) % self.M
        entry = self.table[hc]
        if entry:
            if entry.key == k:
                entry.value = v
            else:
                raise RuntimeError('Key Collision: {} and {}'.format(k, entry.key))
        else:
            self.table[hc] = Entry(k, v)
Allocate a table
to hold M
Entry
objects.
The get()
function locates the entry
associated with the hash code for k
and returns its value, if present.
The put()
function locates the entry
associated with the hash code for k
, if present, and overwrites its value; otherwise it stores a new entry.
A collision occurs when two different keys map to the same bucket identified by its hash code value.
You can use this hashtable as follows:
table = Hashtable(1000)
table.put('April', 30)
table.put('May', 31)
table.put('September', 30)
print(table.get('August'))     # Miss: should print None since not present
print(table.get('September'))  # Hit: should print 30
If everything works properly, three Entry
objects will be created for these three (key, value) pairs in an array that has room for 1,000 objects. The performance of put()
and get()
will be independent of the number of Entry
objects in the array, so each action can be considered to have constant time performance, or O(1
).
A miss occurs when get(key)
fails to find an Entry
in the bucket
identified by the hash code for key
. A hit occurs when get(key)
finds an Entry
whose key matches key
. These are both normal behaviors. However, there is still no strategy to resolve collisions when the hash codes for two (or more) keys compute to the same bucket. If you don’t resolve collisions, then a put()
could overwrite an existing Entry
for a different key, causing the Hashtable
to lose keys—this must be prevented.
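The following short example, which assumes the Hashtable from Listing 3-3, shows such a collision; small Python integers hash to themselves, so keys 5 and 10 both map to bucket 0 when M = 5:

```python
# With the Hashtable from Listing 3-3, two keys that map to the same bucket
# cannot both be stored. hash(5) % 5 == 0 and hash(10) % 5 == 0.
table = Hashtable(5)
table.put(5, 'five')
table.put(10, 'ten')   # raises RuntimeError: Key Collision: 10 and 5
```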
It may happen that two entries, e1
= (key1
, val1
) and e2
= (key2
, val2
), share the same hash code for their respective keys even though key1 is different from key2. In Table 3-1, both 'name'
and 'would'
have the same hash code of 10. Assume that e1
is put in the Hashtable
first. You will encounter a hash collision later when you try to put e2
in the same Hashtable
: this is because the identified bucket is nonempty with an Entry
whose key, key1
, is different from key2
, the key for e2
. If you can’t resolve these collisions, you won’t be able to store both e1
and e2
in the same Hashtable
.
Open addressing resolves hash collisions by probing (or searching) through alternative locations in table
when put()
encounters a collision. One common approach, called linear probing, has put()
incrementally check higher index positions in table
for the next available empty bucket if the one designated by the hash code contains a different entry; if the end of the array is reached without finding an empty bucket, put()
continues to search through table
starting from index position 0. This search is guaranteed to succeed because I make sure to always keep one bucket empty; an attempt to put
an entry into the last remaining empty bucket will fail with a runtime error signifying the hashtable is full.
You might wonder why this approach will even work. Since entries can be
inserted into an index position different from their hash(key)
, how will you find them later? First, you can see that entries are never removed from Hashtable
, only inserted. Second, as more Entry
objects are inserted into buckets in the table
structure, long runs of nonempty buckets appear in table
. An Entry, e1, could now be in a different location, anywhere from index hash(e1.key) % M up to, but not including, the next empty bucket found by searching to the right (wrapping around as necessary).
To demonstrate collision resolution, let’s now add five entries into a
Hashtable
, with M = 7 buckets in Figure 3-2
(which only shows their keys). Buckets shaded in gray are empty. Key 20 is inserted into table[6]
, since 20 % 7 = 6; similarly, 15 is inserted into table[1]
, and 5 is inserted into table[5]
.
Hashtable storage after adding five (key, value) entries

Adding an entry with key 26 causes a collision since table[5]
is
already occupied (by an entry whose key is 5, highlighted in
Figure 3-2), so linear probing next checks table[6]
, which is also unavailable, and so the entry is placed in the next available bucket, table[0]
(because open addressing wraps around when searching for the next available empty index position). Adding an entry with key 19 similarly causes a collision with every nonempty bucket, and the entry is finally placed in table[2]
.
Using the existing Hashtable
in Figure 3-2, I can put
one additional entry into the table (since I need to leave one bucket empty). Where would I place an entry with a key of 44? The hash code for its key is 44 % 7 = 2, which is occupied; searching for the next available
bucket places this entry in table[3]
. Since get()
and put()
use the same search strategy, this entry will eventually be found later when invoking get(44)
.
A chain of entries for hash code hc
in an open addressing Hashtable
is a sequence of consecutive Entry
objects in table
. This sequence starts at a given table[hc]
and extends to the right (wrapping around to the first index in table
as needed), up to but not including the next available unused table
index position. In Figure 3-2, the chain for hash code 5 has a length of 5 even though there are only three entries whose key has that hash code. Also, there are no keys with a hash code of 2, but the chain for hash code 2 has a length of 1 because of an earlier collision. The maximum length of any chain is M – 1 because one bucket is left empty.
Listing 3-4 shows the modifications to Hashtable
to
support open addressing; the Entry
class is unchanged. Hashtable
keeps a running count, N, of the number of entries it stores so it can ensure there is at least one unused bucket (though it does not need to keep track of where that empty bucket is!). This is critically important to ensure the while
loop in the get()
and put()
functions will eventually terminate.
Open addressing implementation of Hashtable

class Hashtable:
    def __init__(self, M=10):
        self.table = [None] * M
        self.M = M
        self.N = 0

    def get(self, k):
        hc = hash(k) % self.M
        while self.table[hc]:
            if self.table[hc].key == k:
                return self.table[hc].value
            hc = (hc + 1) % self.M
        return None

    def put(self, k, v):
        hc = hash(k) % self.M
        while self.table[hc]:
            if self.table[hc].key == k:
                self.table[hc].value = v
                return
            hc = (hc + 1) % self.M

        if self.N >= self.M - 1:
            raise RuntimeError('Table is Full.')

        self.table[hc] = Entry(k, v)
        self.N += 1
Start at the first bucket where an entry whose key is k
could be.
If found, return the value associated with k
.
Otherwise advance to the next bucket, wrapping around to 0 as needed.
If table[hc]
is empty, you know k
is not in table
.
If found, update the value associated with the Entry
for k
.
Once you’ve established that k
is not in table
, throw RuntimeError
if hc
is the last empty bucket.
Create new Entry
in table[hc]
and update N
, the count of keys.
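As a small check, the following sketch reproduces the layout of Figure 3-2 using the open addressing Hashtable from Listing 3-4; small integer keys hash to themselves, which makes the bucket positions predictable:

```python
# Reproduce the layout of Figure 3-2: 20 % 7 = 6, 15 % 7 = 1, 5 % 7 = 5,
# while 26 and 19 collide and wrap around to buckets 0 and 2.
table = Hashtable(7)
for key in [20, 15, 5, 26, 19]:
    table.put(key, key)

print([entry.key if entry else None for entry in table.table])
# prints [26, 15, 19, None, None, 5, 20]
```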
Given this new mechanism, what is the performance of get()
and put()
? Is the time to perform these operations independent of N, the number of keys in the Hashtable
? No!
Consider what happens in a worst case scenario. To an empty Hashtable
, add an entry, and assume it is placed in table[0]
. What if each of the next N – 1 requests to add an entry collides with this existing key in the table
storage? What is the total number of buckets searched? For the first request, it is 1; for the second request, 2, and for the third request, 3. As you can see, there will be k
buckets inspected for request number k
. The total number, then, is the sum of 1 + 2 + … + (N – 1), which conveniently equals N × (N – 1)/2. The average computation is this quantity divided by N, which leaves (N – 1)/2.
Based on what I presented in Chapter 2, I can classify (N – 1)/2 as O(N) as follows: First, this formula can be written as N/2 – ½. As the problem instance size increases, the dominant term is N/2. In the worst case, the average number of buckets searched is directly proportional to N (in this case, half of N).
I have just demonstrated that in the worst case, the average number of buckets inspected is O(N). I use algorithmic analysis here to estimate the count of buckets (not performance time or number of operations) since the runtime performance for get()
and put()
is directly related to the number of buckets inspected.
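You can provoke this worst case yourself with keys that are all multiples of M, again assuming the open addressing Hashtable from Listing 3-4:

```python
# Worst case sketch: with M = 10, keys that are multiples of 10 all hash to
# bucket 0, so the k-th put() must probe k buckets before finding a free one.
table = Hashtable(10)
for key in [0, 10, 20, 30]:
    table.put(key, key)

print([entry.key if entry else None for entry in table.table])
# prints [0, 10, 20, 30, None, None, None, None, None, None]
```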
This is a stalemate. You can increase the allocated table
size, M, to be much greater than the number of keys to be inserted, N, and reduce the number of collisions and the overall time to execute. But if you don’t plan correctly—that is, if N becomes closer and closer to M—you could quickly have horrible performance: even worse, the table
can still fill up, preventing more than M – 1 entries from being stored.
Table 3-2 compares the performance of inserting N entries into a Hashtable
of size M that uses open addressing to resolve conflicts. Observe the following:
For a small value of N, say 32, the average cost is nearly the same
(read the values from left to right in that row) regardless of the size, M, of the Hashtable
, because M is so much larger than N.
For any Hashtable
of size M, the average time to insert N keys
consistently increases as N increases (read the numbers from top to
bottom in any column).
If you look “diagonally southeast” in the table, you will see that the timing results are more or less the same. In other words, if you want the average performance of inserting 2 × N keys to be the same as inserting N keys, then you need to double the initial size of the Hashtable
.
N \ M | 8,192 | 16,384 | 32,768 | 65,536 | 131,072 | 262,144 | 524,288 | 1,048,576
---|---|---|---|---|---|---|---|---
32 | | | | | | | |
64 | | | | | | | |
128 | | | | | | | |
256 | | | | | | | |
512 | | | | | | | |
1,024 | | | | | | | |
2,048 | | | | | | | |
4,096 | | | | | | | |
8,192 | | | | | | | |
16,384 | | | | | | | |
This working implementation of the symbol table data type only becomes efficient when the available storage is noticeably larger than the number of keys to be inserted. If you misjudge the total number of keys in the symbol table, then performance will be surprisingly inefficient, sometimes 100 times slower. Worse, it is not really useful since I have not yet added the capability to remove a key from the symbol table. To overcome these limitations, I need to introduce the linked list data structure.
I now modify Hashtable
to store an array of linked lists in a technique known as separate chaining. Where linear probing looked for an empty bucket in which to place an entry, separate chaining stores an array of linked lists where each linked list contains entries whose keys compute to the same hash code. These linked lists are formed from LinkedEntry
nodes, shown in Listing 3-5.
LinkedEntry node structure to support linked list of (key, value) pairs

class LinkedEntry:
    def __init__(self, k, v, rest=None):
        self.key = k
        self.value = v
        self.next = rest
More accurately, table[idx]
refers to the first LinkedEntry
node in a linked list of nodes whose keys share the same hash code value of
idx
. For convenience, I still refer to the M buckets, which can be empty (in which case table[idx] = None
) or contain the first LinkedEntry
node in a linked list.
The chain concept, introduced earlier for open addressing, is more clearly visible with linked lists: the length of the linked list is the length of the chain.
I continue to use hash(key) % M
to compute the hash code for an entry to be inserted. All entries with the same hash code exist within the same linked list. In Figure 3-3, the table
array has seven buckets, and therefore seven potential linked lists; once again, only the keys are shown.
Hashtable linked list storage after adding five (key, value) pairs

Figure 3-3 adds the (key, value) pairs in the same order as Figure 3-2: the first three entries with keys 20, 15, and 5 create three linked lists, one for each of the buckets associated with the hash code for these keys. Buckets shaded in gray are empty. Adding an entry whose key is 26 causes a collision in table[5]
, so a new entry is prepended to the beginning of this linked list, resulting in a linked list with two entries. When adding the final entry whose key is 19, it is also prepended to the linked list in table[5]
, resulting in a linked list with three entries.
Pay attention to how the newest entry added to a bucket becomes the first entry in that bucket’s linked list. Once put()
determines that the entry’s key does not appear in the linked list, it is efficient to just prepend a new LinkedEntry
node at the beginning of the linked
list. When the next
reference is None
, that entry is the last
one in the linked list. Listing 3-6 shows the modifications to Hashtable
.
Separate chaining implementation of Hashtable

class Hashtable:
    def __init__(self, M=10):
        self.table = [None] * M
        self.M = M
        self.N = 0

    def get(self, k):
        hc = hash(k) % self.M
        entry = self.table[hc]
        while entry:
            if entry.key == k:
                return entry.value
            entry = entry.next
        return None

    def put(self, k, v):
        hc = hash(k) % self.M
        entry = self.table[hc]
        while entry:
            if entry.key == k:
                entry.value = v
                return
            entry = entry.next

        self.table[hc] = LinkedEntry(k, v, self.table[hc])
        self.N += 1
The get()
and put()
functions have nearly identical structures, with a while
loop that visits each LinkedEntry
node in the linked list. Starting with the first LinkedEntry
in table[hc]
, the while
loop visits each node exactly once until entry
is None
, which means all nodes were visited. As long as entry
is not None
, the entry.key
attribute is inspected to see if there is an exact match with the k
parameter. For get()
, the associated entry.value
is returned, while for put()
, this value is overwritten with v
. In both cases, in the worst case, the while
loop will terminate once all entries in the linked list
have been seen. When get()
exhausts the linked list, it returns None
since no entry was found; put()
prepends a new entry (adds it at the beginning of the list), since it didn’t exist before.
put(k, v)
only adds a new entry to a linked list after checking that none of the existing entries in the linked list have k
as its key.
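Here is a short sketch that repeats the Figure 3-3 example with the separate chaining Hashtable from Listing 3-6; integer keys hash to themselves, so the chain in table[5] is predictable:

```python
# Keys 5, 26, and 19 all land in the linked list at table[5]; the most
# recently added key appears first because put() prepends new entries.
table = Hashtable(7)
for key in [20, 15, 5, 26, 19]:
    table.put(key, key)

entry = table.table[5]
while entry:
    print(entry.key, end=' ')   # prints 19 26 5
    entry = entry.next
print()
print(table.get(19))            # prints 19
print(table.get(44))            # prints None (44 % 7 = 2 is an empty bucket)
```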
To evaluate the performance of this linked list structure, I need to count both the number of times a bucket is accessed as well as the number of times an entry node is inspected. It turns out that the performance of this linked list implementation is nearly identical to the open addressing implementation: the primary improvement is that there is no limit to the number of entries you can store in a linked list implementation. However, as I will discuss later in this chapter, if the number of entries, N, far exceeds the number of buckets, M, then the performance will degrade significantly. You also must consider that the memory requirement for storing the next references in the linked lists is twice as much as for open addressing.
Linked lists are versatile and dynamic data structures that can be extended, shortened, and spliced together efficiently because they do not rely on a fixed-size block of sequential memory. You can’t remove an index position from an array, but you can remove a node from a linked list.
Let’s break it down into two cases using a linked list containing three
entries whose key values are shown in Figure 3-4. Let’s say you want to delete the entry whose key is 19. This is the first
node in the linked list. To remove this entry, simply set the value of
first
to be first.next
. The modified linked list now has two entries, starting with 26.
To delete any other entry
(say, the one whose key is 26), look through the list, as shown in Figure 3-5. Stop when you find an entry with that key value, but keep a reference to the prior node, prev
, during your search. Now set the value of prev.next
to be entry.next
. This cuts out this middle node and results in a linked list containing one fewer node. Note that this will also
handle the case when entry
is the final entry in the linked list, since then prev.next
is set to None
.
In both cases, the memory for the removed node is reclaimed. Listing 3-7 contains the remove(k)
method in the Hashtable
linked list implementation that removes the entry with the given key, should it exist.
remove() function for separate chaining implementation of Hashtable

def remove(self, k):
    hc = hash(k) % self.M
    entry = self.table[hc]
    prev = None
    while entry:
        if entry.key == k:
            if prev:
                prev.next = entry.next
            else:
                self.table[hc] = entry.next
            self.N -= 1
            return entry.value
        prev, entry = entry, entry.next
    return None
self.table[hc]
refers to the first entry in the linked list associated with hash code hc
.
Continue iterating as long as there are entries in this linked list.
Locate entry to remove by comparing target k
with the key
field for entry
.
When found, if there is a prev
reference, link around entry
, which removes it from the list.
If there is no prev
reference, then entry
is first. Set the linked list at self.table[hc]
to point to the second node in the linked list.
Decrement count of entries, N. It is common to return the value associated with the entry being removed.
If key is not found, continue iteration by setting prev
to entry
and advancing entry
to the next entry.
By convention, the remove(k)
function returns the value that had been associated with k
or None
if k
was not present.
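A brief usage sketch, assuming remove() from Listing 3-7 has been added to the separate chaining Hashtable:

```python
table = Hashtable(1000)
table.put('April', 30)
table.put('May', 31)

print(table.remove('April'))   # prints 30, the value that had been associated
print(table.get('April'))      # prints None, since the entry is gone
print(table.remove('April'))   # prints None, since 'April' is no longer present
```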
I now have two different structures that provide a symbol table data type to store (key, value) pairs. The linked list implementation has the added benefit of allowing (key, value) pairs to be removed, so you must choose that structure if you need this functionality. If you only add entries to a symbol table, however, you still need to evaluate the efficiency of these two approaches.
Let’s first evaluate the storage requirements to store N (key, value)
pairs. Both approaches create an array of size M to hold entries. However, in the linked list approach, N can grow to be as large as necessary; for open addressing, N must be strictly smaller than M, so
you must be sure to choose a large enough M in advance. The size of the
memory requirements for table
is directly proportional to M.
Ultimately, there will be N entries inserted into the symbol table. Each Entry
in open addressing stores only the (key, value) pair, while the LinkedEntry
for the linked list approach stores an additional next
reference for each entry. Since each reference is a fixed memory size, the extra storage is directly proportional to N.
Open addressing requires storage proportional to both M and N, but since N < M, I can simply say that the storage is O(M).
Separate chaining requires storage proportional to both M and N, but since there are no restrictions on N, the storage is O(M + N).
To evaluate runtime performance, the key operation to count is the number of times an entry is inspected. Let’s start with the worst case, which occurs when the computed hash code for all keys is the same. In the linked list implementation, the table
array contains M – 1 unused index positions, and a single linked list contains all N (key, value) pairs. The key to be searched might be the final entry in the linked list, so the get()
performance in the worst case would be directly proportional to N. The situation in open addressing is similar: there would be N consecutive entries within the M-sized array, and the one you are looking for is the last one. I can safely say that regardless of implementation choice, in the worst case, get()
is O(N).
This might seem like a deal-breaker, but it turns out that the mathematical hash functions do a really good job of distributing the hash code values for keys; as M increases, the probability of collisions decreases. Table 3-3 describes a head-to-head comparison of these two approaches by inserting N = 321,129 words from an English dictionary into a hashtable whose size, M, varies from N/2 to 2 × N. It also includes results for M = 20 × N (first row) and smaller values of M (the last five rows).
Table 3-3 shows two pieces of information for each (M, N) pair:
The average length of each nonempty chain in the Hashtable
. This
concept is applicable whether open addressing or linked lists are used.
The size of the maximum chain in the Hashtable
. If the open addressing Hashtable
becomes too congested—or linked lists become excessively long for certain hash codes—the runtime performance suffers.
 | Linked list | Linked list | Open addressing | Open addressing
M | Avg. chain len | Max chain len | Avg. chain len | Max chain len
---|---|---|---|---
6,422,580 | | | |
… | | | |
642,258 | | | |
610,145 | | | |
579,637 | | | |
550,655 | | | |
523,122 | | | |
496,965 | | | |
472,116 | | | |
448,510 | | | |
426,084 | | | |
404,779 | | | |
384,540 | | | |
365,313 | | | |
347,047 | | | |
329,694 | | | |
313,209 | | | |
… | | | |
187,925 | | | |
112,755 | | | |
67,653 | | | |
40,591 | | | |
24,354 | | | |
The values in the table all increase as the size of M decreases; this
occurs because smaller Hashtable
sizes lead to more collisions,
producing longer chains. As you can see, however, open addressing degrades much more quickly, especially when you consider that there are some hash codes whose chain extends to hundreds of entries to be inspected. Worse, once M is smaller than N, it becomes impossible to use open addressing (shown in the table as Fail
). In contrast, the statistics for the linked list implementation appear to be almost immune from this situation. If the size, M, of a hashtable is much larger than N—for example, twice as large—then the average chain length is very close to 1, and even the maximum chain length is quite small. However, M has to be decided in advance, and if you use open addressing, you will run out of room once N = M – 1.
Things are much more promising with separate chaining. As you can see in Table 3-3, even when N is more than ten times the size of the hashtable, M, linked lists can grow to accommodate all of the entries, and the performance doesn't suffer nearly as much as open addressing does as N gets closer and closer to M. This is evident from the maximum chain length of the linked lists shown in Table 3-3.
These numbers provide the information to develop a strategy to ensure the efficiency of a hashtable whose initial size is M. The performance of a hashtable can be measured by how “full” it is—which can be determined by dividing N by M. Mathematicians have even defined the term alpha to represent the ratio N/M; computer scientists refer to alpha as the load factor of a hashtable.
For separate chaining, alpha represents the average number of keys in each linked list and can be larger than 1, limited only by the available memory.
For open addressing, alpha is the percentage of buckets that are occupied; its highest value is (M – 1)/M, so it must be smaller than 1.
Years of research have identified that hashtables become increasingly inefficient once the load factor is higher than 0.75—in other words, once an open addressing hashtable is ¾ full.7 For separate chaining hashtables, the concept still applies, even though they do not “fill up” in the same way.
Figure 3-6 plots the average chain size (shown in
diamonds using the left Y-axis) and maximum chain size (shown in squares using the right Y-axis) after inserting N = 321,129 words into a Hashtable
of size M (found on the X-axis). This graph effectively shows how you can compute a suitable M value to ensure a desired average (or maximum) chain size if you know the number of keys, N, to be inserted.
If the hashtable could only grow larger—that is, increase its M value—then the load factor would reduce and the hashtable would be efficient again. I next show how to accomplish this, with a little bit of effort.
The DynamicHashtable
in Listing 3-8 uses a
load_factor
of 0.75 to set a threshold
target.
load_factor and threshold when creating DynamicHashtable

class DynamicHashtable:
    def __init__(self, M=10):
        self.table = [None] * M
        self.M = M
        self.N = 0

        self.load_factor = 0.75
        self.threshold = min(M * self.load_factor, M - 1)
What if I simply doubled the size of the storage array, using a resize strategy known as geometric resizing? More precisely, double the array size and add 1 when resizing.8 Once the number of (key, value) pairs is larger than or equal to threshold
, the table
storage array needs to grow in size to remain efficient. The revised put()
method for separate chaining is shown in Listing 3-9.
put() method initiates resize()

def put(self, k, v):
    hc = hash(k) % self.M
    entry = self.table[hc]
    while entry:
        if entry.key == k:
            entry.value = v
            return
        entry = entry.next

    self.table[hc] = LinkedEntry(k, v, self.table[hc])
    self.N += 1

    if self.N >= self.threshold:
        self.resize(2 * self.M + 1)
To increase storage in most programming languages, you need to allocate a new array and copy over all of the original entries from the first array into the second array, as shown in Figure 3-7. This figure shows the result after resizing the array storage for the array of linked lists as well as open addressing.
You should first observe that this copy operation will take time that is directly proportional to M—the greater the size of the hashtable storage array, the more elements need to be copied—so this operation can be classified as O(M). Simply copying the entries won’t actually work, however, since some entries—such as the ones with 19 and 26 as keys—will no longer be findable.
If you went to look for key 19, its hash code would now be 19 % 15 = 4,
and that bucket is empty in both structures, indicating that no entry with a key of 19 exists in the hashtable. In the former open addressing
Hashtable
with size M = 7, key 19 had been placed in bucket 2
because linear probing wrapped around the end of the old array (of size
7) to the beginning when it was inserted. Now that the array has 15
elements, the wraparound doesn’t happen, so this key can no longer be
found.
By pure coincidence, some entries are still findable, and these are highlighted in Figure 3-7. In open addressing, the entry with key 20 is not in table[5]
, but linear probing will find it in table[6]
; similarly, the entry with key 15 is not in table[0]
, but linear probing will find it in table[1]
. The entry with key 5 is findable in both open addressing and separate chaining because its hash code remains the same.
What can I do to avoid losing keys? The proper solution, shown in Listing 3-10, is to create a new temporary Hashtable
with twice the original storage (technically,
2M + 1) and rehash all entries into the new Hashtable
. In other words, for each of the (k
, v
) entries in the original Hashtable
, call put(k,v)
on the new temporary Hashtable
. Doing so will guarantee these entries will remain findable. Then, with a nifty programming trick, the underlying storage array from temp
is stolen and used as is, whether for separate chaining or for open addressing.
def resize(self, new_size):
    temp = DynamicHashtable(new_size)
    for n in self.table:
        while n:
            temp.put(n.key, n.value)
            n = n.next

    self.table = temp.table
    self.M = temp.M
    self.threshold = self.load_factor * self.M
This code is nearly identical for resizing with open addressing. I now
have a strategy for dynamically increasing the size of a
Hashtable
. Figure 3-8 contains the proper resized hashtables for the open addressing example in Figure 3-2 and the separate chaining example in Figure 3-3.
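Assuming you assemble a DynamicHashtable from Listings 3-8, 3-9, and 3-10 (together with the separate chaining get() from Listing 3-6), a small example shows the table growing while its keys remain findable:

```python
# With M = 4 the threshold is min(4 * 0.75, 3) = 3, so the third put()
# triggers a resize to 2*4 + 1 = 9 buckets.
table = DynamicHashtable(4)
for month in ['April', 'May', 'September', 'June']:
    table.put(month, days_in_month[month])

print(table.M)                 # prints 9
print(table.get('September'))  # prints 30: still findable after rehashing
```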
How well does the resizing logic perform? Let’s conduct an experiment running 25 repeated trials using different M values to measure:
The time it takes to add N = 321,129 keys into a Hashtable
with
initial size M that doubles in size when threshold
is exceeded.
The time to locate all N of these words once all keys are inserted.
For separate chaining and open addressing hashtables, Table 3-4 compares the build times and access times for dynamically resizable hashtables starting with different initial size, M, ranging from 625 to 640,000. This table also shows the performance of nongrowing hashtables with an initial size of M = 428,172. This will result in a fair comparison, since N = 321,129, or 428,172 × 0.75.
 | Linked list | Linked list | Open addressing | Open addressing
M | Build time | Access time | Build time | Access time
---|---|---|---|---
625 | | | |
1,250 | | | |
2,500 | | | |
5,000 | | | |
10,000 | | | |
20,000 | | | |
40,000 | | | |
80,000 | | | |
160,000 | | | |
320,000 | | | |
640,000 | | | |
… | | | |
Fixed | | | |
The last row presents the ideal case, where the load factor of the
Hashtable
does not exceed 0.75. You should expect open addressing to
require more build time, because collisions on one bucket inevitably affect other buckets, whereas separate chaining localizes all collisions within the same bucket.
The other remaining rows show that you don’t have to magically predict the initial M to use, since the access time (to inspect all 321,129 keys) is essentially the same.
In the worst case, both put()
and get()
for Hashtable
are O(N). As I have explained before, if the hash code for each key computes to exactly the same bucket (for both separate chaining and open addressing), the time to complete each operation is directly proportional to N, the number of keys in the Hashtable
.
However, under the commonly agreed-upon assumption of simple uniform hashing, hash functions will uniformly distribute keys into the buckets of the hashtable: each key has an equal probability of being placed in any bucket. From this assumption, mathematicians have proved that the average length of each chain is N/M. The takeaway is that you can rely on the experts who have developed the Python hash()
function.
Since you can always guarantee that N < M when hashtables can grow, I can say for certain that the average quantity N/M is a constant, O(1
), and is independent of N.
A search can either hit (the key is in the hashtable) or miss (it is not). In open addressing (assuming uniform key distribution), on a hit the average number of buckets to inspect is (1 + 1/(1 – alpha))/2. This works out to 2.5 when alpha = 0.75. For a search miss, the result is (1 + 1/(1 – alpha)²)/2. This works out to 8.5 under the same assumption.
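These two formulas are easy to evaluate yourself:

```python
# Evaluate the two open addressing estimates for a few load factors.
def expected_hit(alpha):
    return (1 + 1/(1 - alpha)) / 2

def expected_miss(alpha):
    return (1 + 1/(1 - alpha)**2) / 2

for alpha in [0.25, 0.5, 0.75]:
    print(alpha, expected_hit(alpha), expected_miss(alpha))
# at alpha = 0.75 this prints 2.5 and 8.5
```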
I need to account for the extra costs of resizing the hashtable, working under the assumption that the threshold load factor is 0.75 and that the underlying storage doubles in size using geometric resizing. To start things off, let’s assume M is 1,023 and that N is much larger than M: I’ll use the 321,129-word English dictionary again. I need to count the number of times each key is inserted into a hashtable (including the temporary one in resize
).
The first resize is triggered when the 768th key is added (since 768 is ≥ 767.25 = 0.75 × 1,023), which grows the hashtable to have M = 1,023 × 2 + 1, or 2,047. During this resize, 768 keys are rehashed and inserted. Note that immediately after the resize, the load factor is reduced in half to 768/2047, which is approximately 0.375.
When an additional 768 keys are inserted, all 1,536 keys are rehashed into a new hashtable of size M = 2,047 × 2 + 1, or 4,095. The hashtable is resized a third time when an additional 1,536 keys are inserted, which causes the existing 3,072 keys to be inserted into a hashtable of size M = 8,191.
To make sense of these numbers, Table 3-5 shows the resizing moments when the Nth word is inserted, together with the accumulated total number of times any key is inserted into the hashtable. During resize events, the final column shows that the average number of insertions (by dividing the total number of insertions by N) converges to 3. Even though resizing forces each key to be reinserted with a geometric resizing strategy, this table demonstrates you will never need more than three times as many insertions than without resizing.
Word | M | N | # insert | average
---|---|---|---|---
absinths | | | |
accumulatively | | | |
addressful | | | |
aladinist | | | |
anthoid | | | |
basirhinal | | | |
cincinnatian | | | |
flabella | | | |
peps | | | |
… | | | |
zyzzyvas | | | |
The key observation is that geometric resizing ensures that resize events occur significantly less frequently as the size of the table grows. In Table 3-5, once the final word is inserted into the hashtable—which is not a resize event—the average has dropped to 2.22, and it will not need to be resized again until another 72,087 keys are added. As you can see, this is almost 100 times less frequently than when it started (when a resize was triggered after just 768 were added).
The result of this last analysis is that the average cost of inserting all 321,129 entries into a dynamically resizing hashtable is no more than three times what it would cost if the hashtable had somehow been large enough to store all N keys. As you have seen in Chapter 2, this is just a multiplicative constant, which will not change the performance classification of the average case for put()
: it remains O(1
) even with extra work due to resizing.
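If you want to reproduce the arithmetic behind Table 3-5, the following sketch simulates the resize schedule; it counts insertions only, not actual hashing:

```python
# Start with M = 1,023, resize to 2*M + 1 whenever N reaches 0.75 * M, and
# count every insertion (including reinsertions during each resize) while
# adding 321,129 keys.
M, threshold = 1023, 0.75 * 1023
inserts = 0
for N in range(1, 321129 + 1):
    inserts += 1
    if N >= threshold:
        inserts += N              # every existing key is rehashed
        M = 2 * M + 1
        threshold = 0.75 * M

print(inserts, inserts / 321129)  # 713577 insertions, about 2.22 per key
```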
If you know the collection of N keys in advance, then you can use a technique known as perfect hashing to construct an optimal hashtable where the hash code for each key is a unique index location. Perfect hashing generates the Python code containing the hash function to use. This is an unexpected result, and there are perfect hash generators for numerous programming languages.
If you install the third-party Python library perfect-hash
, it can
generate a perfect_hash()
function from an input file containing the
desired keys.9 Listing 3-11 contains the
generated code using the words “a rose by any other name would smell as
sweet.”
G  = [0, 8, 1, 4, 7, 10, 2, 0, 9, 11, 1, 5]
S1 = [9, 4, 8, 6, 6]
S2 = [2, 10, 6, 3, 5]

def hash_f(key, T):
    return sum(T[i % 5] * ord(c) for i, c in enumerate(key)) % 12

def perfect_hash(key):
    return (G[hash_f(key, S1)] + G[hash_f(key, S2)]) % 12
The enumerate()
built-in Python function improves how you iterate over
lists when you also need positional information.
>>> for i, v in enumerate(['g', 't', 'h']):
...     print(i, v)
0 g
1 t
2 h
enumerate()
iterates over each value in a collection and additionally
returns the index position.
Recall from Figure 3-1 how I defined a list, day_array
, and a supporting base26()
hash function that worked with twelve months?
Perfect hashing pursues the same approach in a more elegant way, processing the N strings to create the lists G
, S1
, and S2
and a supporting hash_f()
function.
To compute the index location for the string 'a'
, you need two
intermediate results; recall that ord('a')
= 97:
hash_f('a', S1)
= sum
([ S1[0]
× 97 ]) % 12. Since S1[0]
= 9, this is the value (9 × 97) % 12 = 873 % 12 = 9.
hash_f('a', S2)
= sum
([ S2[0]
× 97 ]) % 12. Since S2[0]
= 2, this is the value (2 × 97) % 12 = 194 % 12 = 2.
The value returned by perfect_hash('a')
is (G[9] + G[2])
% 12 = (11 + 1) % 12 = 0. This means that the hash code for the string 'a'
is
0. Repeat this computation10 for the string 'by'
and you will find that:
hash_f('by', S1)
= (9 × 98 + 4 × 121) % 12 = 1,366 % 12 = 10.
hash_f('by', S2)
= (2 × 98 + 10 × 121) % 12 = 1,406 % 12 = 2.
perfect_hash('by')
is (G[10] + G[2])
% 12 = (1 + 1) % 12 = 2.
To summarize, the key 'a'
hashes to index location 0, and the key 'by'
hashes to index location 2. In fact, each of the words in “a rose by any other name would smell as sweet” is hashed to a different computed index location. It is sometimes truly amazing what mathematics can do!
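You can confirm this with the code from Listing 3-11:

```python
# Using G, S1, S2, and perfect_hash() from Listing 3-11: every word in the
# phrase maps to its own index position from 0 to 11, with no collisions.
words = 'a rose by any other name would smell as sweet'.split()
positions = [perfect_hash(w) for w in words]
print(positions)
print(len(set(positions)) == len(words))   # prints True
```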
Listing 3-12 contains the perfect_hash()
function for the
321,129 words in the sample dictionary. This function computes 0 for
the first English word, 'a'
, and 321,128 for the last English word,
'zyzzyvas'
. It is supported by a large list, G
, containing 667,596
values (not shown, obviously!) and two intermediate lists, S1
and S2
.
For the string 'by'
in this larger perfect hashtable, you can confirm
the following:
hash_f('by', S1)
= (394,429 × 98 + 442,829 × 121) % 667,596 = 92,236,351 % 667,596 = 108,103
hash_f('by', S2)
= (14,818 × 98 + 548,808 × 121) % 667,596 = 67,857,932 % 667,596 = 430,736
perfect_hash('by')
= (G[108,103
] + G[430,736
]) % 667,596 = (561,026 + 144,348) % 667,596 = 37,778
S1 = [394429, 442829, 389061, 136566, 537577, 558931, 481136, 337378,
      395026, 636436, 558331, 393947, 181052, 350962, 657918, 442256,
      656403, 479021, 184627, 409466, 359189, 548390, 241079, 140332]

S2 = [14818, 548808, 42870, 468503, 590735, 445137, 97305, 627438,
      8414, 453622, 218266, 510448, 76449, 521137, 259248, 371823,
      577752, 34220, 325274, 162792, 528708, 545719, 333385, 14216]

def hash_f(key, T):
    return sum(T[i % 24] * ord(c) for i, c in enumerate(key)) % 667596

def perfect_hash(key):
    return (G[hash_f(key, S1)] + G[hash_f(key, S2)]) % 667596
The computation in perfect_hash(key)
in Listing 3-12 produces a large sum that is reduced using % 667,596 to identify a unique location from the large G
list. As long as key
is a valid English word from the dictionary, perfect_hash(key)
uniquely identifies an index from 0 to 321,128.
If you inadvertently attempt to hash a key that is not an English word, a collision will occur: the word 'watered'
and the nonword
'not-a-word'
both hash to the index location 313,794. This is not an
issue for perfect hashing, since the programmer is responsible for ensuring that only valid keys are ever hashed.
Hashtables are designed for efficient get(k)
and put(k,v)
operations. It might also be useful to retrieve all entries in a hashtable, whether it uses open addressing or separate chaining.
Python generators are one of the best features of the language. Most programming languages force programmers to return the values in a collection using extra storage. In Chapter 2, I explained how range(0, 1000)
and range(0, 100000)
both use the same amount of memory while returning all integers in the respective ranges; generators make this possible.
The following generator function produces the integers in the range from 0 to n
that do not include the given digit
:
def avoid_digit(n, digit):
    sd = str(digit)
    for i in range(n):
        if not sd in str(i):
            yield i
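For example:

```python
# The generator produces values lazily; list() collects them for display.
print(list(avoid_digit(20, 1)))   # prints [0, 2, 3, 4, 5, 6, 7, 8, 9]
```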
To give an object this same capability in Python, a class can
provide an __iter__()
method that allows the caller to use the for v in
object
idiom.
The two __iter__()
implementations in Listing 3-13 are designed for separate chaining and open addressing hashtables.
# Iterator for Open Addressing Hashtable
def __iter__(self):
    for entry in self.table:
        if entry:
            yield (entry.key, entry.value)

# Iterator for Separate Chaining Hashtable
def __iter__(self):
    for entry in self.table:
        while entry:
            yield (entry.key, entry.value)
            entry = entry.next
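Once __iter__() is defined, you can iterate over a hashtable directly; this sketch assumes the separate chaining Hashtable from Listing 3-6:

```python
# Traverse a Hashtable in a for loop; the order depends on how the string
# keys happen to hash in this Python process.
table = Hashtable(13)
for w in 'a rose by any other name would smell as sweet'.split():
    table.put(w, 1)

for key, value in table:
    print(key, value)
```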
To demonstrate how these iterators work, construct two hashtables with M equal to 13—one using open addressing and one using separate chaining—and a third hashtable using perfect hashing. After inserting the words in the string “a rose by any other name would smell as sweet,” Table 3-6 shows the words generated by the respective hashtables.
Open addressing | Separate chaining | Perfect hash
---|---|---
a | sweet | a
by | any | any
any | a | as
name | would | by
other | smell | name
would | other | other
smell | as | rose
as | name | smell
sweet | by | sweet
rose | rose | would
The words returned for the open addressing and separate chaining hashtables appear to be in a random order; of course, this isn’t randomness but based solely on the way keys are hashed. If you execute the sample code that produces this table, your ordering for open addressing and separate chaining will likely be different, because the hash()
code values for strings in Python 3 are unpredictable.
One nice feature of the perfect-hash
library is that the index positions computed by perfect_hash(key)
are based on the order of the words used when generating the perfect hashing code. Simply use a list of strings that are already in sorted order, and the entries will be stored in sorted order, and the iterator will yield the pairs in the same order.
Chapter 8 contains more details of the Python dict
type. Instead of
implementing a symbol table from scratch, as I have done in this chapter, you should always use dict
because it is a built-in type and will be significantly more efficient than the code I provide.
In this chapter, I introduced several key concepts:
The linked list data structure can store small collections of items efficiently, allowing for dynamic insertion and removal.
Symbol tables use a storage array with M buckets to store entries. There needs to be a strategy to resolve collisions when two or more keys hash to the same bucket.
Open addressing relies on distributing entries within an array to reduce the number of collisions, but this only works efficiently when the size of the storage array, M, is more than twice as large as the number of stored entries, N. See the challenge exercises for two approaches that support removing keys.
Separate chaining uses linked lists to store entries whose keys hash to the same bucket. This approach makes it easier to support a remove operation for keys.
Designing hash functions is hard—use the predefined ones designed by Python’s designers.
Geometric resizing ensures a symbol table remains efficient by decreasing the frequency of future resize events.
Perfect hashing can be used to construct hash functions that avoid
collisions by computing a unique bucket index location for a fixed number of keys; the hash function is often more computationally expensive than a default hash()
function.
Does open addressing improve if you use a different strategy for resolving collisions? Instead of using a linear probe with a delta of 1 (wrapping around), create a hashtable whose size is a power of 2, and use a probe sequence that explores additional index positions using deltas that are triangle numbers, that is, the integers 1, 3, 6, 10, 15, 21, and so on; you will still have to wrap around. The nth triangle number is the sum of the integers from 1 to n, represented by the formula n × (n + 1)/2.
Is the overall performance better than linear probing?
Conduct an experiment where you fill a Hashtable
of size 524,288 with the first 160,564 English words (for a utilization of 30%), and then measure the time it takes to search for all 321,129 words. Compare this performance against an open addressing Hashtable
and the separate chaining variation.
Is it worth sorting the keys in the linked lists of a separate chaining hashtable? Construct a Hashtable
variant that sorts the (key, value) pairs in each linked list in ascending order by key.
Conduct an experiment where you measure the time it takes to construct the initial Hashtable
of size 524,287 (a prime number) with the first 160,564 English words (for a utilization of 30%) in reverse order. As you will see, the worst case for this variation occurs when the keys are put()
into the hashtable in increasing value. Compare this performance against an open addressing Hashtable
and the regular separate chaining Hashtable
.
Now measure the time it takes to search the first 160,564 English words as keys. As you might expect, this is the best case, since all these words are in the hashtable. Compare this performance against the array-based Hashtable
and the separate chaining variation. Next search for the remaining 160,565 words in the back half of the English dictionary. This provides a glimpse of the worst case since finding these words in a linked list will always require each linked list to be fully explored. Once again, compare this performance against the open addressing Hashtable
and the regular unordered linked list variation.
How sensitive are these results to the chosen initial size, 524,287? For example, compare with 214,129 (a utilization of 75%) and 999,983 (a utilization of 16%).
To see the dangers in predictable hash codes, consider the
ValueBadHash
code in Listing 3-14. The objects for this Python class hash to just four different values (0 to 3). This class overrides the default behavior of hash()
as well as __eq__()
, so these objects can be used as a key when invoking put(key, v)
.
ValueBadHash has a horrible hash() function

class ValueBadHash:
    def __init__(self, v):
        self.v = v

    def __hash__(self):
        return hash(self.v) % 4

    def __eq__(self, other):
        return (self.__class__ == other.__class__ and
                self.v == other.v)
Construct a separate chaining Hashtable(50,000)
and invoke
put(ValueBadHash(w), 1)
for the first 20,000 English words in the
dictionary. Next, create a regular separate chaining Hashtable
, and invoke put(w, 1)
for these same words. Generate statistics on:
The average chain length of a bucket
The maximum chain length of any bucket in Hashtable
Be prepared for the code execution to take a long time! Explain these statistics.
Having a prime number for the number of buckets, M, is helpful in practice because every key that shares a common factor with M will be hashed to a bucket that is a multiple of this factor. For example, if M = 632 = 8 × 79 and the entry to be inserted has a key of 2,133 = 27 × 79, the hash code is 2,133 % 632 = 237 and 237 = 79 × 3. The problem is that the performance of a hashtable assumes uniform distribution of keys, which is violated when some keys are predisposed to be placed in certain buckets.
To demonstrate the impact of M, form a key from the base26
representation of each of the words in the 321,129-word English dictionary. In the range from 428,880 to 428,980 containing 101 potential values for M, construct fixed-size hashtables (using both open addressing and separate chaining) and produce a table showing the average and maximum chain length size. Are there any values of M in this range that are particularly bad? Can you find out what these values all have in common? Armed with this knowledge, scan the next 10,000 higher values of M (up to 438,980) to try to find one whose maximum chain length is almost ten times as bad.
When using an open addressing hashtable, there is no easy way to support remove(key)
because removing an entry from a bucket may break an existing chain created using linear probing. Imagine you have a Hashtable
with M = 5 and you hashed entries with key values of 0, 5, and then 10. The resulting table
array would contain the following keys [0
, 5
, 10
, None
, None
], since each collision would be resolved using linear probing. A flawed remove(5)
operation would simply remove the entry with key 5 from table
, resulting in an array storing [0
, None
, 10
, None
, None
]. However, the entry with key 10 is no longer findable because the chain at index location 0 was broken.
One strategy is to add a Boolean field to the Entry
that records whether the entry has been deleted or not. You have to modify get()
and put()
accordingly. In addition, the resize event needs to be adjusted as well, since entries marked as deleted will not need to be inserted into the new hashtable. Don’t forget to update the __iter__()
method to skip deleted entries.
Consider adding logic to initiate a shrink event when more than half of
the entries in the hashtable are marked for deletion. Run some trials to compare the performance of separate chaining vs. open addressing now that remove()
is available for open addressing.
Resizing a hashtable incurs a steep penalty since all N values need to be rehashed into the new structure. In incremental resizing, instead, the resize event allocates a new array, newtable
(with size 2M + 1), but the original hashtable remains. A get(key)
request first searches for key in newtable
before searching through the original table
. A put(key,value)
request inserts a new entry into newtable
: after each insertion, delta
elements from the old table
are rehashed and inserted
into newtable
. Once all elements are removed from the original table
, it can be deleted.
Implement this approach with a separate chaining hashtable, and experiment with different values of delta
. In particular, what is the smallest value of delta
that will ensure the old table is completely emptied before the next resize event? Is delta
a constant, or is it based on M or even N?
This approach will reduce the total cost incurred at any time by put()
; perform empirical measurements using an example similar to
Table 3-5, but now evaluating the runtime performance cost of the most expensive operation.
Should the number of (key, value) pairs, N, in a hashtable become less than ¼ of M, the table
storage array could shrink to reduce
unneeded storage space. This shrink event can be triggered by remove()
.
Modify either the separate chaining or open addressing hashtable to realize this capability, and run some empirical trials to determine if it is worthwhile.
Use a symbol table to find the element in a list that is duplicated the most number of times. In the case of a tie, any value that is repeated “the most number of times” can be returned.
most_duplicated([1,2,3,4])
can return 1, 2, 3, or 4, while
most_duplicated([1,2,1,3])
must return 1.
An open addressing hashtable can remove entries by rehashing all remaining (key, value) pairs in the chain after the one that is to be removed. Add this capability to an open addressing hashtable, and run some trials to compare the performance of separate chaining versus open addressing with this revised remove capability.
1 Throughout this chapter, the notation (key, value) represents a pair of information considered as a single unit.
2 Note that 26³ = 17,576.
3 Originally based on the English alphabet, ASCII encodes 128 characters (typically those found on a typewriter keyboard) into seven-bit integers. Capitalization matters! ord('A')
= 65, for example, but ord('a')
= 97.
4 Java computes 32-bit hash function values, and sometimes two string keys have the exact same hash()
value; for example, both misused and horsemints hash to 1,069,518,484.
5 In Java, if hash(key)
is negative, then the %
operator will return a negative number, so the formula must be (key.hashCode() & 0x7fffffff) % M
to first convert the negative hash
computation into a positive integer before computing modulo M.
6 See https://oreil.ly/C4V0W to learn more and try the challenge exercise at the end of this chapter.
7 The Python dict
type uses ⅔ as the threshold.
8 It is commonly observed that Hashtables
whose number of buckets is a prime number work really well (see challenge exercise at end of this chapter); here the size is odd, which can also be helpful.
9 Use the Python pip
installer like this: pip
install perfect-hash
.
10 ord('b')
= 98, and ord('y')
= 121.