Instead of just storing values, you might need to store a collection of (key, value) pairs to associate values with specific keys. This is known as a symbol table data type, which lets you find the associated value given just its key. Hashing offers an efficient alternative to manually searching through a collection from start to finish just to find a (key, value) pair. It outperforms the search algorithms I covered earlier. A symbol table can be efficient even while allowing for keys (and their values) to be removed. You give up the ability to retrieve all keys in a specific order, for example, in ascending order, but the resulting symbol table provides optimal performance for retrieving or storing the value associated with any individual key.
Let’s say you want to write a function print_month(month, year)
to print
a calendar for any month and year. For example, print_month('February',
2024)
would output the following:
   February 2024
Su Mo Tu We Th Fr Sa
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29
What information would you need? You need the weekday for the first day of
that month in that year (above it is Thursday), and you
need to know that February has 28 days (or 29 in a leap year, such as
2024). You might use a fixed array, month_length
, with values that
record the length of each month in days for the year:
month_length = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
January, the first month, has 31 days, so month_length[0] = 31. February is the next month, with 28 days, so the next value in the list is 28. The final value in month_length is 31, since December, the last month of the year, has 31 days.
Given what I have presented in this book so far, you might choose to store
a key_array
of the same length and search it for the month to determine
the corresponding value in month_length
; the following code prints February
has 28 days
:
key_array = ['January', 'February', 'March', 'April', 'May', 'June',
             'July', 'August', 'September', 'October', 'November', 'December']

idx = key_array.index('February')
print('February has', month_length[idx], 'days')
While the preceding code snippet will work, in the worst case, the key you are
looking for is either the last one in the list (December) or an invalid
month name, which means you have to inspect all values in the array. This
means the time to locate the associated value for a key is directly
proportional to the number of keys stored. If there were hundreds of
thousands of (key, value) pairs, this approach would rapidly become so
inefficient as to be unusable. Given this starting point, you should try to
implement print_month()
; for the record, Listing 3-1 contains
my implementation, which uses the datetime
and calendar
Python modules.
from datetime import date
import calendar

def print_month(month, year):
    idx = key_array.index(month)
    day = 1
    wd = date(year, idx + 1, day).weekday()
    wd = (wd + 1) % 7
    end = month_length[idx]
    if calendar.isleap(year) and idx == 1:
        end += 1

    print('{} {}'.format(month, year).center(20))
    print('Su Mo Tu We Th Fr Sa')
    print('   ' * wd, end='')
    while day <= end:
        print('{:2d} '.format(day), end='')
        wd = (wd + 1) % 7
        day += 1
        if wd == 0:
            print()
    print()
Find index to use for month_length
, an integer from 0 to 11.
Returns weekday for first day of given month, using 0 for Monday. Note date()
uses idx + 1
since its month argument must be an integer from 1 to 12.
Adjust to have Sunday be the weekday with a value of 0, instead of Monday.
Determine length of the month corresponding to input parameter.
In a leap year, February (index 1 when counting from 0) has 29 days.
Add spaces in first week to start day 1 at right indentation.
Advance day and weekday for next day.
Insert line break before Sunday to start a new line.
For the task of finding the associated value for a given key in a
collection of N (key, value) pairs, I judge efficiency by counting the
number of times any array index is accessed. A function searching for a
string in key_array
could inspect up to N array index positions, so that
function’s performance is O(N).
Python provides a list
data structure instead of an array. While lists can dynamically grow in size, in this chapter, I continue to use the term array since I do not take advantage of this capability.
Python has a built-in dict
type (an abbreviation of dictionary) that
associates values with keys. In the following, days_in_month
is a dict
that
associates an integer value (representing the length of that month) with
a string key containing the capitalized English name:
days_in_month = {
    'January'   : 31, 'February' : 28, 'March'    : 31, 'April'   : 30,
    'May'       : 31, 'June'     : 30, 'July'     : 31, 'August'  : 31,
    'September' : 30, 'October'  : 31, 'November' : 30, 'December': 31
}
The following code prints April has 30 days
:
print('April has', days_in_month['April'], 'days')
The dict
type can locate a key with an average performance of O(1
),
independent of the number of (key, value) pairs it stores. This is an
amazing accomplishment, like a magician pulling a rabbit from a hat! In
Chapter 8, I provide more details about dict
. For now, let’s see how it
works. The following code provides some mathematical intuition as to how
this happens. The key idea is to turn a string into a number.
Try this: consider the letter 'a'
to be the value 0, 'b'
to be 1,
and so on through 'z'
= 25. You can then consider 'June'
to be a number
in base 26, which represents the base 10 value j × 17,576 + u × 676 + n × 26 + e = 172,046.2 This computation can also be written as 26 × (26 × (26 × j + u) + n) + e, which reflects the structure of the base26() function shown in Listing 3-2.
def base26(w):
    val = 0
    for ch in w.lower():
        next_digit = ord(ch) - ord('a')
        val = 26 * val + next_digit
    return val
base26()
uses the ord()
function to convert a single character (such as
'a'
) into its integer ASCII representation.3 ASCII codes are ordered alphabetically, so ord('a')
= 97, ord('e')
= 101 and
ord('z')
= 122. To find the value associated with 'e'
, simply compute
ord('e') – ord('a')
to determine that 'e'
represents the value 4.
When computing base26()
values for a string, the resulting numbers
quickly grow large: 'June'
computes to the value 172,046, while
'January'
computes to 2,786,534,658.
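Here is a quick check you can run yourself, assuming the base26() function from Listing 3-2 is available:

```python
# Sanity check of the values quoted above, using base26() from Listing 3-2.
print(base26('June'))     # prints 172046
print(base26('January'))  # prints 2786534658
```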
How can these numbers be cut down to a more manageable size? You might know
about the modulo operator %
provided by most programming
languages. This operator returns the integer remainder when dividing a
large integer (i.e., the base26(month)
computation) by a much smaller
integer.
You have likely used modulo in real life without knowing it. For example, if it is currently 11:00 a.m., what time of day will it be in 50 hours? You know in 24 hours it will once again be 11:00 a.m. (on the next day), and after 48 hours it will again be 11:00 a.m. (on the second day). This leaves only two hours to account for, which tells you that in 50 hours it will be 1:00 p.m. Mathematically speaking, 50 % 24 = 2. In other words, you cannot evenly divide 50 by 24 since you will have 2 left over. When N and M are positive integers, N % M is guaranteed to be an integer from 0 to M – 1.
With some experimentation, I discovered that base26(m)
% 34 computes
a different integer for each of the twelve English month names; for
example, base26('August')
computes to 9,258,983, and you can confirm
that 9,258,983 % 34 = 1. If you create a single array containing 34
values, as shown in Figure 3-1, then independently of the number of (key, value) pairs, you can determine the number of days in a month by computing its associated index into day_array
. This means that August has 31 days; a similar computation for February shows it has 28 days.
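The following sketch builds such an array; it assumes base26() from Listing 3-2 and the key_array and month_length lists defined earlier:

```python
# A sketch of the idea behind Figure 3-1. Positions in day_array that are
# not used by any month hold -1; each month's length is stored at the index
# computed by base26(name) % 34.
day_array = [-1] * 34
for name, length in zip(key_array, month_length):
    day_array[base26(name) % 34] = length

print(day_array[base26('August') % 34])    # prints 31
print(day_array[base26('February') % 34])  # prints 28
```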
Take a moment to reflect on what I have accomplished. Instead of iteratively searching for a key in an array, I can now perform a simple computation on the key itself to compute an index position that contains its associated value. The time it takes to perform this computation is independent of the number of keys stored. This is a major step forward!
Figure 3-1 shows day_array, where each unused position holds the value –1.

But this was a lot of work. I had to (a) craft a special formula to compute unique index positions, and (b) create an array containing 34 integers, only 12 of which are relevant (meaning that more than half of the values are wasted). In this example, N equals 12, and the amount of dedicated storage, M, equals 34.
Given any string s
, if the value in day_array
at index position base26(s)
% 34 is –1, you know s
is an invalid month. This is a good feature. You might be tempted to confirm that a given string, s
, is a valid month whenever day_array[base26(s)%34]
> 0, but this would be a mistake. For example, the string 'abbreviated'
computes to the same index position as 'March'
, which might mistakenly tell you that 'abbreviated'
is a valid month! This is a bad feature, but I now show how to overcome this issue.
base26()
is an example of a hash function that maps keys of arbitrary
size to a fixed-size hash code value, such as a 32-bit or 64-bit
integer. A 32-bit integer is a value from –2,147,483,648 to
2,147,483,647, while a 64-bit integer is a value from
–9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. As you can
see, a hash code can be negative.
Mathematicians have been studying hashing for decades, developing
computations to convert structured data into fixed-size integers.
Programmers can take advantage of their contributions since most
programming languages have built-in support for computing a hash code for
arbitrary data. Python provides such a hash()
function for immutable objects.
The only necessary property of a hash function is that two objects that are equal to each other must compute to the same hash code; it cannot simply be a random number. When hashing an immutable object (such as a string), the computed hash code is usually stored to reduce overall computation.
Hash functions do not have to compute a unique hash for each key; this is
too great a computational challenge (but see the section on perfect hashing
at the end of this chapter). Instead, the expression hash(key) % M
uses
the modulo operation to compute a value guaranteed to be an integer
from 0 to M – 1.
Table 3-1 lists the 64-bit hash()
value for a few string
keys and the corresponding hash code modulo expressions. Mathematically,
the probability that two keys have the exact same hash()
value is
vanishingly small.4 There are two potential hash code collisions; both smell and rose have a hash code of 6, while both name and would have a hash code of 10. You should expect that two different strings could have
the exact same hash code value.
key | hash(key) | hash(key) % 15
---|---|---
a | |
rose | |
by | |
any | |
other | |
name | |
would | |
smell | |
as | |
sweet | |
If I use hash(key) % M
to compute the hash code for key
, then M must be at least as large as the number of expected keys to make room to store all associated values.5
Currently, Java and Python 2 compute a predictable hash code for strings. In Python 3, the default hash()
code values for strings are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python as a cybersecurity measure. Specifically, if
a hacker can generate keys that produce specific hash code values (thus violating the uniform distribution of keys), the performance of the hashtable defined in this chapter degrades to O(N), leading to denial-of-service attacks.6
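To see hash codes like those in Table 3-1 for yourself, you can run something like the following; because of the salting just described, the actual values will differ every time you start Python:

```python
# Compute hash codes like those in Table 3-1. Python 3 salts string hashes,
# so the values change from one Python process to the next.
for w in 'a rose by any other name would smell as sweet'.split():
    print('{:10s} {:22d} {:3d}'.format(w, hash(w), hash(w) % 15))
```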
The following Entry
structure stores a (key, value) pair:
class Entry:
    def __init__(self, k, v):
        self.key = k
        self.value = v
Listing 3-3 defines a Hashtable
class that constructs a table
array capable of
storing up to M Entry
objects. Each of these M index positions in the
array is called a bucket. For our first attempt, either a bucket is
empty or it contains a single Entry
object.
class Hashtable:
    def __init__(self, M=10):
        self.table = [None] * M
        self.M = M

    def get(self, k):
        hc = hash(k) % self.M
        return self.table[hc].value if self.table[hc] else None

    def put(self, k, v):
        hc = hash(k) % self.M
        entry = self.table[hc]
        if entry:
            if entry.key == k:
                entry.value = v
            else:
                raise RuntimeError('Key Collision: {} and {}'.format(k, entry.key))
        else:
            self.table[hc] = Entry(k, v)
Allocate a table
to hold M
Entry
objects.
The get()
function locates the entry
associated with the hash code for k
and returns its value, if present.
The put()
function locates the entry
associated with the hash code for k
, if present, and overwrites its value; otherwise it stores a new entry.
A collision occurs when two different keys map to the same bucket identified by its hash code value.
You can use this hashtable as follows:
table = Hashtable(1000)
table.put('April', 30)
table.put('May', 31)
table.put('September', 30)
print(table.get('August'))     # Miss: should print None since not present
print(table.get('September'))  # Hit: should print 30
If everything works properly, three Entry
objects will be created for these three (key, value) pairs in an array that has room for 1,000 objects. The performance of put()
and get()
will be independent of the number of Entry
objects in the array, so each action can be considered to have constant time performance, or O(1
).
A miss occurs when get(key)
fails to find an Entry
in the bucket
identified by the hash code for key
. A hit occurs when get(key)
finds an Entry
whose key matches key
. These are both normal behaviors. However, there is still no strategy to resolve collisions when the hash codes for two (or more) keys compute to the same bucket. If you don’t resolve collisions, then a put()
could overwrite an existing Entry
for a different key, causing the Hashtable
to lose keys—this must be prevented.
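The following short example, which assumes the Hashtable from Listing 3-3, shows such a collision; small Python integers hash to themselves, so keys 5 and 10 both map to bucket 0 when M = 5:

```python
# With the Hashtable from Listing 3-3, two keys that map to the same bucket
# cannot both be stored. hash(5) % 5 == 0 and hash(10) % 5 == 0.
table = Hashtable(5)
table.put(5, 'five')
table.put(10, 'ten')   # raises RuntimeError: Key Collision: 10 and 5
```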
It may happen that two entries, e1
= (key1
, val1
) and e2
= (key2
, val2
), share the same hash code for their respective keys even though key1 is different from key2. In Table 3-1, both 'name'
and 'would'
have the same hash code of 10. Assume that e1
is put in the Hashtable
first. You will encounter a hash collision later when you try to put e2
in the same Hashtable
: this is because the identified bucket is nonempty with an Entry
whose key, key1
, is different from key2
, the key for e2
. If you can’t resolve these collisions, you won’t be able to store both e1
and e2
in the same Hashtable
.
Open addressing resolves hash collisions by probing (or searching) through alternative locations in table
when put()
encounters a collision. One common approach, called linear probing, has put()
incrementally check higher index positions in table
for the next available empty bucket if the one designated by the hash code contains a different entry; if the end of the array is reached without finding an empty bucket, put()
continues to search through table
starting from index position 0. This search is guaranteed to succeed because I make sure to always keep one bucket empty; an attempt to put
an entry into the last remaining empty bucket will fail with a runtime error signifying the hashtable is full.
You might wonder why this approach will even work. Since entries can be
inserted into an index position different from their hash(key)
, how will you find them later? First, you can see that entries are never removed from Hashtable
, only inserted. Second, as more Entry
objects are inserted into buckets in the table
structure, long runs of nonempty buckets appear in table
. An Entry, e1, could now be in a different location, anywhere from index hash(e1.key) % M up to, but not including, the next empty bucket found by searching to the right (wrapping around as necessary).
To demonstrate collision resolution, let’s now add five entries into a
Hashtable
, with M = 7 buckets in Figure 3-2
(which only shows their keys). Buckets shaded in gray are empty. Key 20 is inserted into table[6]
, since 20 % 7 = 6; similarly, 15 is inserted into table[1]
, and 5 is inserted into table[5]
.
Hashtable storage after adding five (key, value) entries

Adding an entry with key 26 causes a collision since table[5]
is
already occupied (by an entry whose key is 5, highlighted in
Figure 3-2), so linear probing next checks table[6]
, which is also unavailable, and so the entry is placed in the next available bucket, table[0]
(because open addressing wraps around when searching for the next available empty index position). Adding an entry with key 19 similarly causes a collision with every nonempty bucket, and the entry is finally placed in table[2]
.
Using the existing Hashtable
in Figure 3-2, I can put
one additional entry into the table (since I need to leave one bucket empty). Where would I place an entry with a key of 44? The hash code for its key is 44 % 7 = 2, which is occupied; searching for the next available
bucket places this entry in table[3]
. Since get()
and put()
use the same search strategy, this entry will eventually be found later when invoking get(44)
.
A chain of entries for hash code hc
in an open addressing Hashtable
is a sequence of consecutive Entry
objects in table
. This sequence starts at a given table[hc]
and extends to the right (wrapping around to the first index in table
as needed), up to but not including the next available unused table
index position. In Figure 3-2, the chain for hash code 5 has a length of 5 even though there are only three entries whose key has that hash code. Also, there are no keys with a hash code of 2, but the chain for hash code 2 has a length of 1 because of an earlier collision. The maximum length of any chain is M – 1 because one bucket is left empty.
Listing 3-4 shows the modifications to Hashtable
to
support open addressing; the Entry
class is unchanged. Hashtable
keeps a running count, N, of the number of entries it stores so it can ensure there is at least one unused bucket (though it does not need to keep track of where that empty bucket is!). This is critically important to ensure the while
loop in the get()
and put()
functions will eventually terminate.
Open addressing implementation of Hashtable

class Hashtable:
    def __init__(self, M=10):
        self.table = [None] * M
        self.M = M
        self.N = 0

    def get(self, k):
        hc = hash(k) % self.M
        while self.table[hc]:
            if self.table[hc].key == k:
                return self.table[hc].value
            hc = (hc + 1) % self.M
        return None

    def put(self, k, v):
        hc = hash(k) % self.M
        while self.table[hc]:
            if self.table[hc].key == k:
                self.table[hc].value = v
                return
            hc = (hc + 1) % self.M

        if self.N >= self.M - 1:
            raise RuntimeError('Table is Full.')

        self.table[hc] = Entry(k, v)
        self.N += 1
Start at the first bucket where an entry whose key is k
could be.
If found, return the value associated with k
.
Otherwise advance to the next bucket, wrapping around to 0 as needed.
If table[hc]
is empty, you know k
is not in table
.
If found, update the value associated with the Entry
for k
.
Once you’ve established that k
is not in table
, throw RuntimeError
if hc
is the last empty bucket.
Create new Entry
in table[hc]
and update N
, the count of keys.
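As a small check, the following sketch reproduces the layout of Figure 3-2 using the open addressing Hashtable from Listing 3-4; small integer keys hash to themselves, which makes the bucket positions predictable:

```python
# Reproduce the layout of Figure 3-2: 20 % 7 = 6, 15 % 7 = 1, 5 % 7 = 5,
# while 26 and 19 collide and wrap around to buckets 0 and 2.
table = Hashtable(7)
for key in [20, 15, 5, 26, 19]:
    table.put(key, key)

print([entry.key if entry else None for entry in table.table])
# prints [26, 15, 19, None, None, 5, 20]
```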
Given this new mechanism, what is the performance of get()
and put()
? Is the time to perform these operations independent of N, the number of keys in the Hashtable
? No!
Consider what happens in a worst case scenario. To an empty Hashtable
, add an entry, and assume it is placed in table[0]
. What if each of the next N – 1 requests to add an entry collides with this existing key in the table
storage? What is the total number of buckets searched? For the first request, it is 1; for the second request, 2, and for the third request, 3. As you can see, there will be k
buckets inspected for request number k
. The total number, then, is the sum of 1 + 2 + … + (N – 1), which conveniently equals N × (N – 1)/2. The average computation is this quantity divided by N, which leaves (N – 1)/2.
Based on what I presented in Chapter 2, I can classify (N – 1)/2 as O(N) as follows: First, this formula can be written as N/2 – ½. As the problem instance size increases, the dominant term is N/2. In the worst case, the average number of buckets searched is directly proportional to N (in this case, half of N).
I have just demonstrated that in the worst case, the average number of buckets inspected is O(N). I use algorithmic analysis here to estimate the count of buckets (not performance time or number of operations) since the runtime performance for get()
and put()
is directly related to the number of buckets inspected.
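You can provoke this worst case yourself with keys that are all multiples of M, again assuming the open addressing Hashtable from Listing 3-4:

```python
# Worst case sketch: with M = 10, keys that are multiples of 10 all hash to
# bucket 0, so the k-th put() must probe k buckets before finding a free one.
table = Hashtable(10)
for key in [0, 10, 20, 30]:
    table.put(key, key)

print([entry.key if entry else None for entry in table.table])
# prints [0, 10, 20, 30, None, None, None, None, None, None]
```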
This is a stalemate. You can increase the allocated table
size, M, to be much greater than the number of keys to be inserted, N, and reduce the number of collisions and the overall time to execute. But if you don’t plan correctly—that is, if N becomes closer and closer to M—you could quickly have horrible performance: even worse, the table
can still fill up, preventing more than M – 1 entries from being stored.
Table 3-2 compares the performance of inserting N entries into a Hashtable
of size M that uses open addressing to resolve conflicts. Observe the following:
For a small value of N, say 32, the average cost is nearly the same
(read the values from left to right in that row) regardless of the size, M, of the Hashtable
, because M is so much larger than N.
For any Hashtable
of size M, the average time to insert N keys
consistently increases as N increases (read the numbers from top to
bottom in any column).
If you look “diagonally southeast” in the table, you will see that the timing results are more or less the same. In other words, if you want the average performance of inserting 2 × N keys to be the same as inserting N keys, then you need to double the initial size of the Hashtable
.
N \ M | 8,192 | 16,384 | 32,768 | 65,536 | 131,072 | 262,144 | 524,288 | 1,048,576
---|---|---|---|---|---|---|---|---
32 | | | | | | | |
64 | | | | | | | |
128 | | | | | | | |
256 | | | | | | | |
512 | | | | | | | |
1,024 | | | | | | | |
2,048 | | | | | | | |
4,096 | | | | | | | |
8,192 | | | | | | | |
16,384 | | | | | | | |
This working implementation of the symbol table data type only becomes efficient when the available storage is noticeably larger than the number of keys to be inserted. If you misjudge the total number of keys in the symbol table, then performance will be surprisingly inefficient, sometimes 100 times slower. Worse, it is not really useful since I have not yet added the capability to remove a key from the symbol table. To overcome these limitations, I need to introduce the linked list data structure.
I now modify Hashtable
to store an array of linked lists in a technique known as separate chaining. Where linear probing looked for an empty bucket in which to place an entry, separate chaining stores an array of linked lists where each linked list contains entries whose keys compute to the same hash code. These linked lists are formed from LinkedEntry
nodes, shown in Listing 3-5.
LinkedEntry node structure to support linked list of (key, value) pairs

class LinkedEntry:
    def __init__(self, k, v, rest=None):
        self.key = k
        self.value = v
        self.next = rest
More accurately, table[idx]
refers to the first LinkedEntry
node in a linked list of nodes whose keys share the same hash code value of
idx
. For convenience, I still refer to the M buckets, which can be empty (in which case table[idx] = None
) or contain the first LinkedEntry
node in a linked list.
The chain concept, introduced earlier for open addressing, is more clearly visible with linked lists: the length of the linked list is the length of the chain.
I continue to use hash(key) % M
to compute the hash code for an entry to be inserted. All entries with the same hash code exist within the same linked list. In Figure 3-3, the table
array has seven buckets, and therefore seven potential linked lists; once again, only the keys are shown.
Hashtable linked list storage after adding five (key, value) pairs

Figure 3-3 adds the (key, value) pairs in the same order as Figure 3-2: the first three entries with keys 20, 15, and 5 create three linked lists, one for each of the buckets associated with the hash code for these keys. Buckets shaded in gray are empty. Adding an entry whose key is 26 causes a collision in table[5]
, so a new entry is prepended to the beginning of this linked list, resulting in a linked list with two entries. When adding the final entry whose key is 19, it is also prepended to the linked list in table[5]
, resulting in a linked list with three entries.
Pay attention to how the newest entry added to a bucket becomes the first entry in that bucket’s linked list. Once put()
determines that the entry’s key does not appear in the linked list, it is efficient to just prepend a new LinkedEntry
node at the beginning of the linked
list. When the next
reference is None
, that entry is the last
one in the linked list. Listing 3-6 shows the modifications to Hashtable
.
Separate chaining implementation of Hashtable

class Hashtable:
    def __init__(self, M=10):
        self.table = [None] * M
        self.M = M
        self.N = 0

    def get(self, k):
        hc = hash(k) % self.M
        entry = self.table[hc]
        while entry:
            if entry.key == k:
                return entry.value
            entry = entry.next
        return None

    def put(self, k, v):
        hc = hash(k) % self.M
        entry = self.table[hc]
        while entry:
            if entry.key == k:
                entry.value = v
                return
            entry = entry.next

        self.table[hc] = LinkedEntry(k, v, self.table[hc])
        self.N += 1
The get()
and put()
functions have nearly identical structures, with a while
loop that visits each LinkedEntry
node in the linked list. Starting with the first LinkedEntry
in table[hc]
, the while
loop visits each node exactly once until entry
is None
, which means all nodes were visited. As long as entry
is not None
, the entry.key
attribute is inspected to see if there is an exact match with the k
parameter. For get()
, the associated entry.value
is returned, while for put()
, this value is overwritten with v
. In both cases, in the worst case, the while
loop will terminate once all entries in the linked list
have been seen. When get()
exhausts the linked list, it returns None
since no entry was found; put()
prepends a new entry (adds it at the beginning of the list), since it didn’t exist before.
put(k, v)
only adds a new entry to a linked list after checking that none of the existing entries in the linked list have k
as its key.
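Here is a short sketch that repeats the Figure 3-3 example with the separate chaining Hashtable from Listing 3-6; integer keys hash to themselves, so the chain in table[5] is predictable:

```python
# Keys 5, 26, and 19 all land in the linked list at table[5]; the most
# recently added key appears first because put() prepends new entries.
table = Hashtable(7)
for key in [20, 15, 5, 26, 19]:
    table.put(key, key)

entry = table.table[5]
while entry:
    print(entry.key, end=' ')   # prints 19 26 5
    entry = entry.next
print()
print(table.get(19))            # prints 19
print(table.get(44))            # prints None (44 % 7 = 2 is an empty bucket)
```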
To evaluate the performance of this linked list structure, I need to count both the number of times a bucket is accessed as well as the number of times an entry node is inspected. It turns out that the performance of this linked list implementation is nearly identical to the open addressing implementation: the primary improvement is that there is no limit to the number of entries you can store in a linked list implementation. However, as I will discuss later in this chapter, if the number of entries, N, far exceeds the number of buckets, M, then the performance will degrade significantly. You also must consider that the memory requirement for storing the next references in the linked lists is twice as much as for open addressing.
Linked lists are versatile and dynamic data structures that can be extended, shortened, and spliced together efficiently because they do not rely on a fixed-size block of sequential memory. You can’t remove an index position from an array, but you can remove a node from a linked list.
Let’s break it down into two cases using a linked list containing three
entries whose key values are shown in Figure 3-4. Let’s say you want to delete the entry whose key is 19. This is the first
node in the linked list. To remove this entry, simply set the value of
first
to be first.next
. The modified linked list now has two entries, starting with 26.
To delete any other entry
(say, the one whose key is 26), look through the list, as shown in Figure 3-5. Stop when you find an entry with that key value, but keep a reference to the prior node, prev
, during your search. Now set the value of prev.next
to be entry.next
. This cuts out this middle node and results in a linked list containing one fewer node. Note that this will also
handle the case when entry
is the final entry in the linked list, since then prev.next
is set to None
.
In both cases, the memory for the removed node is reclaimed. Listing 3-7 contains the remove(k)
method in the Hashtable
linked list implementation that removes the entry with the given key, should it exist.
remove() function for separate chaining implementation of Hashtable

def remove(self, k):
    hc = hash(k) % self.M
    entry = self.table[hc]
    prev = None
    while entry:
        if entry.key == k:
            if prev:
                prev.next = entry.next
            else:
                self.table[hc] = entry.next
            self.N -= 1
            return entry.value
        prev, entry = entry, entry.next
    return None
self.table[hc]
refers to the first entry in the linked list associated with hash code hc
.
Continue iterating as long as there are entries in this linked list.
Locate entry to remove by comparing target k
with the key
field for entry
.
When found, if there is a prev
reference, link around entry
, which removes it from the list.
If there is no prev
reference, then entry
is first. Set the linked list at self.table[hc]
to point to the second node in the linked list.
Decrement count of entries, N. It is common to return the value associated with the entry being removed.
If key is not found, continue iteration by setting prev
to entry
and advancing entry
to the next entry.
By convention, the remove(k)
function returns the value that had been associated with k
or None
if k
was not present.
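A brief usage sketch, assuming remove() from Listing 3-7 has been added to the separate chaining Hashtable:

```python
table = Hashtable(1000)
table.put('April', 30)
table.put('May', 31)

print(table.remove('April'))   # prints 30, the value that had been associated
print(table.get('April'))      # prints None, since the entry is gone
print(table.remove('April'))   # prints None, since 'April' is no longer present
```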
I now have two different structures that provide a symbol table data type to store (key, value) pairs. The linked list implementation has the added benefit of allowing (key, value) pairs to be removed, so you must choose that structure if you need this functionality. If you only add entries to a symbol table, however, you still need to evaluate the efficiency of these two approaches.
Let’s first evaluate the storage requirements to store N (key, value)
pairs. Both approaches create an array of size M to hold entries. However, in the linked list approach, N can grow to be as large as necessary; for open addressing, N must be strictly smaller than M, so
you must be sure to choose a large enough M in advance. The size of the
memory requirements for table
is directly proportional to M.
Ultimately, there will be N entries inserted into the symbol table. Each Entry
in open addressing stores only the (key, value) pair, while the LinkedEntry
for the linked list approach stores an additional next
reference for each entry. Since each reference is a fixed memory size, the extra storage is directly proportional to N.
Open addressing requires storage proportional to both M and N, but since N < M, I can simply say that the storage is O(M).
Separate chaining requires storage proportional to both M and N, but since there are no restrictions on N, the storage is O(M + N).
To evaluate runtime performance, the key operation to count is the number of times an entry is inspected. Let’s start with the worst case, which occurs when the computed hash code for all keys is the same. In the linked list implementation, the table
array contains M – 1 unused index positions, and a single linked list contains all N (key, value) pairs. The key to be searched might be the final entry in the linked list, so the get()
performance in the worst case would be directly proportional to N. The situation in open addressing is similar: there would be N consecutive entries within the M-sized array, and the one you are looking for is the last one. I can safely say that regardless of implementation choice, in the worst case, get()
is O(N).
This might seem like a deal-breaker, but it turns out that the mathematical hash functions do a really good job of distributing the hash code values for keys; as M increases, the probability of collisions decreases. Table 3-3 describes a head-to-head comparison of these two approaches by inserting N = 321,129 words from an English dictionary into a hashtable whose size, M, varies from N/2 to 2 × N. It also includes results for M = 20 × N (first row) and smaller values of M (the last five rows).
Table 3-3 shows two pieces of information for each (M, N) pair:
The average length of each nonempty chain in the Hashtable
. This
concept is applicable whether open addressing or linked lists are used.
The size of the maximum chain in the Hashtable
. If the open addressing Hashtable
becomes too congested—or linked lists become excessively long for certain hash codes—the runtime performance suffers.
 | Linked list | Linked list | Open addressing | Open addressing
M | Avg. chain len | Max chain len | Avg. chain len | Max chain len
---|---|---|---|---
6,422,580 | | | |
… | | | |
642,258 | | | |
610,145 | | | |
579,637 | | | |
550,655 | | | |
523,122 | | | |
496,965 | | | |
472,116 | | | |
448,510 | | | |
426,084 | | | |
404,779 | | | |
384,540 | | | |
365,313 | | | |
347,047 | | | |
329,694 | | | |
313,209 | | | |
… | | | |
187,925 | | | |
112,755 | | | |
67,653 | | | |
40,591 | | | |
24,354 | | | |
The values in the table all increase as the size of M decreases; this
occurs because smaller Hashtable
sizes lead to more collisions,
producing longer chains. As you can see, however, open addressing degrades much more quickly, especially when you consider that there are some hash codes whose chain extends to hundreds of entries to be inspected. Worse, once M is smaller than N, it becomes impossible to use open addressing (shown in the table as Fail
). In contrast, the statistics for the linked list implementation appear to be almost immune from this situation. If the size, M, of a hashtable is much larger than N—for example, twice as large—then the average chain length is very close to 1, and even the maximum chain length is quite small. However, M has to be decided in advance, and if you use open addressing, you will run out of room once N = M – 1.
Things are much more promising with separate chaining. As you can see in Table 3-3, even when N is more than ten times the size of the hashtable, M, linked lists can grow to accommodate all of the entries, and the performance doesn't suffer nearly as much as open addressing does as N gets closer and closer to M. This is evident from the maximum chain length of the linked lists shown in Table 3-3.
These numbers provide the information to develop a strategy to ensure the efficiency of a hashtable whose initial size is M. The performance of a hashtable can be measured by how “full” it is—which can be determined by dividing N by M. Mathematicians have even defined the term alpha to represent the ratio N/M; computer scientists refer to alpha as the load factor of a hashtable.
For separate chaining, alpha represents the average number of keys in each linked list and can be larger than 1, limited only by the available memory.
For open addressing, alpha is the percentage of buckets that are occupied; its highest value is (M – 1)/M, so it must be smaller than 1.
Years of research have identified that hashtables become increasingly inefficient once the load factor is higher than 0.75—in other words, once an open addressing hashtable is ¾ full.7 For separate chaining hashtables, the concept still applies, even though they do not “fill up” in the same way.
Figure 3-6 plots the average chain size (shown in
diamonds using the left Y-axis) and maximum chain size (shown in squares using the right Y-axis) after inserting N = 321,129 words into a Hashtable
of size M (found on the X-axis). This graph effectively shows how you can compute a suitable M value to ensure a desired average (or maximum) chain size if you know the number of keys, N, to be inserted.
If the hashtable could only grow larger—that is, increase its M value—then the load factor would reduce and the hashtable would be efficient again. I next show how to accomplish this, with a little bit of effort.
The DynamicHashtable
in Listing 3-8 uses a
load_factor
of 0.75 to set a threshold
target.
load_factor and threshold when creating DynamicHashtable

class DynamicHashtable:
    def __init__(self, M=10):
        self.table = [None] * M
        self.M = M
        self.N = 0

        self.load_factor = 0.75
        self.threshold = min(M * self.load_factor, M - 1)
What if I simply doubled the size of the storage array, using a resize strategy known as geometric resizing? More precisely, double the array size and add 1 when resizing.8 Once the number of (key, value) pairs is larger than or equal to threshold
, the table
storage array needs to grow in size to remain efficient. The revised put()
method for separate chaining is shown in Listing 3-9.
put() method initiates resize()

def put(self, k, v):
    hc = hash(k) % self.M
    entry = self.table[hc]
    while entry:
        if entry.key == k:
            entry.value = v
            return
        entry = entry.next

    self.table[hc] = LinkedEntry(k, v, self.table[hc])
    self.N += 1

    if self.N >= self.threshold:
        self.resize(2 * self.M + 1)
To increase storage in most programming languages, you need to allocate a new array and copy over all of the original entries from the first array into the second array, as shown in Figure 3-7. This figure shows the result after resizing the array storage for the array of linked lists as well as open addressing.
You should first observe that this copy operation will take time that is directly proportional to M—the greater the size of the hashtable storage array, the more elements need to be copied—so this operation can be classified as O(M). Simply copying the entries won’t actually work, however, since some entries—such as the ones with 19 and 26 as keys—will no longer be findable.
If you went to look for key 19, its hash code would now be 19 % 15 = 4,
and that bucket is empty in both structures, indicating that no entry with a key of 19 exists in the hashtable. In the former open addressing
Hashtable
with size M = 7, key 19 had been placed in bucket 2
because linear probing wrapped around the end of the old array (of size
7) to the beginning when it was inserted. Now that the array has 15
elements, the wraparound doesn’t happen, so this key can no longer be
found.
By pure coincidence, some entries are still findable, and these are highlighted in Figure 3-7. In open addressing, the entry with key 20 is not in table[5]
, but linear probing will find it in table[6]
; similarly, the entry with key 15 is not in table[0]
, but linear probing will find it in table[1]
. The entry with key 5 is findable in both open addressing and separate chaining because its hash code remains the same.
What can I do to avoid losing keys? The proper solution, shown in Listing 3-10, is to create a new temporary Hashtable
with twice the original storage (technically,
2M + 1) and rehash all entries into the new Hashtable
. In other words, for each of the (k
, v
) entries in the original Hashtable
, call put(k,v)
on the new temporary Hashtable
. Doing so will guarantee these entries will remain findable. Then, with a nifty programming trick, the underlying storage array from temp
is stolen and used as is, whether for separate chaining or for open addressing.
def resize(self, new_size):
    temp = DynamicHashtable(new_size)
    for n in self.table:
        while n:
            temp.put(n.key, n.value)
            n = n.next

    self.table = temp.table
    self.M = temp.M
    self.threshold = self.load_factor * self.M
This code is nearly identical for resizing with open addressing. I now
have a strategy for dynamically increasing the size of a
Hashtable
. Figure 3-8 contains the proper resized hashtables for the open addressing example in Figure 3-2 and the separate chaining example in Figure 3-3.
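Assuming you assemble a DynamicHashtable from Listings 3-8, 3-9, and 3-10 (together with the separate chaining get() from Listing 3-6), a small example shows the table growing while its keys remain findable:

```python
# With M = 4 the threshold is min(4 * 0.75, 3) = 3, so the third put()
# triggers a resize to 2*4 + 1 = 9 buckets.
table = DynamicHashtable(4)
for month in ['April', 'May', 'September', 'June']:
    table.put(month, days_in_month[month])

print(table.M)                 # prints 9
print(table.get('September'))  # prints 30: still findable after rehashing
```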
How well does the resizing logic perform? Let’s conduct an experiment running 25 repeated trials using different M values to measure:
The time it takes to add N = 321,129 keys into a Hashtable
with
initial size M that doubles in size when threshold
is exceeded.
The time to locate all N of these words once all keys are inserted.
For separate chaining and open addressing hashtables, Table 3-4 compares the build times and access times for dynamically resizable hashtables starting with different initial size, M, ranging from 625 to 640,000. This table also shows the performance of nongrowing hashtables with an initial size of M = 428,172. This will result in a fair comparison, since N = 321,129, or 428,172 × 0.75.
 | Linked list | Linked list | Open addressing | Open addressing
M | Build time | Access time | Build time | Access time
---|---|---|---|---
625 | | | |
1,250 | | | |
2,500 | | | |
5,000 | | | |
10,000 | | | |
20,000 | | | |
40,000 | | | |
80,000 | | | |
160,000 | | | |
320,000 | | | |
640,000 | | | |
… | | | |
Fixed | | | |
The last row presents the ideal case, where the load factor of the
Hashtable
does not exceed 0.75. You should expect open addressing to
require more build time, because collisions on one bucket inevitably affect other buckets, whereas separate chaining localizes all collisions within the same bucket.
The other remaining rows show that you don’t have to magically predict the initial M to use, since the access time (to inspect all 321,129 keys) is essentially the same.
In the worst case, both put()
and get()
for Hashtable
are O(N). As I have explained before, if the hash code for each key computes to exactly the same bucket (for both separate chaining and open addressing), the time to complete each operation is directly proportional to N, the number of keys in the Hashtable
.
However, under the commonly agreed-upon assumption of simple uniform hashing, hash functions will uniformly distribute keys into the buckets of the hashtable: each key has an equal probability of being placed in any bucket. From this assumption, mathematicians have proved that the average length of each chain is N/M. The takeaway is that you can rely on the experts who have developed the Python hash()
function.
Since you can always guarantee that N < M when hashtables can grow, I can say for certain that the average quantity N/M is a constant, O(1
), and is independent of N.
A search can either hit (the key is in the hashtable) or miss (it is not). In open addressing (assuming uniform key distribution), on a hit the average number of buckets to inspect is (1 + 1/(1 – alpha))/2. This works out to 2.5 when alpha = 0.75. For a search miss, the result is (1 + 1/(1 – alpha)²)/2. This works out to 8.5 under the same assumption.
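These two formulas are easy to evaluate yourself:

```python
# Evaluate the two open addressing estimates for a few load factors.
def expected_hit(alpha):
    return (1 + 1/(1 - alpha)) / 2

def expected_miss(alpha):
    return (1 + 1/(1 - alpha)**2) / 2

for alpha in [0.25, 0.5, 0.75]:
    print(alpha, expected_hit(alpha), expected_miss(alpha))
# at alpha = 0.75 this prints 2.5 and 8.5
```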
I need to account for the extra costs of resizing the hashtable, working under the assumption that the threshold load factor is 0.75 and that the underlying storage doubles in size using geometric resizing. To start things off, let’s assume M is 1,023 and that N is much larger than M: I’ll use the 321,129-word English dictionary again. I need to count the number of times each key is inserted into a hashtable (including the temporary one in resize
).
The first resize is triggered when the 768th key is added (since 768 is ≥ 767.25 = 0.75 × 1,023), which grows the hashtable to have M = 1,023 × 2 + 1, or 2,047. During this resize, 768 keys are rehashed and inserted. Note that immediately after the resize, the load factor is reduced in half to 768/2047, which is approximately 0.375.
When an additional 768 keys are inserted, all 1,536 keys are rehashed into a new hashtable of size M = 2,047 × 2 + 1, or 4,095. The hashtable is resized a third time when an additional 1,536 keys are inserted, which causes the existing 3,072 keys to be inserted into a hashtable of size M = 8,191.
To make sense of these numbers, Table 3-5 shows the resizing moments when the Nth word is inserted, together with the accumulated total number of times any key is inserted into the hashtable. During resize events, the final column shows that the average number of insertions (by dividing the total number of insertions by N) converges to 3. Even though resizing forces each key to be reinserted with a geometric resizing strategy, this table demonstrates you will never need more than three times as many insertions than without resizing.
Word | M | N | # insert | average
---|---|---|---|---
absinths | | | |
accumulatively | | | |
addressful | | | |
aladinist | | | |
anthoid | | | |
basirhinal | | | |
cincinnatian | | | |
flabella | | | |
peps | | | |
… | | | |
zyzzyvas | | | |
The key observation is that geometric resizing ensures that resize events occur significantly less frequently as the size of the table grows. In Table 3-5, once the final word is inserted into the hashtable—which is not a resize event—the average has dropped to 2.22, and it will not need to be resized again until another 72,087 keys are added. As you can see, this is almost 100 times less frequently than when it started (when a resize was triggered after just 768 were added).
The result of this last analysis is that the average cost of inserting all 321,129 entries into a dynamically resizing hashtable is no more than three times what it would cost if the hashtable had somehow been large enough to store all N keys. As you have seen in Chapter 2, this is just a multiplicative constant, which will not change the performance classification of the average case for put()
: it remains O(1
) even with extra work due to resizing.
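If you want to reproduce the arithmetic behind Table 3-5, the following sketch simulates the resize schedule; it counts insertions only, not actual hashing:

```python
# Start with M = 1,023, resize to 2*M + 1 whenever N reaches 0.75 * M, and
# count every insertion (including reinsertions during each resize) while
# adding 321,129 keys.
M, threshold = 1023, 0.75 * 1023
inserts = 0
for N in range(1, 321129 + 1):
    inserts += 1
    if N >= threshold:
        inserts += N              # every existing key is rehashed
        M = 2 * M + 1
        threshold = 0.75 * M

print(inserts, inserts / 321129)  # 713577 insertions, about 2.22 per key
```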
If you know the collection of N keys in advance, then you can use a technique known as perfect hashing to construct an optimal hashtable where the hash code for each key is a unique index location. Perfect hashing generates the Python code containing the hash function to use. This is an unexpected result, and there are perfect hash generators for numerous programming languages.
If you install the third-party Python library perfect-hash
, it can
generate a perfect_hash()
function from an input file containing the
desired keys.9 Listing 3-11 contains the
generated code using the words “a rose by any other name would smell as
sweet.”
G  = [0, 8, 1, 4, 7, 10, 2, 0, 9, 11, 1, 5]
S1 = [9, 4, 8, 6, 6]
S2 = [2, 10, 6, 3, 5]

def hash_f(key, T):
    return sum(T[i % 5] * ord(c) for i, c in enumerate(key)) % 12

def perfect_hash(key):
    return (G[hash_f(key, S1)] + G[hash_f(key, S2)]) % 12
The enumerate()
built-in Python function improves how you iterate over
lists when you also need positional information.
>>> for i, v in enumerate(['g', 't', 'h']):
...     print(i, v)
0 g
1 t
2 h
enumerate()
iterates over each value in a collection and additionally
returns the index position.
Recall from Figure 3-1 how I defined a list, day_array
, and a supporting base26()
hash function that worked with twelve months?
Perfect hashing pursues the same approach in a more elegant way, processing the N strings to create the lists G
, S1
, and S2
and a supporting hash_f()
function.
To compute the index location for the string 'a'
, you need two
intermediate results; recall that ord('a')
= 97:
hash_f('a', S1)
= sum
([ S1[0]
× 97 ]) % 12. Since S1[0]
= 9, this is the value (9 × 97) % 12 = 873 % 12 = 9.
hash_f('a', S2)
= sum
([ S2[0]
× 97 ]) % 12. Since S2[0]
= 2, this is the value (2 × 97) % 12 = 194 % 12 = 2.
The value returned by perfect_hash('a')
is (G[9] + G[2])
% 12 = (11 + 1) % 12 = 0. This means that the hash code for the string 'a'
is
0. Repeat this computation10 for the string 'by'
and you will find that:
hash_f('by', S1)
= (9 × 98 + 4 × 121) % 12 = 1,366 % 12 = 10.
hash_f('by', S2)
= (2 × 98 + 10 × 121) % 12 = 1,406 % 12 = 2.
perfect_hash('by')
is (G[10] + G[2])
% 12 = (1 + 1) % 12 = 2.
To summarize, the key 'a'
hashes to index location 0, and the key 'by'
hashes to index location 2. In fact, each of the words in “a rose by any other name would smell as sweet” is hashed to a different computed index location. It is sometimes truly amazing what mathematics can do!
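You can confirm this with the code from Listing 3-11:

```python
# Using G, S1, S2, and perfect_hash() from Listing 3-11: every word in the
# phrase maps to its own index position from 0 to 11, with no collisions.
words = 'a rose by any other name would smell as sweet'.split()
positions = [perfect_hash(w) for w in words]
print(positions)
print(len(set(positions)) == len(words))   # prints True
```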
Listing 3-12 contains the perfect_hash()
function for the
321,129 words in the sample dictionary. This function computes 0 for
the first English word, 'a'
, and 321,128 for the last English word,
'zyzzyvas'
. It is supported by a large list, G
, containing 667,596
values (not shown, obviously!) and two intermediate lists, S1
and S2
.
For the string 'by'
in this larger perfect hashtable, you can confirm
the following:
hash_f('by', S1)
= (394,429 × 98 + 442,829 × 121) % 667,596 = 92,236,351 % 667,596 = 108,103
hash_f('by', S2)
= (14,818 × 98 + 548,808 × 121) % 667,596 = 67,857,932 % 667,596 = 430,736
perfect_hash('by')
= (G[108,103
] + G[430,736
]) % 667,596 = (561,026 + 144,348) % 667,596 = 37,778
S1 = [394429, 442829, 389061, 136566, 537577, 558931, 481136, 337378,
      395026, 636436, 558331, 393947, 181052, 350962, 657918, 442256,
      656403, 479021, 184627, 409466, 359189, 548390, 241079, 140332]

S2 = [14818, 548808, 42870, 468503, 590735, 445137, 97305, 627438,
      8414, 453622, 218266, 510448, 76449, 521137, 259248, 371823,
      577752, 34220, 325274, 162792, 528708, 545719, 333385, 14216]

def hash_f(key, T):
    return sum(T[i % 24] * ord(c) for i, c in enumerate(key)) % 667596

def perfect_hash(key):
    return (G[hash_f(key, S1)] + G[hash_f(key, S2)]) % 667596
The computation in perfect_hash(key)
in Listing 3-12 produces a large sum that is reduced using % 667,596 to identify a unique location from the large G
list. As long as key
is a valid English word from the dictionary, perfect_hash(key)
uniquely identifies an index from 0 to 321,128.
If you inadvertently attempt to hash a key that is not an English word, a collision will occur: the word 'watered'
and the nonword
'not-a-word'
both hash to the index location 313,794. This is not an
issue for perfect hashing, since the programmer is responsible for ensuring that only valid keys are ever hashed.
Hashtables are designed for efficient get(k)
and put(k,v)
operations. It might also be useful to retrieve all entries in a hashtable, whether it uses open addressing or separate chaining.
Python generators are one of the best features of the language. Most programming languages force programmers to return the values in a collection using extra storage. In Chapter 2, I explained how range(0, 1000)
and range(0, 100000)
both use the same amount of memory while returning all integers in the respective ranges; generators make this possible.
The following generator function produces the integers in the range from 0 to n
that do not include the given digit
:
def avoid_digit(n, digit):
    sd = str(digit)
    for i in range(n):
        if not sd in str(i):
            yield i
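For example:

```python
# The generator produces values lazily; list() collects them for display.
print(list(avoid_digit(20, 1)))   # prints [0, 2, 3, 4, 5, 6, 7, 8, 9]
```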
To give an object this same capability in Python, a class can
provide an __iter__()
method that allows the caller to use the for v in
object
idiom.
The two __iter__()
implementations in Listing 3-13 are designed for separate chaining and open addressing hashtables.
# Iterator for Open Addressing Hashtable
def __iter__(self):
    for entry in self.table:
        if entry:
            yield (entry.key, entry.value)

# Iterator for Separate Chaining Hashtable
def __iter__(self):
    for entry in self.table:
        while entry:
            yield (entry.key, entry.value)
            entry = entry.next
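Once __iter__() is defined, you can iterate over a hashtable directly; this sketch assumes the separate chaining Hashtable from Listing 3-6:

```python
# Traverse a Hashtable in a for loop; the order depends on how the string
# keys happen to hash in this Python process.
table = Hashtable(13)
for w in 'a rose by any other name would smell as sweet'.split():
    table.put(w, 1)

for key, value in table:
    print(key, value)
```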
To demonstrate how these iterators work, construct two hashtables with M equal to 13—one using open addressing and one using separate chaining—and a third hashtable using perfect hashing. After inserting the words in the string “a rose by any other name would smell as sweet,” Table 3-6 shows the words generated by the respective hashtables.
Open addressing | Separate chaining | Perfect hash
---|---|---
a | sweet | a
by | any | any
any | a | as
name | would | by
other | smell | name
would | other | other
smell | as | rose
as | name | smell
sweet | by | sweet
rose | rose | would
The words returned for the open addressing and separate chaining hashtables appear to be in a random order; of course, this isn’t randomness but based solely on the way keys are hashed. If you execute the sample code that produces this table, your ordering for open addressing and separate chaining will likely be different, because the hash()
code values for strings in Python 3 are unpredictable.
One nice feature of the perfect-hash
library is that the index positions computed by perfect_hash(key)
are based on the order of the words used when generating the perfect hashing code. Simply use a list of strings that are already in sorted order, and the entries will be stored in sorted order, and the iterator will yield the pairs in the same order.
Chapter 8 contains more details of the Python dict
type. Instead of
implementing a symbol table from scratch, as I have done in this chapter, you should always use dict
because it is a built-in type and will be significantly more efficient than the code I provide.
In this chapter, I introduced several key concepts:
The linked list data structure can store small collections of items efficiently, allowing for dynamic insertion and removal.
Symbol tables use a storage array with M buckets to store entries. There needs to be a strategy to resolve collisions when two or more keys hash to the same bucket.
Open addressing relies on distributing entries within an array to reduce the number of collisions, but this only works efficiently when the size of the storage array, M, is more than twice as large as the number of stored entries, N. See the challenge exercises for two approaches that support removing keys.
Separate chaining uses linked lists to store entries whose keys hash to the same bucket. This approach makes it easier to support a remove operation for keys.
Designing hash functions is hard—use the predefined ones designed by Python’s designers.
Geometric resizing ensures a symbol table remains efficient by decreasing the frequency of future resize events.
Perfect hashing can be used to construct hash functions that avoid
collisions by computing a unique bucket index location for a fixed number of keys; the hash function is often more computationally expensive than a default hash()
function.
Does open addressing improve if you use a different strategy for resolving collisions? Instead of using a linear probe with a delta of 1 (wrapping around), create a hashtable whose size is a power of 2, and use a probe sequence that explores additional index positions using deltas that are triangle numbers, that is, the integers 1, 3, 6, 10, 15, 21, and so on; you will still have to wrap around. The nth triangle number is the sum of the integers from 1 to n, represented by the formula n × (n + 1)/2.
Is the overall performance better than linear probing?
Conduct an experiment where you fill a Hashtable
of size 524,288 with the first 160,564 English words (for a utilization of 30%), and then measure the time it takes to search for all 321,129 words. Compare this performance against an open addressing Hashtable
and the separate chaining variation.
Is it worth sorting the keys in the linked lists of a separate chaining hashtable? Construct a Hashtable
variant that sorts the (key, value) pairs in each linked list in ascending order by key.
Conduct an experiment where you measure the time it takes to construct the initial Hashtable
of size 524,287 (a prime number) with the first 160,564 English words (for a utilization of 30%) in reverse order. As you will see, the worst case for this variation occurs when the keys are put()
into the hashtable in increasing value. Compare this performance against an open addressing Hashtable
and the regular separate chaining Hashtable
.
Now measure the time it takes to search the first 160,564 English words as keys. As you might expect, this is the best case, since all these words are in the hashtable. Compare this performance against the array-based Hashtable
and the separate chaining variation. Next search for the remaining 160,565 words in the back half of the English dictionary. This provides a glimpse of the worst case since finding these words in a linked list will always require each linked list to be fully explored. Once again, compare this performance against the open addressing Hashtable
and the regular unordered linked list variation.
How sensitive are these results to the chosen initial size, 524,287? For example, compare with 214,129 (a utilization of 75%) and 999,983 (a utilization of 16%).
To see the dangers in predictable hash codes, consider the
ValueBadHash
code in Listing 3-14. The objects for this Python class hash to just four different values (0 to 3). This class overrides the default behavior of hash()
as well as __eq__()
, so these objects can be used as a key when invoking put(key, v)
.
ValueBadHash has a horrible hash() function

class ValueBadHash:
    def __init__(self, v):
        self.v = v

    def __hash__(self):
        return hash(self.v) % 4

    def __eq__(self, other):
        return (self.__class__ == other.__class__ and
                self.v == other.v)
Construct a separate chaining Hashtable(50,000)
and invoke
put(ValueBadHash(w), 1)
for the first 20,000 English words in the
dictionary. Next, create a regular separate chaining Hashtable
, and invoke put(w, 1)
for these same words. Generate statistics on:
The average chain length of a bucket
The maximum chain length of any bucket in Hashtable
Be prepared for the code execution to take a long time! Explain these statistics.
Having a prime number for the number of buckets, M, is helpful in practice because every key that shares a common factor with M will be hashed to a bucket that is a multiple of this factor. For example, if M = 632 = 8 × 79 and the entry to be inserted has a key of 2,133 = 27 × 79, the hash code is 2,133 % 632 = 237 and 237 = 79 × 3. The problem is that the performance of a hashtable assumes uniform distribution of keys, which is violated when some keys are predisposed to be placed in certain buckets.
To demonstrate the impact of M, form a key from the base26
representation of each of the words in the 321,129-word English dictionary. In the range from 428,880 to 428,980 containing 101 potential values for M, construct fixed-size hashtables (using both open addressing and separate chaining) and produce a table showing the average and maximum chain length size. Are there any values of M in this range that are particularly bad? Can you find out what these values all have in common? Armed with this knowledge, scan the next 10,000 higher values of M (up to 438,980) to try to find one whose maximum chain length is almost ten times as bad.
When using an open addressing hashtable, there is no easy way to support remove(key)
because removing an entry from a bucket may break an existing chain created using linear probing. Imagine you have a Hashtable
with M = 5 and you hashed entries with key values of 0, 5, and then 10. The resulting table
array would contain the following keys [0
, 5
, 10
, None
, None
], since each collision would be resolved using linear probing. A flawed remove(5)
operation would simply remove the entry with key 5 from table
, resulting in an array storing [0
, None
, 10
, None
, None
]. However, the entry with key 10 is no longer findable because the chain at index location 0 was broken.
One strategy is to add a Boolean field to the Entry
that records whether the entry has been deleted or not. You have to modify get()
and put()
accordingly. In addition, the resize event needs to be adjusted as well, since entries marked as deleted will not need to be inserted into the new hashtable. Don’t forget to update the __iter__()
method to skip deleted entries.
Consider adding logic to initiate a shrink event when more than half of
the entries in the hashtable are marked for deletion. Run some trials to compare the performance of separate chaining vs. open addressing now that remove()
is available for open addressing.
Resizing a hashtable incurs a steep penalty since all N values need to be rehashed into the new structure. In incremental resizing, instead, the resize event allocates a new array, newtable
(with size 2M + 1), but the original hashtable remains. A get(key)
request first searches for key in newtable
before searching through the original table
. A put(key,value)
request inserts a new entry into newtable
: after each insertion, delta
elements from the old table
are rehashed and inserted
into newtable
. Once all elements are removed from the original table
, it can be deleted.
Implement this approach with a separate chaining hashtable, and experiment with different values of delta
. In particular, what is the smallest value of delta
that will ensure the old table is completely emptied before the next resize event? Is delta
a constant, or is it based on M or even N?
This approach will reduce the total cost incurred at any time by put()
; perform empirical measurements using an example similar to
Table 3-5, but now evaluating the runtime performance cost of the most expensive operation.
Should the number of (key, value) pairs, N, in a hashtable become less than ¼ of M, the table
storage array could shrink to reduce
unneeded storage space. This shrink event can be triggered by remove()
.
Modify either the separate chaining or open addressing hashtable to realize this capability, and run some empirical trials to determine if it is worthwhile.
Use a symbol table to find the element in a list that is duplicated the most number of times. In the case of a tie, any value that is repeated “the most number of times” can be returned.
most_duplicated([1,2,3,4])
can return 1, 2, 3, or 4, while
most_duplicated([1,2,1,3])
must return 1.
An open addressing hashtable can remove entries by rehashing all remaining (key, value) pairs in the chain after the one that is to be removed. Add this capability to an open addressing hashtable, and run some trials to compare the performance of separate chaining versus open addressing with this revised remove capability.
1 Throughout this chapter, the notation (key, value) represents a pair of information considered as a single unit.
2 Note that 26³ = 17,576.
3 Originally based on the English alphabet, ASCII encodes 128 characters (typically those found on a typewriter keyboard) into seven-bit integers. Capitalization matters! ord('A')
= 65, for example, but ord('a')
= 97.
4 Java computes 32-bit hash function values, and sometimes two string keys have the exact same hash()
value; for example, both misused and horsemints hash to 1,069,518,484.
5 In Java, if hash(key)
is negative, then the %
operator will return a negative number, so the formula must be (key.hashCode() & 0x7fffffff) % M
to first convert the negative hash
computation into a positive integer before computing modulo M.
6 See https://oreil.ly/C4V0W to learn more and try the challenge exercise at the end of this chapter.
7 The Python dict
type uses ⅔ as the threshold.
8 It is commonly observed that Hashtables
whose number of buckets is a prime number work really well (see challenge exercise at end of this chapter); here the size is odd, which can also be helpful.
9 Use the Python pip
installer like this: pip
install perfect-hash
.
10 ord('b')
= 98, and ord('y')
= 121.