My goal in this book was to introduce you to the fundamental algorithms and the essential data types used in computer science. You need to know how to efficiently implement the following data types to maximize the performance of your code:
A linked list ensures O(1) performance when adding a value. If you choose to use an array, then you need to employ geometric resizing to extend the size of the array so you can guarantee amortized O(1) performance over its average use (though you will still incur an O(N) runtime performance on the infrequent resize events). Note that a bag does not typically allow values to be removed, nor does it prevent duplicate values from being added.
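The linked-list approach can be sketched in a few lines. This is a minimal illustration, not the book's own code: the `Node` and `Bag` names are hypothetical, and `add()` simply prepends to the head of the list, which takes constant time no matter how many values are stored.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class Bag:
    def __init__(self):
        self.first = None
        self.size = 0

    def add(self, value):
        # Prepend to the head of the linked list: O(1) regardless of size.
        self.first = Node(value, self.first)
        self.size += 1

    def __iter__(self):
        # A bag supports iteration but not removal.
        node = self.first
        while node:
            yield node.value
            node = node.next

b = Bag()
for v in [1, 2, 2, 3]:
    b.add(v)
print(sorted(b))   # duplicates are allowed
```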
A linked list can store the values in a stack so push() and pop() have O(1) runtime performance. The stack records the top of the stack to push and pop values.
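A sketch of the idea, with illustrative names rather than the book's implementation: both push() and pop() operate only on the head node referenced by `top`, so each is O(1).

```python
class Stack:
    class _Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def __init__(self):
        self.top = None          # the stack records the top of the stack

    def push(self, value):
        # New value becomes the new top of the stack: O(1).
        self.top = Stack._Node(value, self.top)

    def pop(self):
        # Remove and return the value at the top of the stack: O(1).
        if self.top is None:
            raise RuntimeError('stack is empty')
        value = self.top.value
        self.top = self.top.next
        return value

s = Stack()
s.push(1); s.push(2); s.push(3)
print(s.pop(), s.pop())   # 3 2  (last in, first out)
```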
A linked list can efficiently store a queue so enqueue() and dequeue() have O(1) runtime performance. The queue records the first and the last node in the linked list to efficiently add and remove values from the queue.
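The two-pointer idea can be sketched as follows (illustrative names, not the book's code): keeping references to both the first and last nodes means values are appended at one end and removed from the other, each in O(1).

```python
class Queue:
    class _Node:
        def __init__(self, value):
            self.value = value
            self.next = None

    def __init__(self):
        self.first = None        # node to dequeue next
        self.last = None         # node most recently enqueued

    def enqueue(self, value):
        node = Queue._Node(value)
        if self.last is None:    # empty queue: both references point here
            self.first = self.last = node
        else:
            self.last.next = node
            self.last = node

    def dequeue(self):
        if self.first is None:
            raise RuntimeError('queue is empty')
        value = self.first.value
        self.first = self.first.next
        if self.first is None:   # queue became empty
            self.last = None
        return value

q = Queue()
for v in [5, 9, 2]:
    q.enqueue(v)
print(q.dequeue(), q.dequeue())   # 5 9  (first in, first out)
```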
The open addressing approach for symbol tables is surprisingly efficient, with suitable hash functions to distribute the (key, value) pairs. You still need geometric resizing to double the size of the storage, thus effectively making these resize events less frequent.
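A bare-bones open-addressing sketch under simplifying assumptions (no deletion, integer-friendly hashing, illustrative class name): collisions are resolved by linear probing, and the table doubles its storage before it gets half full.

```python
class OpenAddressingTable:
    def __init__(self, m=8):
        self.m = m                         # size of the storage array
        self.n = 0                         # number of (key, value) pairs
        self.keys = [None] * m
        self.values = [None] * m

    def put(self, key, value):
        if self.n >= self.m // 2:
            self._resize(2 * self.m)       # geometric resizing: double M
        i = hash(key) % self.m
        while self.keys[i] is not None:    # linear probing on collision
            if self.keys[i] == key:
                self.values[i] = value
                return
            i = (i + 1) % self.m
        self.keys[i] = key
        self.values[i] = value
        self.n += 1

    def get(self, key):
        i = hash(key) % self.m
        while self.keys[i] is not None:
            if self.keys[i] == key:
                return self.values[i]
            i = (i + 1) % self.m
        return None

    def _resize(self, new_m):
        # Rehash every existing pair into the larger storage array.
        old_keys, old_values = self.keys, self.values
        self.m, self.n = new_m, 0
        self.keys = [None] * new_m
        self.values = [None] * new_m
        for k, v in zip(old_keys, old_values):
            if k is not None:
                self.put(k, v)
```

Because each resize doubles M, the expensive rehashing events become geometrically less frequent as the table grows.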
The heap data structure can store (value, priority) pairs to support enqueue() and dequeue() in O(log N) runtime performance. In most cases, the maximum number of values to store, N, is known in advance; if not, however, the geometric resizing strategy can be used to minimize the number of resize events.
Implementing this data type combines the heap data structure with a separate symbol table that stores the index location of each value in the heap. For graph algorithms, it is common to store only integer node labels in the range from 0 to N – 1, where N is the maximum number of values to be stored in the indexed min priority queue. In this case, the separate symbol table can be implemented as an array for extremely efficient look-up. It supports enqueue(), dequeue(), and decrease_priority() in O(log N) runtime performance.
The adjacency matrix structure is appropriate when all possible edges in the graph exist, which is a common use case for algorithms that compute shortest distances. You can use a two-dimensional array to store an adjacency matrix if the nodes are represented using integers in the range from 0 to N – 1. In most cases, however, an adjacency list structure is more suitable for storing the graphs, and a symbol table is used to associate a bag of adjacent nodes for each node (or, alternatively, a bag of adjacent edges). Any home-built implementation of a graph is going to be insufficient in the long run. Instead, use NetworkX (or any other existing third-party library) to represent graphs efficiently.
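The adjacency list idea can be sketched in pure Python: a dict acts as the symbol table, mapping each node to a set that serves as its "bag" of adjacent nodes. For real work, prefer a library such as NetworkX, as noted above.

```python
# Symbol table mapping each node to the set of its adjacent nodes.
graph = {}

def add_edge(u, v):
    # Undirected graph: record the edge in both directions.
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

add_edge('a', 'b')
add_edge('b', 'c')
print(sorted(graph['b']))   # ['a', 'c']
```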
The book preface contains a picture that summarizes each of these data types. Throughout the book I have demonstrated how different data structures can efficiently implement the data types, leading to the performance classifications in Table 8-1.
Data type | Operation | Classification | Discussion
---|---|---|---
Bag | add() | O(1) | Use a linked list to store the values for a bag, since prepending a value to the beginning of the linked list is a constant time operation.
Stack | push(), pop() | O(1) | Use a linked list for a stack, pushing new values to the front of the linked list and popping values off the front. If an array is used for storage, constant time performance is possible, but the array might become full.
Queue | enqueue(), dequeue() | O(1) | Use a linked list for a queue, storing references to the first and last nodes in the linked list to efficiently add and remove values.
Symbol table | put(), get() | amortized O(1) | Use an array of M linked lists to store N (key, value) pairs in amortized constant time. As more pairs are added, use geometric resizing to double the size of M to increase efficiency. With open addressing, a single contiguous array stores all pairs, using linear probing to resolve conflicts. Iterators can return all keys or values. If you also need to retrieve the keys in sorted order, then use a binary search tree where each node stores a (key, value) pair, but then the performance of put() and get() becomes O(log N).
Priority queue | enqueue(), dequeue() | O(log N) | A heap data structure can store the (value, priority) pairs, using geometric resizing if storage becomes full.
Indexed min priority queue | enqueue(), dequeue(), decrease_priority() | O(log N) | Starting from a heap data structure, store an additional symbol table to look up the location in the heap of any value in O(1) time.
The Python implementations are highly tuned and optimized after decades of use. It is impressive how the designers of the Python interpreter constantly seek new ways to gain even a small improvement in performance with each updated release. The Design and History FAQ for Python is worth a read.
The four built-in container types in Python are tuple, list, dict, and set:
tuple data type
A tuple is an immutable sequence of values that can be processed like a list, except that none of its values can be altered. Use a tuple to return multiple values from a function.
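For example (the function name here is illustrative), Python packs multiple return values into a tuple, and the caller can unpack them directly:

```python
def min_max(values):
    # Returning two values from a function: Python builds a 2-tuple.
    return min(values), max(values)

lo, hi = min_max([3, 1, 4, 1, 5])   # tuple unpacking by the caller
print(lo, hi)   # 1 5
```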
list data type
The built-in list data type is the dominant data structure in Python. It is extremely versatile, especially because of Python's slicing syntax that allows programmers to effortlessly work with iterables and select subranges of a list for processing. The list is a general-purpose structure, implemented as a variable-length array that uses a contiguous array of references to other values.
dict data type
The built-in dict data type represents a symbol table that maps keys to values. All of the concepts in Chapter 3 continue to apply. The Python implementation uses open addressing to resolve collisions between keys and also uses storage arrays where M is a power of 2, which is different from how most hashtables are constructed. Each dict starts with internal storage for at least M = 8 entries, so it can store five entries without causing a resize (smaller values of M would require a resize too quickly). It is also optimized to deal with sparse hashtables where most of the storage array is empty. The dict automatically resizes based on a load factor of ⅔.
Because Python is open source, you can always inspect the internal implementation.1 Instead of using linear probing, as described in Chapter 3, a collision at hash code hc will subsequently select the next index to be ((5 × hc) + 1) % 2^k, where 2^k is M, the size of the storage array. The implementation adds another layer of key distribution using a perturb value that is added to hc. It offers a case study showing how minor mathematical improvements to hash code computations can improve the efficiency of the implementation. Because Python uses a different probing technique, its hashtables can have sizes that are powers of 2, which can eliminate the use of modulo when computing hash codes. This can speed up processing dramatically because the Python interpreter uses a technique known as bit masking to perform computations modulo a power of 2.
Specifically, using the bitwise and (&) operator, N % 2^k is equal to N & (2^k – 1). Table 8-2 shows the difference when timing 10,000,000 computations. The improvement is more pronounced within the Python interpreter (typically written in C), where bitwise and is more than five times faster than using modulo.
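You can confirm the identity directly: when the divisor is a power of two, the bitwise and with (M – 1) keeps exactly the low k bits of N, which is what the modulo computes.

```python
M = 2 ** 3                      # storage size M = 8, a power of 2
for n in [5, 8, 13, 100]:
    # N % 2**k is identical to N & (2**k - 1)
    assert n % M == n & (M - 1)
print(13 % M, 13 & (M - 1))     # 5 5
```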
Language | Computation | Performance
---|---|---
Python | N % 2^k |
Python | N & (2^k – 1) |
C | N % 2^k |
C | N & (2^k – 1) |
set data type
A set collection contains distinct hashable objects.2 It is quite common to use a symbol table to act as a set, by simply mapping each key to some arbitrary value, such as 1 or True. The internal Python implementation is heavily based on the dict approach, but it is worth mentioning that the use cases for a set are quite different from a dict. Specifically, a set is often used for membership testing, to check whether a value is contained by the set—and the code is optimized to handle both hits (i.e., the value is contained in the set) and misses (i.e., the value is not contained in the set). In contrast, when working with a dict, it is much more common that the key exists within the dict.
A Python list can represent a stack, offering an append() function to push new values to the end of the list (recall from Table 6-1 that appending to the end of a list is efficient). The list type also has a pop() method that removes and returns the last value in the list.
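In code, the list-as-stack pattern looks like this; both operations work at the end of the list, so each is amortized constant time:

```python
stack = []
stack.append(9)      # push onto the end of the list
stack.append(7)
stack.append(4)
print(stack.pop())   # 4 -- the most recently pushed value
print(stack.pop())   # 7
```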
Python provides a queue module that implements multi-producer, multi-consumer collections supporting both "last-in, first-out" (LIFO) behavior (as regular stacks should) and "first-in, first-out" (FIFO) behavior (as you would expect with a queue). The queue.LifoQueue(maxsize=0) constructor will return a stack that can store a maximum number of values. This stack is meant to be used concurrently, which means that attempting to push a value to a full stack will block the execution until a value has been popped. Values are pushed with put(value) and popped with get().
Invoking the following sequence of commands will freeze the Python interpreter, and you will have to forcefully stop it:
import queue

q = queue.LifoQueue(3)
q.put(9)
q.put(7)
q.put(4)
q.put(3)            # ... blocks until terminated
As it turns out, the fastest implementation is deque (pronounced DECK) from the collections module, which stands for "double-ended queue." This data type allows values to be added to (or removed from) either end. The performance of LifoQueue is about 30 times slower than deque, though both provide O(1) runtime performance.
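Used as a stack, a deque pushes and pops from the same end, and both operations are O(1):

```python
from collections import deque

stack = deque()
stack.append(9)      # push onto the right end
stack.append(7)
print(stack.pop())   # 7 -- last in, first out
print(stack.pop())   # 9
```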
A Python list can also represent a queue, offering an append() function to enqueue new values to the end of the list. To dequeue, you must remove the first element from the list using pop(0), which removes the element at index position 0. Using the list in this fashion becomes seriously inefficient, as shown in Table 6-1: you must avoid this at all costs.
The queue.Queue(maxsize=0) function constructs a queue that allows maxsize values to be enqueued, but this should not be your default queue implementation. This queue provides multi-producer, multi-consumer behavior, which means that attempting to enqueue a value to a full queue will block the execution until a value has been dequeued. Values are enqueued with put(value) and dequeued with get(). Invoking the following sequence of commands will freeze the Python interpreter, and you will have to forcefully stop it:
>>> import queue
>>> q = queue.Queue(2)
>>> q.put(2)
>>> q.put(5)
>>> q.put(8)
... blocks until terminated
If you need a simple queue implementation, you could use the queue.SimpleQueue() function,3 which provides the behavior for a queue with a simplified interface. Use put(value) to enqueue value to the end of the queue, and get() to retrieve the value at the head of the queue. This queue is much more powerful than most programmers need, because it handles not only thread-safe concurrent code but also more complicated situations, such as reentrant concurrent code, and this extra functionality comes with a performance penalty. You should only use queue.SimpleQueue() if you need concurrent access to the queue.
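Its interface really is minimal: put() enqueues at the end, get() dequeues from the head, so values come back in FIFO order.

```python
import queue

q = queue.SimpleQueue()
q.put(2)
q.put(5)
q.put(8)
print(q.get(), q.get(), q.get())   # 2 5 8
```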
The deque is also the fastest implementation here. It is specially coded for raw speed and is the queue implementation you should choose if speed is critical. Table 8-3 shows that the deque implementation offers the best performance; it also shows how list provides O(N) runtime performance for dequeuing a value. Once again, you must avoid using a generic list as a queue, since all other queue implementations offer O(1) runtime performance.
N | list | deque | SimpleQueue | Queue
---|---|---|---|---
1,024 | | | |
2,048 | | | |
4,096 | | | |
8,192 | | | |
16,384 | | | |
The essential data types provided by Python are flexible enough to be used in a variety of ways, but you need to make sure you choose the appropriate data structures to realize the most efficient code.
Python provides a heapq module that implements a min binary heap, as described in Chapter 4. Instead of using the 1-based indexing strategy I used in Chapter 4, this module uses 0-based indexing.
The heapq.heapify(h) function constructs a heap from a list, h, containing the initial values to place in the heap. Alternatively, simply set h to the empty list [], and invoke heapq.heappush(h, value) to add value to the heap. To remove the smallest value from the heap, use the heapq.heappop(h) function. This heap implementation has two specialized functions:
heapq.heappushpop(h, value)
Adds value onto the heap and then removes and returns the smallest value from the heap.

heapq.heapreplace(h, value)
Removes and returns the smallest value from the heap and also adds value onto the heap.
These functions are all applied directly to the parameter h, making it easy to integrate with your existing code. The contents of h reflect the array-based storage of a heap, as described in Chapter 4.
The queue.PriorityQueue(maxsize=0) constructor returns a min priority queue whose entries are added using put(item) function calls, where item is the tuple (priority, value). Retrieve the value with lowest priority using the get() function.
There is no built-in indexed min priority queue, which is not surprising because this data type is typically only needed for specialized graph algorithms, such as Dijkstra's algorithm (presented in Chapter 7). The IndexedMinPQ class I developed shows how to compose different data structures together to achieve an efficient decrease_priority() function.
This book has only scratched the surface of the incredibly rich field of algorithms. There are numerous avenues you can explore, including different application domains and algorithmic approaches:
Numerous real-world problems involve data sets with two-dimensional points, or even higher dimensions. Within this application domain, there are many algorithms that build from the standard techniques I’ve introduced—such as divide and conquer—and introduce their own data structures to efficiently solve problems. The most popular data structures include k-d trees, Quadtrees (for partitioning two-dimensional spaces), Octrees (for partitioning three-dimensional spaces), and R-trees for indexing multidimensional data sets. As you can see, the essential concept of a binary tree has been explored repeatedly in different application domains.
Floyd–Warshall is an example of dynamic programming applied to solving the all-pairs shortest path problem. There are many other algorithms that take advantage of dynamic programming. You can learn more about this technique in my Algorithms in a Nutshell book, also published by O'Reilly.
The algorithms presented in this book are essentially single-threaded and execute on a single computer to produce a result. Often a problem can be decomposed into multiple constituent parts that can be executed independently, in parallel. The control structures become more complicated, but impressive speed-up can be achieved through parallelism.
In many real-world scenarios, you might be satisfied with an algorithm that efficiently computes an approximate answer to a really challenging problem. These algorithms are noticeably less computationally expensive than those producing the exact answer.
Instead of producing the exact same result when given the exact same input, a probabilistic algorithm introduces randomness into its logic, producing different results when given the same input. By running these less complex algorithms a large number of times, the average over all runs can converge on the actual result, which would otherwise be prohibitively expensive to compute.
No single book can cover the breadth of the field of algorithms. In 1962, Donald Knuth, a towering figure in computer science, began writing a 12-chapter book, The Art of Computer Programming (Addison-Wesley). Today—59 years later—the project has produced three volumes (published in 1968, 1969, and 1973) and the first part of volume 4 (published in 2011). With three more volumes planned for publication, the project is still not complete!
You can find countless ways to continue your study of algorithms, and I hope you are inspired to use this knowledge to improve the performance of your software applications.
1 For example, you can find the implementation of dict in https://oreil.ly/jpI8F.
2 Find its source at https://oreil.ly/FWttm.
3 This capability was added in Python 3.7.