Some programmers pay too much attention to efficiency; by worrying too soon about little “optimizations” they create ruthlessly clever programs that are insidiously difficult to maintain. Others pay too little attention; they end up with beautifully structured programs that are utterly inefficient and therefore useless. Good programmers keep efficiency in context: it is just one of many problems in software, but it is sometimes very important.
Previous columns have discussed high-level approaches to efficiency: problem definition, system structure, algorithm design and data structure selection. This column is about a low-level approach. “Code tuning” locates the expensive parts of an existing program and then makes little changes to improve its speed. It’s not always the right approach to follow and it’s rarely glamorous, but it can sometimes make a big difference in a program’s performance.
Chris Van Wyk and I chatted about code tuning early one afternoon; he then wandered off to improve a C program. A few hours later, he had halved the run time of a three-thousand-line graphics program.
Although the run time for typical images was much shorter, the program took ten minutes to process some complicated pictures. Van Wyk’s first step was to profile the program to find how much time was spent in each function (a profile of a similar, but smaller, program is shown on the next page). Running the program on ten typical test pictures showed that it spent almost seventy percent of its time in the memory allocation function malloc.
Van Wyk’s next step was to study the memory allocator. Because his program accessed malloc through a single function that provided error checking, he was able to modify that function without examining the source code of malloc. He inserted a few lines of accounting code, which showed that the most popular kind of record was allocated thirty times more frequently than the runner-up. If you knew that the majority of the program’s time was spent looking through storage for a single type of record, how would you modify it to make it faster?
Van Wyk solved his problem by applying the principle of caching: data that is accessed most often should be the cheapest to access. He modified his program by caching free records of the most common type in a linked list. He could then handle the common request by a quick access to that list rather than by invoking the general storage allocator; this reduced the total run time of his program to 45 percent of what it had been previously (so the storage allocator now took 30 percent of the total time). An additional benefit was that the reduced fragmentation of the modified allocator made more efficient use of main memory than the original allocator. Solution 2 shows an alternative implementation of this ancient technique; we’ll use a similar approach several times in Column 13.
This story illustrates the art of code tuning at its best. By spending a few hours measuring and adding about twenty lines to a 3000-line program, Van Wyk doubled its speed without altering the users’ view of the program or increasing the difficulty of maintenance. He used general tools to achieve the speedup: profiling identified the “hot spot” of his program and caching reduced the time spent there.
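The caching idea itself fits in a few lines of C. The sketch below is illustrative rather than Van Wyk's code (the record layout and the names newnode and freenode are invented here, and Solution 2 gives the book's own variation): released records of the most common type are pushed onto a free list, and the allocation routine reuses them before falling back to malloc.

#include <stdlib.h>

typedef struct node {
    struct node *next;      /* link used only while the node sits on the free list */
    /* ... application fields for the most common record type ... */
} node;

static node *freelist = NULL;

node *newnode(void)
{   node *p = freelist;
    if (p != NULL)              /* common case: reuse a cached node */
        freelist = p->next;
    else                        /* rare case: fall back to the general allocator */
        p = malloc(sizeof(node));
    return p;
}

void freenode(node *p)
{   p->next = freelist;         /* push the released record onto the cache */
    freelist = p;
}

Because the common request never reaches malloc, its cost drops to a pointer load and a store.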
This profile of a typical little C program is similar, in both form and content, to Van Wyk’s profile:
     Func              Func+Child        Hit
     Time      %       Time       %     Count    Function
 ---------------------------------------------------------
 1413.406   52.8   1413.406    52.8    200002    malloc
  474.441   17.7   2109.506    78.8    200180    insert
  285.298   10.7   1635.065    61.1    250614    rinsert
  174.205    6.5   2675.624   100.0         1    main
  157.135    5.9    157.135     5.9         1    report
  143.285    5.4    143.285     5.4    200180    bigrand
   27.854    1.0     91.493     3.4         1    initbins
On this run, the majority of the time went to malloc. Problem 2 encourages you to reduce the run time of this program by caching nodes.
We’ll turn now from a large program to several small functions. Each describes a problem that I’ve seen in various contexts. The problems consumed most of the run time in their applications, and the solutions use general principles.
Problem One — Integer Remainders. Section 2.3 sketches three algorithms for rotating a vector. Solution 2.3 implements a “juggling” algorithm with this operation in the inner loop:
k = (j + rotdist) % n;
The cost model in Appendix 3 shows that the C remainder operator % can be very expensive: while most arithmetic operations take about ten nanoseconds, the % operator takes close to 100 nanoseconds. We might be able to reduce the time of the program by implementing the operation with this code:
k = j + rotdist;
if (k >= n)
k -= n;
This replaces the expensive % operator with a comparison and (rarely) a subtraction. But will that make any difference in the complete function?
My first experiment ran the program with the rotation distance rotdist set to 1, and the run time dropped from 119n nanoseconds to 57n nanoseconds. The program was almost twice as fast, and the 62-nanosecond reduction per element was close to what the cost model predicted.
My next experiment set rotdist to 10, and I was shocked to see that the run time of both methods was identical at 206n nanoseconds. Experiments like those graphed in Solution 2.4 soon led me to conjecture the cause: at rotdist = 1, the algorithm accessed memory sequentially, and the remainder operator dominated the time; at rotdist = 10, though, the code accessed every tenth word of memory, and fetching RAM into caches became dominant.
In the old days, programmers learned that it was futile to try to speed up the computations in programs that spend their time doing input and output. On modern architectures, it is equally futile to try to reduce computation time when so many of the cycles are spent accessing memory.
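For reference, here is how the cheap remainder sits inside a juggling rotation. This is a sketch rather than a transcription of Solution 2.3, written with x, n and rotdist as parameters; it assumes 0 < rotdist < n, so j + rotdist is always less than 2n and a single subtraction suffices.

int gcd(int i, int j)           /* greatest common divisor by repeated subtraction */
{   while (i != j)
        if (i > j) i -= j;
        else       j -= i;
    return i;
}

void rotate(int x[], int n, int rotdist)
{   int cycles = gcd(rotdist, n);
    for (int i = 0; i < cycles; i++) {
        int t = x[i];               /* follow one cycle, starting at x[i] */
        int j = i;
        for (;;) {
            int k = j + rotdist;
            if (k >= n)             /* cheap replacement for k = (j + rotdist) % n */
                k -= n;
            if (k == i)
                break;
            x[j] = x[k];
            j = k;
        }
        x[j] = t;
    }
}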
Problem Two — Functions, Macros and Inline Code. Throughout Column 8, we had to compute the maximum of two values. In Section 8.4, for instance, we used code like this:
maxendinghere = max(maxendinghere, 0);
maxsofar = max(maxsofar, maxendinghere);
The max function returns the maximum of its two arguments:
float max(float a, float b)
{ return a > b ? a : b; }
The run time of this program is about 89n nanoseconds.
Old C programmers might have the knee-jerk reaction to replace this particular function with a macro:
#define max(a, b) ((a) > (b) ? (a) : (b))
This is certainly uglier and more error-prone. With many optimizing compilers, it will make no difference at all (such compilers write small functions in line). On my system, however, this change reduced the run time of Algorithm 4 from 89n nanoseconds to 47n nanoseconds, for a speedup factor of almost two.
My delight with this speedup evaporated when I measured the effect of changing Algorithm 3 in Section 8.3 to use a macro: for n = 10,000, its run time increased from ten milliseconds to a hundred seconds, for a slowdown factor of 10,000. The macro appeared to have increased the run time of Algorithm 3 from its previous O(n log n) to something closer to O(n^2). I soon discovered that the call-by-name semantics of macros caused Algorithm 3 to call itself recursively more than twice, and therefore increased its asymptotic running time. Problem 4 gives a more extreme example of such a slowdown.
While C programmers have to fret about tradeoffs between performance and correctness, C++ programmers enjoy the best of both worlds. C++ allows one to request that a function be compiled inline, which combines the clean semantics of functions with the low overhead of macros.
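A minimal sketch of that approach follows; the inline keyword is accepted by both C99 and C++, and it keeps function semantics (each argument is evaluated exactly once) while inviting the compiler to remove the call overhead.

static inline float max(float a, float b)
{   return a > b ? a : b;   }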
Out of curiosity, I avoided both macros and functions, and wrote out the computation in if statements:
if (maxendinghere < 0)
maxendinghere = 0;
if (maxsofar < maxendinghere)
maxsofar = maxendinghere;
The run time was essentially unchanged.
Problem Three — Sequential Search. We’ll turn now to sequentially searching a (potentially unsorted) table:
int ssearch1(t)
    for i = [0, n)
        if x[i] == t
            return i
    return -1
This succinct code takes 4.06n nanoseconds, on the average, to look up an element that is in the array x. Because it looks at only half the elements in the array in a typical successful search, it spends about 8.1 nanoseconds at each table element.
The loop is svelte, but a tiny bit of fat may be trimmed. The inner loop has two tests: the first tests whether i is at the end of the array, and the second tests whether x[i] is the desired element. We can make the second test do the work of the first by placing a sentinel value at the end of the array:
int ssearch2(t)
    hold = x[n]
    x[n] = t
    for (i = 0; ; i++)
        if x[i] == t
            break
    x[n] = hold
    if i == n
        return -1
    else
        return i
This drops the run time to 3.87n nanoseconds, for a speedup of about five percent. This code assumes that the array has been allocated so that x[n] may be temporarily overwritten. It is careful to save x[n] and to restore it after the search; that is unnecessary in most applications, and will be deleted from the next version.
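For concreteness, here is a plain C rendering of the sentinel search, written with x and n as parameters; as noted above, it omits the save and restore of x[n] and assumes the array was allocated with room for the sentinel.

int ssearch(int x[], int n, int t)
{   int i;
    x[n] = t;                   /* the sentinel guarantees that the loop terminates */
    for (i = 0; x[i] != t; i++)
        ;
    return i == n ? -1 : i;     /* i == n means only the sentinel matched */
}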
The innermost loop now contains just an increment, an array access, and a test. Is there any way to reduce that further? Our final sequential search unrolls the loop eight times to remove the increment; further unrolling did not make it faster.
int ssearch3(t)
    x[n] = t
    for (i = 0; ; i += 8)
        if (x[i  ] == t) {         break }
        if (x[i+1] == t) { i += 1; break }
        if (x[i+2] == t) { i += 2; break }
        if (x[i+3] == t) { i += 3; break }
        if (x[i+4] == t) { i += 4; break }
        if (x[i+5] == t) { i += 5; break }
        if (x[i+6] == t) { i += 6; break }
        if (x[i+7] == t) { i += 7; break }
    if i == n
        return -1
    else
        return i
This drops the time to 1.70n nanoseconds, for a reduction of about 56 percent. On old machines, reducing the overhead might have given a speedup of ten or twenty percent. On modern machines, though, loop unrolling can help to avoid pipeline stalls, to reduce branches, and to increase instruction-level parallelism.
Problem Four — Computing Spherical Distances. The final problem is typical of applications that deal with geographic or geometric data. The first part of the input is a set S of five thousand points on the surface of a globe; each point is represented by its latitude and longitude. After those points are stored in a data structure of our choice, the program reads the second part of the input: a sequence of twenty thousand points, each represented by latitude and longitude. For every point in that sequence, the program must tell which point in S is closest to it, where distance is measured as the angle between the rays from the center of the globe to the two points.
Margaret Wright encountered a similar problem at Stanford University in the early 1980’s as she computed maps to summarize data on the global distribution of certain genetic traits. Her straightforward solution represented the set S by an array of latitude and longitude values. The nearest neighbor to each point in the sequence was found by calculating its distance to every point in S using a complicated trigonometric formula involving ten sine and cosine functions. While the program was simple to code and produced fine maps for small data sets, each large map required several hours of mainframe time, which was well beyond the project’s budget.
Because I had previously worked on geometric problems, Wright asked me to try my hand at this one. After spending much of a weekend on it, I developed several fancy algorithms and data structures for solving it. Fortunately (in retrospect), each would have required many hundreds of lines of code, so I didn’t try to code any of them. When I described the data structures to Andrew Appel, he had a key insight: Rather than approach the problem at the level of data structures, why not use the simple data structure of keeping the points in an array, but tune the code to reduce the cost of computing the distances between points? How would you exploit this idea?
The cost can be greatly reduced by changing the representation of points: rather than using latitudes and longitudes, we’ll represent a point’s location on the surface of the globe by its x, y and z coordinates. Thus the data structure is an array that holds each point’s latitude and longitude (which may be needed for other operations) as well as its three Cartesian coordinates. As each point in the sequence is processed, a few trigonometric functions translate its latitude and longitude into x, y and z coordinates, and we then compute its distance to every point in S. Its distance to a point in S is computed as the sum of the squares of the differences in the three dimensions, which is usually much cheaper than computing one trigonometric function, let alone ten. (The cost model of run times in Appendix 3 gives details for one system.) This method computes the correct answer because the angle between two points increases monotonically with the square of their Euclidean distance.
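Here is a sketch of the transformed computation in C; the structure and function names are illustrative, not Wright's program. Each point's latitude and longitude, in radians, are converted once to coordinates on the unit sphere, and nearest-neighbor comparisons then use only the squared Euclidean distance.

#include <math.h>

typedef struct {
    double lat, lon;            /* kept for other operations */
    double x, y, z;             /* precomputed Cartesian coordinates */
} point;

void setxyz(point *p)           /* lat and lon in radians */
{   p->x = cos(p->lat) * cos(p->lon);
    p->y = cos(p->lat) * sin(p->lon);
    p->z = sin(p->lat);
}

double sqdist(const point *a, const point *b)   /* monotone in the angle */
{   double dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
    return dx*dx + dy*dy + dz*dz;
}

int nearest(const point *q, const point s[], int m)   /* assumes m >= 1 */
{   int best = 0;
    double bestd = sqdist(q, &s[0]);
    for (int i = 1; i < m; i++) {
        double d = sqdist(q, &s[i]);
        if (d < bestd) {
            bestd = d;
            best = i;
        }
    }
    return best;
}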
Although this approach does require additional storage, it yields substantial benefits: when Wright incorporated the change into her program, the run time for complicated maps was reduced from several hours to half a minute. In this case, code tuning solved the problem with a couple of dozen lines of code, while algorithmic and data structure changes would have required many hundreds of lines.
We’ll turn now to one of the most extreme examples I know of code tuning. The details are from Problem 4.8: we are to perform a binary search in a table of one thousand integers. As we go through the process, keep in mind that tuning is usually not needed in binary search — the algorithm is so efficient that code tuning is often superfluous. In Column 4, we therefore ignored microscopic efficiency and concentrated on achieving a simple, correct and maintainable program. Sometimes, though, the tuned search can make a big difference in a system.
We’ll develop the fast binary search in a sequence of four programs. They are subtle, but there is good reason to follow them closely: the final program is usually two or three times faster than the binary search of Section 4.2. Before reading on, can you spot any obvious waste in that code?
l = 0; u = n-1
loop
    /* invariant: if t is present, it is in x[l..u] */
    if l > u
        p = -1; break
    m = (l + u) / 2
    case
        x[m] <  t: l = m+1
        x[m] == t: p = m; break
        x[m] >  t: u = m-1
Our development of the fast binary search will start with the modified problem of locating the first occurrence of the integer t in the integer array x[0..n-1]; the above code might return any one of multiple occurrences of t (we need just such a search in Section 15.3). The main loop of the program is similar to the one above; we'll keep indices l and u into the array that bracket the position of t, but we'll use the invariant relation that x[l] < t <= x[u] and l < u. We'll assume that n >= 0, that x[-1] < t and that x[n] >= t (but the program will never access the two fictitious elements). The binary search code is now
l = -1; u = n
while l+1 != u
    /* invariant: x[l] < t && x[u] >= t && l < u */
    m = (l + u) / 2
    if x[m] < t
        l = m
    else
        u = m
/* assert l+1 == u && x[l] < t && x[u] >= t */
p = u
if p >= n || x[p] != t
    p = -1
The first line initializes the invariant. As the loop is repeated, the invariant is maintained by the if statement; it's easy to check that both branches maintain the invariant. Upon termination we know that if t is anywhere in the array, then its first occurrence is in position u; that fact is stated more formally in the assert comment. The final two statements set p to the index of the first occurrence of t in x if it is present, and to -1 if it is not present.
While this binary search solves a more difficult problem than the previous program, it is potentially more efficient: it makes only one comparison of t to an element of x in each iteration of the loop. The previous program sometimes had to test two such outcomes.
The next version of the program is the first to exploit the fact that we know that n is 1000. It uses a different representation of a range: instead of representing l..u by its lower and upper values, we'll represent it by its lower value l and an increment i such that l + i = u. The code will ensure that i is at all times a power of two; this property is easy to keep once we have it, but it is hard to get originally (because the array is of size n = 1000). The program is therefore preceded by an assignment and if statement to ensure that the range being searched is initially of size 512, the largest power of two less than 1000. Thus l and l+i together represent either -1..511 or 488..1000. Translating the previous binary search program to this new representation of a range yields this code:
i = 512
l = -1
if x[511] < t
    l = 1000 - 512
while i != 1
    /* invariant: x[l] < t && x[l+i] >= t && i = 2^j */
    nexti = i / 2
    if x[l+nexti] < t
        l = l + nexti
        i = nexti
    else
        i = nexti
/* assert i == 1 && x[l] < t && x[l+i] >= t */
p = l+1
if p >= 1000 || x[p] != t
    p = -1
The correctness proof of this program has exactly the same flow as the proof of the previous program. This code is usually slower than its predecessor, but it opens the door for future speedups.
The next program is a simplification of the above, incorporating some optimizations that a smart compiler might perform. The first if statement is simplified, the variable nexti is removed, and the assignments to nexti are removed from the inner if statement.
i = 512
l = -1
if x[511] < t
    l = 1000 - 512
while i != 1
    /* invariant: x[l] < t && x[l+i] >= t && i = 2^j */
    i = i / 2
    if x[l+i] < t
        l = l + i
/* assert i == 1 && x[l] < t && x[l+i] >= t */
p = l+1
if p >= 1000 || x[p] != t
    p = -1
Although the correctness argument for the code still has the same structure, we can now understand its operation on a more intuitive level. When the first test fails and l stays at -1, the program computes the bits of p in order, most significant bit first.
The final version of the code is not for the faint of heart. It removes the overhead of loop control and the division of i by two by unrolling the entire loop. Because i assumes only ten distinct values in this program, we can write them all down in the code, and thereby avoid computing them over and over again at run time.
l = -1
if (x[511] < t) l = 1000 - 512
/* assert x[l] < t && x[l+512] >= t */
if (x[l+256] < t) l += 256
/* assert x[l] < t && x[l+256] >= t */
if (x[l+128] < t) l += 128
if (x[l+64 ] < t) l += 64
if (x[l+32 ] < t) l += 32
if (x[l+16 ] < t) l += 16
if (x[l+8  ] < t) l += 8
if (x[l+4  ] < t) l += 4
if (x[l+2  ] < t) l += 2
/* assert x[l] < t && x[l+2] >= t */
if (x[l+1  ] < t) l += 1
/* assert x[l] < t && x[l+1] >= t */
p = l+1
if p >= 1000 || x[p] != t
    p = -1
We can understand this code by inserting the complete string of assertions like those surrounding the test of x[l + 256]. Once you do the two-case analysis to see how that if statement behaves, all the other if statements fall into line.
I’ve compared the clean binary search of Section 4.2 with this fine-tuned binary search on a variety of systems. The first edition of this book reported the run times across four machines, five programming languages, and several levels of optimizations; the speedups ranged from a 38 percent reduction to a factor of five (an 80 percent reduction). When I measured it on my current machine, I was shocked and delighted to see a reduction in search time for n = 1000 from 350 nanoseconds to 125 nanoseconds per search (a reduction of 64 percent).
The speedup seemed too good to be true, and it was. Close inspection showed that my timing scaffolding was searching for each array element in order: first x[0], then x[1], and so forth. This gave binary search particularly favorable memory access patterns and wonderful branch prediction. I therefore changed the scaffolding to search for the elements in random order. The clean binary search took 418 nanoseconds, while the loop-unrolled version took 266 nanoseconds, for a speedup of 36 percent.
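The scaffolding itself is only a few dozen lines. The sketch below shows the kind of harness described above, with the targets shuffled so that successive searches do not probe the array in order; the repetition count and output format are illustrative, not the original code.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    1000
#define REPS 10000

int x[N];

int binsearch(int t)            /* the clean binary search of Section 4.2 */
{   int l = 0, u = N-1;
    while (l <= u) {
        int m = (l + u) / 2;
        if      (x[m] <  t) l = m + 1;
        else if (x[m] == t) return m;
        else                u = m - 1;
    }
    return -1;
}

int main(void)
{   int t[N], sink = 0;
    for (int i = 0; i < N; i++)
        x[i] = t[i] = i;
    for (int i = N-1; i > 0; i--) {     /* shuffle the targets */
        int j = rand() % (i + 1);
        int tmp = t[i]; t[i] = t[j]; t[j] = tmp;
    }
    clock_t start = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            sink ^= binsearch(t[i]);    /* sink keeps the calls from being optimized away */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%.0f ns per search (sink %d)\n", 1e9 * secs / ((double)REPS * N), sink);
    return 0;
}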
This derivation is an idealized account of code tuning at its most extreme. We replaced the obvious binary search program (which doesn’t appear to have much fat on it) with a super-lean version that is substantially faster. (This function has been known in the computing underground since the early 1960’s. I learned it from Guy Steele in the early 1980’s; he learned it at MIT, where it had been known since the late 1960’s. Vic Vyssotsky used this code at Bell Labs in 1961; each if statement in the pseudocode was implemented in three IBM 7090 instructions.)
The program verification tools of Column 4 played a crucial role in the task. Because we used them, we can believe that the final program is correct; when I first saw the final code presented with neither derivation nor verification, I looked upon it as magic.
The most important principle about code tuning is that it should be done rarely. That sweeping generalization is explained by the following points.
The Role of Efficiency. Many other properties of software are as important as efficiency, if not more so. Don Knuth has observed that premature optimization is the root of much programming evil; it can compromise the correctness, functionality and maintainability of programs. Save concern for efficiency for when it matters.
Measurement Tools. When efficiency is important, the first step is to profile the system to find out where it spends its time. Profiling a program usually shows that most of the time is going to a few hot spots and that the rest of the code is rarely executed (in Section 6.1, for instance, a single function accounted for 98 percent of the run time). Profiling points to the critical areas; for the other parts we follow the wise maxim of, “If it ain’t broke, don’t fix it.” A cost model for run times like that in Appendix 3 can help a programmer to understand why certain operations and functions are expensive.
Design Levels. We saw in Column 6 that efficiency problems can be attacked in many ways. Before tuning code, we should make sure that other approaches don’t provide a more effective solution.
When Speedups Are Slowdowns. Replacing the % remainder operator with an if statement sometimes gave a factor-of-two speedup and other times made no difference in run time. Converting a function to a macro sped up one function by a factor of two, and slowed down another by a factor of ten thousand. After making an "improvement", it is critical to measure its effect on representative input. Because of these stories and dozens more like them, we must be careful to heed Jurg Nievergelt's warning to code tuners: people who play with bits should expect to get bitten.
The discussion above considers whether and when to tune code; once we decide to do so, that still leaves the question of how. Appendix 4 contains a list of general rules for code tuning. All the examples we’ve seen can be explained in terms of those principles; I’ll do that now with the names of the rules in italics.
Van Wyk’s Graphics Program. The general strategy of Van Wyk’s solution was to Exploit Common Cases; his particular exploitation involved Caching a list of the most common kind of record.
Problem One — Integer Remainders. The solution Exploited an Algebraic Identity to replace an expensive remainder operation with a cheap comparison.
Problem Two — Functions, Macros and Inline Code. Collapsing a Procedure Hierarchy by replacing a function with a macro gave a speedup of almost a factor of two, but writing the code inline made no further difference.
Problem Three — Sequential Search. Using a sentinel to Combine Tests gave a speedup of about five percent. Loop Unrolling gave an additional speedup of about 56 percent.
Problem Four — Computing Spherical Distances. Storing Cartesian coordinates along with latitudes and longitudes is an example of Data Structure Augmentation; using the cheaper Euclidean distance rather than the angular distance Exploits an Algebraic Identity.
Binary Search. Combining Tests reduced the number of array comparisons per inner loop from two to one, Exploiting an Algebraic Identity changed representations from a lower and upper bound to a lower bound and an increment, and Loop Unrolling expanded the program to remove all loop overhead.
So far we’ve tuned code to reduce CPU time. One can tune code for other purposes, such as reducing paging or increasing a cache hit ratio. Perhaps the most common use of code tuning beyond reducing run time is to reduce the space required by a program; the next column is devoted to the topic.
1. Profile one of your own programs, and then try to use the approach described in this column to reduce the run time of its hot spots.
2. This book’s web site contains the C program profiled in the introduction; it implements a small subset of a C++ program that we’ll see in Column 13. Try profiling it on your system. Unless you have a particularly efficient malloc function, it will probably spend most of its time in malloc. Try to reduce that time by implementing a node cache like Van Wyk’s.
3. What special properties of the “juggling” rotation algorithm allowed us to replace the % remainder operator with an if statement, and not a more costly while statement? Experiment to determine when it is worth replacing a % remainder operator with a while statement.
4. When n is a positive integer at most the size of the array, this recursive C function returns the maximum value in the array x[0..n – 1]:
float arrmax(int n)
{   if (n == 1)
        return x[0];
    else
        return max(x[n-1], arrmax(n-1));
}
When max is a function, it finds the maximum element of a vector with n = 10,000 elements in a few milliseconds. When max is the C macro
#define max(a, b) ((a) > (b) ? (a) : (b))
this algorithm takes 6 seconds to find the maximum of n = 27 elements and 12 seconds to find the maximum of n = 28 elements. Give an input that tickles this dreadful behavior, and analyze the run time mathematically.
5. How do the various binary search algorithms behave if they are (against specification) applied to unsorted arrays?
6. C and C++ libraries provide character classification functions such as isdigit, isupper and islower to determine the types of characters. How would you implement those functions?
7. Given a very long sequence (say, billions or trillions) of bytes, how would you efficiently count the total number of one bits? (That is, how many bits are turned on in the entire sequence?)
8. How can sentinels be used in a program to find the maximum element in an array?
9. Because sequential search is simpler than binary search, it is usually more efficient for small tables. On the other hand, the logarithmic number of comparisons made by binary search implies that it will be faster than the linear time of sequential search for large tables. The break-even point is a function of how much each program is tuned. How low and how high can you make that break-even point? What is it on your machine when both programs are equally tuned?
10. D. B. Lomet observes that hashing may solve the 1000-integer search problem more efficiently than the tuned binary search. Implement a fast hashing program and compare it to the tuned binary search; how do they compare in terms of speed and space?
11. In the early 1960’s, Vic Berecz found that most of the time in a simulation program at Sikorsky Aircraft was devoted to computing trigonometric functions. Further investigation showed that the functions were computed only at integral multiples of five degrees. How did he reduce the run time?
12. One sometimes tunes programs by thinking about mathematics rather than code. To evaluate the polynomial
y = a_n x^n + a_(n-1) x^(n-1) + ... + a_1 x^1 + a_0
the following code uses 2n multiplications. Give a faster function.
y = a[0]
xi = 1
for i = [1, n]
    xi = x * xi
    y = y + a[i]*xi
Section 3.8 describes Steve McConnell’s Code Complete. Chapter 28 of his book describes “Code-Tuning Strategy”; it gives an overview of performance in general, and describes the approach of code-tuning in detail. Chapter 29 is an excellent collection of rules for code tuning.
Appendix 4 of this book presents a related collection of code-tuning rules, together with descriptions of how they have been applied in this book.