I make games for a living. More precisely, nowadays I help to make games for a living, as I work on a team of several hundred people, all with different areas of expertise. Besides the animators, artists, audio engineers, QA, and production folk, there is a large team of engineers.
It was not always like this. I used to make games single-handed in the 1980s, and games would live or die by their frame rate. In the UK, home computers would use the family PAL TV set for display, attached to a 50Hz electricity supply. CRT displays produced an interlaced image on alternate frames, so you would aim to run your entire game loop in 40 milliseconds, producing 25 frames each second.
This was an immovable number, bound by the laws of physics, or at least by domestic electricity supply. I developed for the Sinclair ZX Spectrum, which ran on a Z80 processor. I knew how many microseconds each instruction would take to execute. Through careful bookkeeping I knew how many scan lines would be swept during each chunk (I hesitate to use the word “function”). When I got to 625, I knew my time was up.
Of course, any notion of selling to the US market was impossible. Electricity was supplied at 60Hz, but the processor ran at the same speed, so there was less time per frame to do things; I would have had to cut features to achieve the higher frame rate. One advantage of the rise of high-definition TV was the consistency of resolution and frame rate: 1920 x 1080 pixels at 60Hz. Unfortunately, with the market fractured again between 4K TV and HDTV, this period of consistency has ended. It is to be hoped that the hardware arms race will settle at 4K.
Returning to video game development in the 1980s, the process of optimization meant one thing, and one thing only: faster execution in less space. There was no optimizing just for speed: a fixed amount of RAM meant that loop unrolling was robbing Peter to pay Paul. Similarly, optimizing just for size would mean increasing the amount of time it took to display a frame.
Frequent tricks included hand-specified inlining. Once a function had been written and debugged, you could decide yourself whether to make the function call or to insert the code into the call site instead. Armed with the knowledge of how long each instruction would take to execute and how much RAM the substitution would cost, you could find speed gains with considerable precision.
This substitution differs from loop unrolling. With unrolling, the body of the loop is duplicated one or more times to reduce the number of jumps back to the start of the loop. This carries costs in space of course. However, substituting a function body for a function call could sometimes save space, since the effort required to call a function in terms of preserving registers could be eliminated by simply duplicating the function.
Stack manipulation was a particular favorite. When the processor encounters a call instruction rather than a jump instruction, it stores the program counter on the stack and changes the program counter to the call operand. When the processor encounters a return instruction, it pops the stack contents into the program counter. While this perfectly supports the idea of single point of entry, single point of return, it can be beautifully subverted to create jump tables. If you know a selection of functions must execute in a particular order, you could push their addresses onto the stack in reverse order of desired execution and then return. Each function would execute, and “return” to the next function in the stack. It was a valuable space saver, but fiendishly hard to debug and completely nonportable.
I found optimization to be the funnest part of making games. You could chip away at it until you decided that you had simply reached the end of the line, and you would declare your function optimal. Optimization relied upon a deep and intimate knowledge of the platform. When I was working with the Z80 processor I had to know that the register-loading instructions lay between 0x40 and 0x7f, that the conditional return instructions started with 0xc, 0xd, 0xe, or 0xf and ended with 0x0 or 0x8, and that the xor instructions ran from 0xa8 to 0xaf, in the order b, c, d, e, h, l, (hl), a.
Things are different for me now. C++ takes me very close to the metal, but not on the metal. The compiler bridges that gap and embodies all the detailed knowledge that I used to have about the Z80 processor. In my current case that knowledge concerns the x86-64 architecture, which encompasses multiple processors. I cannot hope to acquire the necessary detail on my own. This is one of the reasons for moving to C++.
There is an art to optimization, but there is also an art to developing code that can be optimized well. You cannot optimize code that is utterly inflexible, and optimized code will often end up in this state. Optimizable code leaves options for the optimizer, be that a human or a compiler. Code that can be optimized is code with options to change things. Indeed, the words “optimize” and “option” both have the same Latin root, optare, “to choose.”
Engineers tend to optimize code in two ways, either by improving the design or by improving the performance. Improving the design means improving the comprehensibility and reusability of the code or responding to refinements to the requirements. Improving the performance means giving the compiler the greatest amount of information about what it is that you want to do and the greatest range of options to improve the performance.
The Core Guidelines page offers a good example from the C standard, which we reproduce here:
void qsort(void* base, size_t num, size_t size, int(*comp)(const void*, const void*));
This function takes a pointer to some memory, the number of things you want to sort, the size of the things you want to sort, and a function for ordering the things you want to sort.
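For concreteness, here is a minimal sketch of a call site; the compare_ints function and the values array are hypothetical, not part of the standard:

#include <cstdlib>

// A comparison function of the shape qsort demands: it receives two
// type-erased pointers and must cast them back to the element type.
int compare_ints(const void* lhs, const void* rhs)
{
    int const a = *static_cast<int const*>(lhs);
    int const b = *static_cast<int const*>(rhs);
    return (a > b) - (a < b);   // negative, zero, or positive
}

int main()
{
    int values[] = { 3, 1, 2 };
    std::qsort(values, 3, sizeof(int), compare_ints);
}

Note that nothing in this call ties the comparison function to the element type.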
Let’s start with optimizing the design of this function. The first problem is that we seem to be sorting memory rather than sorting things. This is the wrong level of abstraction. If you’re going to name a function “sort,” then the implication is that you will be sorting things. The idea of sorting memory is meaningless. Also, by taking a void*, the function is discarding information known at the call site: the type of object.
We can fix this immediately by jumping to C++ and exchanging the void* for a template parameter:
template <typename T> void sort(T* base, size_t num, size_t size, int(*comp)(const T*, const T*));
This optimization improves the design by catching a typical error at compile time rather than runtime: passing a nonsensical comparison function. A void* can point to anything at all, while a T* must point to an instance of T. If you had a set of floating-point values that you wanted to sort, and your comparison function sorted characters, a call to sort may yield the wrong result.
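As a sketch of the kind of mistake this catches, imagine a comparator written for characters let loose on doubles (the names here are hypothetical):

#include <cstdlib>

// A comparator that treats its arguments as chars, hidden behind the
// type-erased signature that qsort requires.
int compare_as_chars(const void* lhs, const void* rhs)
{
    return *static_cast<char const*>(lhs) - *static_cast<char const*>(rhs);
}

int main()
{
    double data[] = { 2.0, 1.0, 3.0 };

    // Compiles without complaint, but compares only the first byte of
    // each double: the result is nonsense at run time.
    std::qsort(data, 3, sizeof(double), compare_as_chars);

    // The template version has no such escape hatch: T is deduced as
    // double, so the comparator must take const double* arguments, and
    // the line below would not compile.
    // sort(data, 3, sizeof(double), compare_as_chars);
}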
Next, consider the two size_t parameters. First, they are an accident waiting to happen. The client must take care to pass these parameters in the correct order; they will get no warning from the compiler if they get this wrong. However, the size parameter is now redundant. Since the type being sorted is now part of the function signature, the size is known at compile time. The signature now looks like this:
template <typename T> void sort(T* base, size_t num, int(*comp)(const T*, const T*));
Specializing the function over the type gives the compiler more information. There may be optimizations available if the total volume of memory occupied by the objects being sorted is less than, for example, the size of the L1 cache.
The final parameter is a function pointer. This is a callback function. The sort function will invoke this when it needs to decide the order of two objects. Optimizations are hard to come by here, but since sort is now a function template, it is reasonable to hope that the function will be instantiated at compile time. If the callback function definition is visible to the sort function, it may be inlined into the function template specialization.
There is a better solution, which is to make the callback function a template parameter. Consider this signature:
template <typename T, typename F> void sort(T* base, size_t num, F fn);
There are going to be some restrictions on F. It needs to be invocable, taking two T*s and returning a negative integer value if the first argument precedes the second, zero if they are equal, and a positive integer value otherwise. After all, this is what the original function pointer did. However, we can now pass a lambda function rather than a function pointer, which makes inlining much more likely. We can also pass a std::function object, or even a member function pointer.
Of course, since we know that values will be passed to this function, we can usefully constrain the function still further and take a pair of T const&s, allowing a broader range of functions to be passed and eliminating the need to test for null pointers. This is yet more information for the compiler to work with, and yet more opportunity for optimization.
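Assuming the hypothetical sort template above, a call site might now look like this, with the comparator body visible to the compiler and ripe for inlining:

double data[] = { 2.0, 1.0, 3.0 };

// The lambda takes references, so there are no null pointers to worry
// about, and its definition is visible at the point of instantiation.
sort(data, 3, [](double const& a, double const& b) {
    return (a < b) ? -1 : (b < a) ? 1 : 0;
});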
What next? Well, the first two parameters are now a little suspicious. The qsort function requires that the items being sorted lie in contiguous memory. This gives us a few advantages. First, we can simply pass a pair of iterators. In all probability, you are starting off with a pair of iterators anyway, and using std::distance to calculate the number of elements. Working at the correct level of abstraction, you should pass in the range of values you want to sort:
template <typename InIt, typename F> void sort(InIt first, InIt last, F fn);
This is a safer spelling because you do not have to concern yourself with any arithmetic. You only need to tell the function what the ends of the range are. This provides a design optimization by reducing the opportunity for error, as well as a compilation optimization by announcing the addresses of the first and last elements, rather than requiring the compiler to emit code to calculate them. We can further improve the design by adding requires clauses now that we are in the brave new world of C++20. In fact, we can do even better than that and exchange the iterator pair for a range:
template <typename R, typename F> void sort(R&& r, F fn);
Now we can add meaningful constraints, simplifying the design still further and giving the compiler yet more information, by using the random_access_range concept:
template <std::ranges::random_access_range R, typename F> void sort(R&&, F fn);
As we are sure you have guessed by now, what we have done here is followed the evolution of sort from its pre-C++ incarnation to its C++20 incarnation. You can still find qsort in the standard library, but there is very little reason to use it outside of legacy environments. The function signature quoted above is not from the standard; ranges work with projections, which are beyond the scope of this chapter. For completeness, this is what one of the overloads looks like:
template <std::ranges::random_access_range R,
          typename Comp = std::ranges::less,
          typename Proj = std::identity>
  requires std::sortable<std::ranges::iterator_t<R>, Comp, Proj>
constexpr std::ranges::borrowed_iterator_t<R>
  sort(R&& r, Comp comp = {}, Proj proj = {});
This function:
Takes a range, a comparison function, and a projection
Requires that the range can be sorted with instances of the supplied types
Is a constexpr function
Returns an iterator equal to the end of the range
This function also gives the compiler much more information than qsort and enables a clearer call site.
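To make the difference at the call site concrete, here is a minimal sketch comparing the two; the data and the comparator are, of course, hypothetical:

#include <algorithm>
#include <cstdlib>
#include <vector>

// As in the earlier sketch: a type-erased comparison function for qsort.
int compare_ints(const void* lhs, const void* rhs)
{
    int const a = *static_cast<int const*>(lhs);
    int const b = *static_cast<int const*>(rhs);
    return (a > b) - (a < b);
}

int main()
{
    std::vector<int> v{ 3, 1, 2 };

    // The C way: a pointer, a count, a size, and a comparison function.
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);

    // The C++20 way: the range carries its element type and extent,
    // and the comparison defaults to std::ranges::less.
    std::ranges::sort(v);
}

The second call says what is wanted and nothing more.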
However, you may look at that fistful of characters and contemplate the considerable increase in the size of the declaration. You will, unfortunately, also experience an increase in compilation time. Rather than, as with qsort, simply inserting a function call at the call site, the compiler will first infer the template parameters, ratify the constraints, instantiate and then optimize the function template, and consider the value of inlining. You are exchanging run-time execution for compile-time execution.
Making things faster takes time. Spend that time: it is a wise investment. Be prepared to change your designs to facilitate performance improvements. However, don’t dive in with the virtual stopwatch, combing through every function. Let’s first establish if you need to make that investment.
Recall the title of this chapter: “Design to enable optimization.” The set of steps just taken could just as well have been included in a chapter on abstraction. Not only did we potentially improve the performance characteristics of the function, but we also improved the abstraction of the function, and these two phenomena will often appear together. But the question must be asked: was this optimization premature?
During the 1990s, “premature optimization is the root of all evil” was a warning I heard very often. Usually, it was uttered when someone spent time hand-optimizing a function before the requirements had stabilized. The phrase itself is attributed to either Donald Knuth or Tony Hoare: one may have been quoting the other. Both are remarkable computer scientists; we would go so far as to say legendary, the giants on whose shoulders we stand. Hoare invented quicksort. Knuth invented TeX. We need to pay attention to what they say, but we also need to understand what they say, and know when the phrase is being misapplied.
In “Structured Programming with go to Statements,” published in Computing Surveys, Volume 6, Number 4, December 1974, Donald Knuth writes:
“There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
“Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.”
A common misapplication derives from the Pareto Principle, which suggests that 80 percent of execution time will occur in 20 percent of the code. This being the case, the time to optimize is at the end of development, when that 20 percent can be identified. However, if that 20 percent is diffused throughout the source, then safe modification is going to be all but impossible. Additionally, as Knuth suggests, it is extremely hard to look at a program and estimate the location of the performance hot spots. This discourages an engineer from improving sections of code, because they cannot tell whether spending time on a particular section will have an impact on performance.
Optimization should take place along with refactoring as requirements stabilize. Program-wide optimization is a fool’s errand. Choosing the right algorithm will impact performance to the greatest degree, and making your code amenable to algorithm substitution will enable optimization. There was a time when you could simply wait for the CPU to get faster, but that time has long gone.
Consider what the “3%” figure means: that is three lines of code in each one hundred. How long are your functions? There is a good chance that your short five-line functions are in good order, but it is the longer functions that benefit from closer inspection. The best optimizations you can make are those that improve the clarity of what you are doing and maximize the information you give to the compiler, such as swapping your hand-rolled loops for existing algorithms.
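As a small example of that last point, consider a hand-rolled loop and its algorithmic equivalent; the function names are hypothetical:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hand-rolled: the reader and the compiler must both infer the intent.
// Assumes v is not empty.
int largest(std::vector<int> const& v)
{
    int best = v.front();
    for (std::size_t i = 1; i < v.size(); ++i)
        if (v[i] > best)
            best = v[i];
    return best;
}

// Named algorithm: the intent is stated, for reader and compiler alike.
// Assumes v is not empty.
int largest_with_algorithm(std::vector<int> const& v)
{
    return *std::ranges::max_element(v);
}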
Returning to the sort example, one difference between qsort and std::ranges::sort is the return type. Rather than returning void as qsort does, std::ranges::sort returns an iterator equal to the end of the range. It is not necessary, but it is useful in some cases. One way of designing to enable optimization is to provide a family of functions with different amounts of information being returned, all implemented using the same fundamental algorithm.
The standard algorithms support this. The function std::is_sorted takes a range and a comparison function, and returns a bool telling you whether or not the elements are sorted. std::is_sorted_until returns the iterator at which point a range stops being sorted. std::is_sorted could be implemented as a call to std::is_sorted_until and a check on whether the return value is the end of the range.
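A sketch of that implementation, which may well differ from what your standard library actually does:

#include <algorithm>

// is_sorted expressed in terms of is_sorted_until: the range is sorted
// exactly when the first “unsorted point” is the end of the range.
template <typename ForwardIt, typename Compare>
bool is_sorted_sketch(ForwardIt first, ForwardIt last, Compare comp)
{
    return std::is_sorted_until(first, last, comp) == last;
}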
Similarly, std::mismatch compares two ranges, searching for the first corresponding pair of elements that don’t satisfy a predicate. std::equal is a specialization of this: the predicate is the equality operator, and if std::mismatch returns the end of the range, then the ranges are equal.
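Again as a sketch, and not necessarily how your standard library implements it:

#include <algorithm>
#include <utility>

// equal expressed in terms of mismatch: the ranges are equal exactly
// when no mismatching pair is found before the end of the first range.
template <typename InIt1, typename InIt2>
bool equal_sketch(InIt1 first1, InIt1 last1, InIt2 first2)
{
    return std::mismatch(first1, last1, first2).first == last1;
}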
What this demonstrates is a pair of functions that support a single idea, but which operate at different levels of abstraction. You can choose the correct function for the level at which you are working, and by doing so you are offering the compiler neither more nor less than what it needs to know about the task at hand to generate optimal code.
Optimization is something that the compiler is best placed to do. You may have an idea about the sort of code that will be generated from your source, but it is the compiler that will actually do the job. Signaling your intentions to the compiler, as fully and precisely as possible, is the key to getting the most out of it.
Focus on designing your interfaces for composition. As you develop a function, ask yourself if the parts that make it up could be expressed as individual functions that may have some subsidiary use. Hand optimization is not the only way to improve your code and is most likely premature optimization. Careful composition of functions and correct generalization will support superior optimization opportunities.
Everything has a cost. You may be aware of how many nanoseconds a CPU instruction takes, how many microseconds a thread context switch takes, how many milliseconds it takes to send a 1KB buffer halfway around the world, and so on. Comparison between magnitudes will serve you better than comparison within a single magnitude. Leave the intra-magnitude work to the compiler.