Cippi’s performance test
Performance or low latency is the sweet spot for C++, right? Therefore, I’m quite surprised that only a quarter of the 20 performance rules have substantial content. Hence, I have to improvise a little to make a story out of the existing guidelines. The performance rules of the C++ Core Guidelines start with rules about wrong optimizations, continue with rules about wrong assumptions, and end with rules that enable optimization.
A famous quote can nicely summarize the first three rules.
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
—Donald Knuth, The Art of Computer Programming (1974)
To make it short, keep the phrase “premature optimization is the root of all evil” in mind. Before you make any performance assumption, apply the most critical rule to performance analysis: Measure the performance of your program.
Without performance numbers, you don’t know the following:
Which part of your program is the bottleneck?
How fast is good enough for your user?
How fast could the program potentially be?
Make the performance test with real-world data, and put the tests under version control. Rerun those performance tests each time you change something in your infrastructure, such as the hardware or the compiler.
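Here is a minimal sketch of such a measurement with std::chrono; the function doWork and the iteration count are placeholders that stand in for your real code and your real-world data.

#include <chrono>
#include <iostream>

void doWork() { /* placeholder for the code under test */ }

int main() {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1'000; ++i) doWork();            // drive it with real-world data
    const std::chrono::duration<double> sec =
        std::chrono::steady_clock::now() - start;
    std::cout << "elapsed: " << sec.count() << " s\n";
}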
Applying substantial optimization based on wrong assumptions is a typical antipattern.
Per.4: Don’t assume that complicated code is necessarily faster than simple code
Per.5: Don’t assume that low-level code is necessarily faster than high-level code
Per.6: Don’t make claims about performance without measurements
Before I continue, I have to make a disclaimer: I do not recommend using the infamous singleton pattern. The singleton has many drawbacks. You can read more about them in the section dedicated to singletons in Chapter 3. I use the singleton pattern in this section because the following example is based on a real-world example.
Let me show you that complicated and low-level code does not always pay off. To prove my point, I have to measure the performance of various singleton implementations.
The key idea of the performance test is to invoke the singleton pattern 40 million times from four threads, measure the execution time of each thread, and sum the numbers up. Four threads seem to be the right choice due to my four-core machine. The singleton pattern is initialized lazily; therefore, the first call has to initialize it.
Don’t take the performance numbers too seriously. They should only give a ballpark feeling.
My first implementation is based on the so-called Meyers singleton. It is thread safe because the C++11 standard guarantees that a static variable with block scope is initialized in a thread-safe way.
// singletonMeyers.cpp

#include <chrono>
#include <iostream>
#include <future>

constexpr auto tenMill = 10'000'000;

class MySingleton {
 public:
    static MySingleton& getInstance() {
        static MySingleton instance;
        volatile int dummy{};
        return instance;
    }
 private:
    MySingleton() = default;
    ~MySingleton() = default;
    MySingleton(const MySingleton&) = delete;
    MySingleton& operator=(const MySingleton&) = delete;
};

std::chrono::duration<double> getTime() {
    auto begin = std::chrono::system_clock::now();
    for (size_t i = 0; i < tenMill; ++i) {
        MySingleton::getInstance();
    }
    return std::chrono::system_clock::now() - begin;
}

int main() {
    auto fut1 = std::async(std::launch::async, getTime);
    auto fut2 = std::async(std::launch::async, getTime);
    auto fut3 = std::async(std::launch::async, getTime);
    auto fut4 = std::async(std::launch::async, getTime);

    auto total = fut1.get() + fut2.get() + fut3.get() + fut4.get();
    std::cout << total.count() << '\n';
}
The static variable instance in getInstance uses the guarantee of the C++11 run time that the singleton is initialized in a thread-safe way. Each of the four threads in the main function invokes the singleton ten million times in the loop of getTime. In total, this makes 40 million calls. The volatile variable dummy in getInstance is necessary. Without the variable dummy, the optimizer would do a perfect job and remove the loop entirely; of course, the performance numbers would then be impressive but meaningless.
Let’s try to do better. This time I use atomics to make the singleton pattern thread safe. My implementation is based on the infamous double-checked locking pattern. For the sake of simplicity, I show only the implementation of the class MySingleton.
class MySingleton {
 public:
    static MySingleton* getInstance() {
        MySingleton* sin = instance.load(std::memory_order_acquire);
        if (!sin) {
            std::lock_guard<std::mutex> myLock(myMutex);
            sin = instance.load(std::memory_order_relaxed);
            if (!sin) {
                sin = new MySingleton();
                instance.store(sin, std::memory_order_release);
            }
        }
        volatile int dummy{};
        return sin;
    }
 private:
    MySingleton() = default;
    ~MySingleton() = default;
    MySingleton(const MySingleton&) = delete;
    MySingleton& operator=(const MySingleton&) = delete;

    static std::atomic<MySingleton*> instance;
    static std::mutex myMutex;
};

std::atomic<MySingleton*> MySingleton::instance;
std::mutex MySingleton::myMutex;
To understand the implementation, you have to study the memory orderings and think about the acquire-release semantics and the synchronization and ordering constraints. This is not an easy job, and it may take days.
But you know, highly sophisticated code pays off.
Darn. I forgot to apply the critical rule of performance optimization: “Per.6: Don’t make claims about performance without measurements”. Figure 9.1 shows the performance numbers for the Meyers singleton on Linux. I compiled the program, as always, with maximum optimization.
Figure 9.1 Performance of the Meyers singleton
Now I’m curious. What are the numbers for my highly sophisticated code? See Figure 9.2.
Figure 9.2 Performance of the singleton based on acquire-release semantics
80 percent slower! 80 percent slower, and we cannot even prove that the implementation is correct.
Are we done? No! I don’t have a baseline. The baseline should be your starting point, not your ending point, for a performance test. How fast can 40 million invocations of a singleton be? Invoking the singleton 40 million times from a single-threaded implementation with no synchronization overhead provides a good baseline. See Figure 9.3.
Figure 9.3 Performance of the singleton in the single-threaded case
The single-threaded execution takes about 0.024 seconds, and the multithreaded execution based on the Meyers singleton takes about 0.035 seconds. This means that the synchronization overhead makes the Meyers singleton about 45% slower.
This small synchronization overhead is quite remarkable. If you want to know the entire story behind the thread-safe initialization of the singleton pattern, read the post “Thread-Safe Initialization of a Singleton” at https://www.modernescpp.com/index.php/thread-safe-initialization-of-a-singleton. This story includes additional implementations based on the function std::call_once and the std::once_flag, the lock std::lock_guard, and atomics using sequential consistency. The performance numbers provided are for Linux (GCC) and the Microsoft platform (cl.exe).
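To give a rough idea of the std::call_once variant, here is a sketch of a getInstance based on std::call_once and std::once_flag; it follows the general idea but may differ in detail from the implementations in the referenced post.

#include <mutex>

class MySingleton {
 public:
    static MySingleton& getInstance() {
        // The lambda runs exactly once, even when called concurrently.
        std::call_once(initFlag, [] { instance = new MySingleton; });
        return *instance;
    }
 private:
    MySingleton() = default;
    static std::once_flag initFlag;
    static MySingleton* instance;
};

std::once_flag MySingleton::initFlag;
MySingleton* MySingleton::instance = nullptr;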
The previous rules were about wrong assumptions. Now, I want to take an optimistic approach: design to enable optimization (rule Per.7). This rule applies, in particular, to move semantics because you should write your algorithms using move semantics and not copy semantics. Using move semantics automatically provides a few benefits.
Instead of an expensive copy operation, your algorithm uses a cheap move operation.
Your algorithm is a lot more stable because it requires no memory allocation, and therefore, std::bad_alloc exceptions are not possible.
You can use your algorithm with move-only types such as std::unique_ptr.
You may see a loophole in my argument. What happens when I use a copy-only type in an algorithm requiring move semantics?
// swap.cpp

#include <algorithm>
#include <iostream>
#include <utility>

template <typename T>
void swap(T& a, T& b) noexcept {          // (2)
    T tmp(std::move(a));
    a = std::move(b);
    b = std::move(tmp);
}

class BigArray {
 public:
    explicit BigArray(std::size_t sz): size(sz), data(new int[size]) {}

    BigArray(const BigArray& other): size(other.size), data(new int[other.size]) {
        std::cout << "Copy constructor" << '\n';
        std::copy(other.data, other.data + size, data);
    }

    BigArray& operator=(const BigArray& other) {
        std::cout << "Copy assignment" << '\n';
        if (this != &other) {
            delete[] data;
            data = nullptr;
            size = other.size;
            data = new int[size];
            std::copy(other.data, other.data + size, data);
        }
        return *this;
    }

    ~BigArray() {
        delete[] data;
    }

 private:
    std::size_t size;
    int* data;
};

int main() {
    std::cout << '\n';

    BigArray bigArr1(2011);
    BigArray bigArr2(2017);
    swap(bigArr1, bigArr2);               // (1)

    std::cout << '\n';
}
BigArray does not support move semantics, only copy semantics. What happens if I swap the BigArrays in (1)? My swap algorithm uses move semantics (2) internally. Let’s try it out (see Figure 9.4).
Figure 9.4 Move semantics on a copy-only type
Applying move operations on a copy-only type triggers copy operations. Copy semantics is a kind of fallback for move semantics. You can also see it the other way around: Move is an optimized copy. How is this possible? After all, I asked for a move operation in my swap algorithm. The reason is that std::move returns an rvalue. A const lvalue reference can bind to an rvalue, and the copy constructor and the copy-assignment operator take a const lvalue reference. If BigArray had a move constructor or a move-assignment operator taking rvalue references, both would have higher priority than the copy constructor or the copy-assignment operator. Implementing your algorithms with move semantics means that move semantics automatically kicks in if your data types support it. If not, copy semantics is used as a fallback. In the worst case, you get the classical behavior.
When you study the copy-assignment operator, you see that it has a number of flaws. Here they are:
The expression if (this != &other) checks for self-assignment. Most of the time, self-assignment does not happen, but the check is always performed.
If the allocation data = new int[size] fails, this is already modified: size is wrong, and data is already deleted. This behavior means the copy-assignment operator guarantees only that there is no leak after an exception.
The expression std::copy(other.data, other.data + size, data) is used in both the copy constructor and the copy-assignment operator.
Implementing a swap function for BigArray and implementing the copy-assignment operator with the help of this swap function would solve all of these flaws. Here is the copy-assignment operator, which takes its argument by value. Consequently, a test for self-assignment is not necessary.
BigArray& operator=(BigArray other) {
    swap(*this, other);
    return *this;
}
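The listing above does not show the swap function this operator relies on; a minimal sketch, placed inside the BigArray definition, could look like this (my assumption, not part of the original example):

// Inside the class definition of BigArray; std::swap comes from <utility>,
// which swap.cpp already includes.
friend void swap(BigArray& first, BigArray& second) noexcept {
    std::swap(first.size, second.size);   // exchange the element counts
    std::swap(first.data, second.data);   // exchange the raw pointers
}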
BigArray still has a few flaws. Using a std::vector instead of a C-array solves them. The definition of BigArray boils down to a few lines:
class BigArray {
 public:
    BigArray(std::size_t sz): vec(std::vector<int>(sz)) {}
 private:
    std::vector<int> vec;
};
The compiler can autogenerate the big six if all members of the class support them. The big six are the default constructor, the destructor, the copy and move constructors, and the copy- and move-assignment operators. std::vector supports the big six; therefore, BigArray supports the big six with one exception: Due to the user-defined constructor, BigArray does not have a default constructor.
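If BigArray also needs a default constructor, a minimal sketch (my addition) is to request it explicitly; the compiler then generates it because std::vector is default constructible:

#include <cstddef>
#include <vector>

class BigArray {
 public:
    BigArray() = default;                                  // bring the default constructor back
    BigArray(std::size_t sz): vec(std::vector<int>(sz)) {}
 private:
    std::vector<int> vec;
};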
There are many ways that you can help the compiler to generate more optimized code.
Write local code: Using an in-place invoked lambda instead of a function to adjust the behavior of std::sort is, in general, the faster variant. The compiler has all the information available to generate the most optimized code. In contrast, a function could be defined in another translation unit, which is a hard boundary for the optimizer.
#include <algorithm>
#include <string>
#include <vector>

bool lessLength(const std::string& f, const std::string& s) {
    return f.size() < s.size();
}

int main() {
    std::vector<std::string> vec = {"12345", "123456", "1234", "1", "12", "123", "12345"};

    // a function as predicate
    std::sort(vec.begin(), vec.end(), lessLength);

    // a lambda as predicate
    std::sort(vec.begin(), vec.end(),
              [](const std::string& f, const std::string& s) {
                  return f.size() < s.size();
              });
}
Write simple code: The optimizer looks for known patterns that it can optimize. If your code is very handcrafted and complicated, you make it harder for the optimizer to find these patterns. In the end, you often get less optimized code.
Give the compiler additional hints: When your function cannot throw, or you don’t care, declare it as noexcept. It is equally valuable to the optimizer to declare a virtual function as final if it should not be overridden.
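A small sketch of both hints; the function square and the types Shape and Square are my own example, not from the guidelines:

int square(int x) noexcept { return x * x; }          // noexcept: no exception machinery needed

struct Shape {
    virtual double area() const { return 0.0; }
    virtual ~Shape() = default;
};

struct Square final : Shape {                         // final class: calls can be devirtualized
    explicit Square(double s) noexcept : side(s) {}
    double area() const final { return side * side; } // final override: no further overriding
    double side;
};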
The following example shows the gcd algorithm, which calculates the greatest common divisor at run time. gcd is implemented using the Euclidean algorithm.
int gcd(int a, int b) {
    while (b != 0) {
        auto t = b;
        b = a % b;
        a = t;
    }
    return a;
}
By declaring gcd as constexpr, gcd becomes a function that can be executed at compile time. There are a few restrictions on constexpr functions: gcd must not use static or thread_local variables, exception handling, or goto statements, and all variables have to be initialized and have a literal type. A literal type is essentially a built-in type, a reference, an array of literal types, or a class with a constexpr constructor.
Let's try it out.
// gcd.cpp

#include <iostream>

constexpr int gcd(int a, int b) {
    while (b != 0) {
        auto t = b;
        b = a % b;
        a = t;
    }
    return a;
}

int main() {
    std::cout << '\n';

    constexpr auto res1 = gcd(121, 11);   // (1)
    std::cout << "gcd(121, 11) = " << res1 << '\n';

    auto val = 121;                       // (3)
    auto res2 = gcd(val, 11);             // (2)
    std::cout << "gcd(val, 11) = " << res2 << '\n';

    std::cout << '\n';
}
Declaring gcd as a constexpr function does not mean that it has to run at compile time. It means that gcd has the potential to run at compile time. A constexpr function has to be executed at compile time if it is used in a constant expression. res1 in (1) is a constant expression because I ask for the result with a constexpr variable res1. res2 in (2) is not a constant expression because val in (3) is not a constant expression. When I change res2 to constexpr auto res2, I get an error: val is not a constant expression. Figure 9.5 shows the output of the program.
Figure 9.5 Invoking gcd at compile time and run time
Once more, here is the key observation: You can use a constexpr function at run time and at compile time. To use it at compile time, its arguments have to be constant expressions.
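A minimal sketch that forces compile-time evaluation, assuming the constexpr gcd from gcd.cpp above; both usages require constant expressions.

static_assert(gcd(121, 11) == 11, "gcd evaluated at compile time");
int table[gcd(128, 96)];            // array size: gcd(128, 96) == 32, known at compile time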
Thanks to Compiler Explorer, I can show the relevant assembler instructions for this program. The invocation of the function gcd(121, 11) in (1) boils down to its result as a constant. See Figure 9.6.
Figure 9.6 The relevant assembler instructions for the algorithm gcd
What does predictable memory access mean? When you read an int from memory, more than the size of this one int is read: an entire cache line is fetched from memory and stored in a CPU’s cache. On modern architectures, a cache line typically has 64 bytes. If you now request an additional variable from memory and this variable is already in the cache, the read uses the cache directly, and the operation is much faster.
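As a small back-of-the-envelope sketch (my example, assuming the 64-byte cache line stated above), you can relate the cache-line size to the number of ints it holds:

#include <cstddef>
#include <iostream>

int main() {
    constexpr std::size_t cacheLineSize = 64;                     // typical size, as stated in the text
    constexpr std::size_t intsPerLine   = cacheLineSize / sizeof(int);
    std::cout << "ints per cache line: " << intsPerLine << '\n';  // typically 16
}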
A data structure such as std::vector, which stores its data in a contiguous memory block, is a cache-line-friendly data structure because each element in the cache line is typically used. This cache-line friendliness also holds for std::array and std::string.
std::deque is similar to std::vector, but the elements of a std::deque are not stored in one contiguous memory block. The elements of a std::deque are typically stored in a sequence of fixed-size arrays. The fixed-size arrays are filled before new arrays are added to the std::deque. See Figure 9.7.
Figure 9.7 std::deque
In contrast, std::list and std::forward_list are doubly and singly linked containers, respectively. A std::list grows in both directions, a std::forward_list in only one direction. See Figures 9.8 and 9.9.
Figure 9.8 std::list
Figure 9.9 std::forward_list
This was the theory of cache lines. Now I’m curious: Does it make a performance difference to read and accumulate all elements of a std::vector, a std::deque, a std::list, and a std::forward_list? This small program gives an answer.
 1 // memoryAccess.cpp
 2 
 3 #include <forward_list>
 4 #include <chrono>
 5 #include <deque>
 6 #include <iomanip>
 7 #include <iostream>
 8 #include <list>
 9 #include <string>
10 #include <vector>
11 #include <numeric>
12 #include <random>
13 
14 const int SIZE = 100'000'000;
15 
16 template <typename T>
17 void sumUp(T& t, const std::string& cont) {
18 
19     std::cout << std::fixed << std::setprecision(10);
20 
21     auto begin = std::chrono::steady_clock::now();
22     std::size_t res = std::accumulate(t.begin(), t.end(), 0LL);
23     std::chrono::duration<double> last =
24         std::chrono::steady_clock::now() - begin;
25     std::cout << cont << '\n';
26     std::cout << "time: " << last.count() << '\n';
27     std::cout << "res: " << res << '\n';
28     std::cout << '\n';
29 
30     std::cout << '\n';
31 
32 }
33 
34 int main() {
35 
36     std::cout << '\n';
37 
38     std::random_device seed;
39     std::mt19937 engine(seed());
40     std::uniform_int_distribution<int> dist(0, 100);
41 
42     std::vector<int> randNum;
43     randNum.reserve(SIZE);
44     for (int i = 0; i < SIZE; ++i) {
45         randNum.push_back(dist(engine));
46     }
47 
48     {
49         std::vector<int> vec(randNum.begin(), randNum.end());
50         sumUp(vec, "std::vector<int>");
51     }
52 
53 
54     {
55         std::deque<int> deq(randNum.begin(), randNum.end());
56         sumUp(deq, "std::deque<int>");
57     }
58 
59     {
60         std::list<int> lst(randNum.begin(), randNum.end());
61         sumUp(lst, "std::list<int>");
62     }
63 
64     {
65         std::forward_list<int> forwardLst(randNum.begin(),
66                                           randNum.end());
67         sumUp(forwardLst, "std::forward_list<int>");
68     }
69 
70 }
The program memoryAccess.cpp first creates 100 million random numbers between 0 and 100 (line 38). Then it accumulates the elements using a std::vector (line 50), a std::deque (line 56), a std::list (line 61), and a std::forward_list (line 67). The actual work is done in the function sumUp (lines 16–32). My educated guess is that GCC, Clang, and the Microsoft Visual Studio compiler use quite similar implementations of std::accumulate.
template <class InputIt, class T>
T accumulate(InputIt first, InputIt last, T init) {
    for (; first != last; ++first) {
        init = init + *first;
    }
    return init;
}
Consequently, the access time of the elements is the dominant factor for the overall performance. See Figure 9.10.
Figure 9.10 Memory access for sequence containers on Windows
Here are a few observations:
std::vector is 30 times faster than std::list or std::forward_list.
std::vector is 5 times faster than std::deque.
std::deque is 6 times faster than std::list and std::forward_list.
std::list and std::forward_list are in the same ballpark.
Although I got similar relative numbers on Linux with the GCC compiler, don’t take the performance numbers too seriously. The performance numbers give a solid indication that the access time to the elements heavily depends on the cache-line friendliness of the container.
The section Metaprogramming in Chapter 13 provides an introduction to template metaprogramming, the type-traits library, and constexpr functions as a means to move computation from run time to compile time.