Cippi’s performance test
Performance or low latency is the sweet spot for C++, right? Therefore, I’m quite surprised that only a quarter of the 20 performance rules have substantial content. Hence, I have to improvise a little to make a story out of the existing guidelines. The performance rules of the C++ Core Guidelines start with rules about wrong optimizations, continue with rules about wrong assumptions, and end with rules that enable optimization.
A famous quote can nicely summarize the first three rules.
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
—Donald Knuth, The Art of Computer Programming (1974)
To make it short, keep the phrase “premature optimization is the root of all evil” in mind. Before you make any performance assumption, apply the most critical rule to performance analysis: Measure the performance of your program.
Without performance numbers, you don’t know the following:
Which part of your program is the bottleneck?
How fast is good enough for your user?
How fast could the program potentially be?
Make the performance test with real-world data, and put the tests under version control. Rerun those performance tests each time you change something in your infrastructure, such as the hardware or the compiler.
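Here is a minimal sketch of such a measurement with std::chrono; the function doWork and the iteration count are placeholders that stand in for your real code and your real-world data.

#include <chrono>
#include <iostream>

void doWork() { /* placeholder for the code under test */ }

int main() {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1'000; ++i) doWork();            // drive it with real-world data
    const std::chrono::duration<double> sec =
        std::chrono::steady_clock::now() - start;
    std::cout << "elapsed: " << sec.count() << " s\n";
}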
Applying substantial optimization based on wrong assumptions is a typical antipattern.
Per.4: Don’t assume that complicated code is necessarily faster than simple code
Per.5: Don’t assume that low-level code is necessarily faster than high-level code
Per.6: Don’t make claims about performance without measurements
Before I continue, I have to make a disclaimer: I do not recommend using the infamous singleton pattern. The singleton has many drawbacks. You can read more about them in the section dedicated to singletons in Chapter 3. I use the singleton pattern in this section because the following example is based on a real-world example.
Let me show you that complicated and low-level code does not always pay off. To prove my point, I have to measure the performance of various singleton implementations.
The key idea of the performance test is to invoke the singleton pattern 40 million times from four threads, measure the execution time of each thread, and sum the numbers up. Four threads seem to be the right choice due to my four-core machine. The singleton pattern is initialized lazily; therefore, the first call has to initialize it.
Don’t take the performance numbers too seriously. They should only give a ballpark feeling.
My first implementation is based on the so-called Meyers singleton. It is thread safe because the C++11 standard guarantees that a static variable with block scope is initialized in a thread-safe way.
// singletonMeyers.cpp

#include <chrono>
#include <iostream>
#include <future>

constexpr auto tenMill = 10'000'000;

class MySingleton {
 public:
    static MySingleton& getInstance() {
        static MySingleton instance;
        volatile int dummy{};
        return instance;
    }
 private:
    MySingleton() = default;
    ~MySingleton() = default;
    MySingleton(const MySingleton&) = delete;
    MySingleton& operator=(const MySingleton&) = delete;
};

std::chrono::duration<double> getTime() {
    auto begin = std::chrono::system_clock::now();
    for (size_t i = 0; i < tenMill; ++i) {
        MySingleton::getInstance();
    }
    return std::chrono::system_clock::now() - begin;
}

int main() {
    auto fut1 = std::async(std::launch::async, getTime);
    auto fut2 = std::async(std::launch::async, getTime);
    auto fut3 = std::async(std::launch::async, getTime);
    auto fut4 = std::async(std::launch::async, getTime);

    auto total = fut1.get() + fut2.get() + fut3.get() + fut4.get();
    std::cout << total.count() << '\n';
}
The static variable instance in getInstance uses the guarantee of the C++11 run time that the singleton is initialized in a thread-safe way. Each of the four threads in the main function invokes the singleton ten million times in the loop of getTime. In total, this makes 40 million calls. The volatile variable dummy in getInstance is necessary. Without the variable dummy, the optimizer would do a perfect job and remove the loop entirely; of course, the performance numbers would then be impressive but meaningless.
Let’s try to do better. This time I use atomics to make the singleton pattern thread safe. My implementation is based on the infamous double-checked locking pattern. For the sake of simplicity, I show only the implementation of the class MySingleton.
class MySingleton {
 public:
    static MySingleton* getInstance() {
        MySingleton* sin = instance.load(std::memory_order_acquire);
        if (!sin) {
            std::lock_guard<std::mutex> myLock(myMutex);
            sin = instance.load(std::memory_order_relaxed);
            if (!sin) {
                sin = new MySingleton();
                instance.store(sin, std::memory_order_release);
            }
        }
        volatile int dummy{};
        return sin;
    }
 private:
    MySingleton() = default;
    ~MySingleton() = default;
    MySingleton(const MySingleton&) = delete;
    MySingleton& operator=(const MySingleton&) = delete;

    static std::atomic<MySingleton*> instance;
    static std::mutex myMutex;
};

std::atomic<MySingleton*> MySingleton::instance;
std::mutex MySingleton::myMutex;
To understand the implementation, you have to study the memory orderings and think about the acquire-release semantics and the synchronization and ordering constraints. This is not an easy job, and it may take days.
But you know, highly sophisticated code pays off.
Darn. I forgot to apply the critical rule of performance optimization: “Per.6: Don’t make claims about performance without measurements”. Figure 9.1 shows the performance numbers for the Meyers singleton on Linux. I compiled the program, as always, with maximum optimization.
Figure 9.1 Performance of the Meyers singleton
Now I’m curious. What are the numbers for my highly sophisticated code? See Figure 9.2.
Figure 9.2 Performance of the singleton based on acquire-release semantics
80 percent slower! 80 percent slower, and we cannot even prove that the implementation is correct.
Are we done? No! I don’t have a baseline. The baseline should be your starting point, not your ending point, for a performance test. How fast can 40 million invocations of a singleton be? Invoking the singleton 40 million times from a single-threaded implementation with no synchronization overhead provides a good baseline. See Figure 9.3.
Figure 9.3 Performance of the singleton in the single-threaded case
The single-threaded execution takes about 0.024 seconds, and the multithreaded execution based on the Meyers singleton takes about 0.035 seconds. This means that the synchronization overhead makes the Meyers singleton about 45% slower.
This small synchronization overhead is quite remarkable. If you want to know the entire story behind the thread-safe initialization of the singleton pattern, read the post “Thread-Safe Initialization of a Singleton” at https://www.modernescpp.com/index.php/thread-safe-initialization-of-a-singleton. This story includes additional implementations based on the function std::call_once and the std::once_flag, the lock std::lock_guard, and atomics using sequential consistency. The performance numbers provided are for Linux (GCC) and the Microsoft platform (cl.exe).
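To give a rough idea of the std::call_once variant, here is a sketch of a getInstance based on std::call_once and std::once_flag; it follows the general idea but may differ in detail from the implementations in the referenced post.

#include <mutex>

class MySingleton {
 public:
    static MySingleton& getInstance() {
        // The lambda runs exactly once, even when called concurrently.
        std::call_once(initFlag, [] { instance = new MySingleton; });
        return *instance;
    }
 private:
    MySingleton() = default;
    static std::once_flag initFlag;
    static MySingleton* instance;
};

std::once_flag MySingleton::initFlag;
MySingleton* MySingleton::instance = nullptr;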
The previous rules were about wrong assumptions. Now, I want to take an optimistic approach: design to enable optimization (rule Per.7). This rule applies, in particular, to move semantics because you should write your algorithms using move semantics and not copy semantics. Using move semantics automatically provides a few benefits.
Instead of an expensive copy operation, your algorithm uses a cheap move operation.
Your algorithm is a lot more stable because it requires no memory allocation, and therefore, std::bad_alloc exceptions are not possible.
You can use your algorithm with move-only types such as std::unique_ptr.
You may see a loophole in my argument. What happens when I use a copy-only type in an algorithm requiring move semantics?
// swap.cpp

#include <algorithm>
#include <iostream>
#include <utility>

template <typename T>
void swap(T& a, T& b) noexcept {          // (2)
    T tmp(std::move(a));
    a = std::move(b);
    b = std::move(tmp);
}

class BigArray {
 public:
    explicit BigArray(std::size_t sz): size(sz), data(new int[size]) {}

    BigArray(const BigArray& other): size(other.size), data(new int[other.size]) {
        std::cout << "Copy constructor" << '\n';
        std::copy(other.data, other.data + size, data);
    }

    BigArray& operator=(const BigArray& other) {
        std::cout << "Copy assignment" << '\n';
        if (this != &other) {
            delete[] data;
            data = nullptr;
            size = other.size;
            data = new int[size];
            std::copy(other.data, other.data + size, data);
        }
        return *this;
    }

    ~BigArray() {
        delete[] data;
    }

 private:
    std::size_t size;
    int* data;
};

int main() {
    std::cout << '\n';

    BigArray bigArr1(2011);
    BigArray bigArr2(2017);
    swap(bigArr1, bigArr2);               // (1)

    std::cout << '\n';
}
BigArray does not support move semantics, only copy semantics. What happens if I swap the BigArrays in (1)? My swap algorithm uses move semantics (2) internally. Let’s try it out (see Figure 9.4).
Figure 9.4 Move semantics on a copy-only type
Applying move operations on a copy-only type triggers copy operations. Copy semantics is a kind of fallback for move semantics. You can also see it the other way around: Move is an optimized copy. How is this possible? After all, I asked for a move operation in my swap algorithm. The reason is that std::move returns an rvalue. A const lvalue reference can bind to an rvalue, and the copy constructor and the copy-assignment operator take a const lvalue reference. If BigArray had a move constructor or a move-assignment operator taking rvalue references, both would have higher priority than the copy constructor or the copy-assignment operator. Implementing your algorithms with move semantics means that move semantics automatically kicks in if your data types support it. If not, copy semantics is used as a fallback. In the worst case, you get the classical behavior.
When you study the copy-assignment operator, you see that it has a number of flaws. Here they are:
The expression if (this != &other) checks for self-assignment. Most of the time, self-assignment does not happen, but the check is always performed.
If the allocation data = new int[size] fails, this is already modified: size is wrong, and data is already deleted. This behavior means the copy-assignment operator guarantees only that there is no leak after an exception.
The expression std::copy(other.data, other.data + size, data) is used in both the copy constructor and the copy-assignment operator.
Implementing a swap function for BigArray and implementing the copy-assignment operator with the help of this swap function would solve all of these flaws. Here is the copy-assignment operator, which takes its argument by value. Consequently, a test for self-assignment is not necessary.
BigArray& operator=(BigArray other) {
    swap(*this, other);
    return *this;
}
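The listing above does not show the swap function this operator relies on; a minimal sketch, placed inside the BigArray definition, could look like this (my assumption, not part of the original example):

// Inside the class definition of BigArray; std::swap comes from <utility>,
// which swap.cpp already includes.
friend void swap(BigArray& first, BigArray& second) noexcept {
    std::swap(first.size, second.size);   // exchange the element counts
    std::swap(first.data, second.data);   // exchange the raw pointers
}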
BigArray still has a few flaws. Using a std::vector instead of a C-array solves them. The definition of BigArray boils down to a few lines:
class BigArray {
 public:
    BigArray(std::size_t sz): vec(std::vector<int>(sz)) {}
 private:
    std::vector<int> vec;
};
The compiler can autogenerate the big six if all members of the class support them. The big six are the default constructor, the destructor, the copy and move constructors, and the copy- and move-assignment operators. std::vector supports the big six; therefore, BigArray supports the big six with one exception: Due to the user-defined constructor, BigArray does not have a default constructor.
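If BigArray also needs a default constructor, a minimal sketch (my addition) is to request it explicitly; the compiler then generates it because std::vector is default constructible:

#include <cstddef>
#include <vector>

class BigArray {
 public:
    BigArray() = default;                                  // bring the default constructor back
    BigArray(std::size_t sz): vec(std::vector<int>(sz)) {}
 private:
    std::vector<int> vec;
};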
There are many ways that you can help the compiler to generate more optimized code.
Write local code: Using an in-place invoked lambda instead of a function to adjust the behavior of std::sort is, in general, the faster variant. The compiler has all the information available to generate the most optimized code. In contrast, a function could be defined in another translation unit, which is a hard boundary for the optimizer.
#include <algorithm>
#include <string>
#include <vector>

bool lessLength(const std::string& f, const std::string& s) {
    return f.size() < s.size();
}

int main() {
    std::vector<std::string> vec = {"12345", "123456", "1234", "1", "12", "123", "12345"};

    // a function as predicate
    std::sort(vec.begin(), vec.end(), lessLength);

    // a lambda as predicate
    std::sort(vec.begin(), vec.end(),
              [](const std::string& f, const std::string& s) {
                  return f.size() < s.size();
              });
}
Write simple code: The optimizer looks for known patterns that it can optimize. If your code is very handcrafted and complicated, you make it harder for the optimizer to find these patterns. In the end, you often get less optimized code.
Give the compiler additional hints: When your function cannot throw, or you don’t care, declare it as noexcept. It is equally valuable to the optimizer to declare a virtual function as final if it should not be overridden.
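A small sketch of both hints; the function square and the types Shape and Square are my own example, not from the guidelines:

int square(int x) noexcept { return x * x; }          // noexcept: no exception machinery needed

struct Shape {
    virtual double area() const { return 0.0; }
    virtual ~Shape() = default;
};

struct Square final : Shape {                         // final class: calls can be devirtualized
    explicit Square(double s) noexcept : side(s) {}
    double area() const final { return side * side; } // final override: no further overriding
    double side;
};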
The following example shows the gcd algorithm, which calculates the greatest common divisor at run time. gcd is implemented using the Euclidean algorithm.
int gcd(int a, int b) {
    while (b != 0) {
        auto t = b;
        b = a % b;
        a = t;
    }
    return a;
}
By declaring gcd as constexpr, gcd becomes a function that can be executed at compile time. There are a few restrictions on constexpr functions: gcd must not use static or thread_local variables, exception handling, or goto statements, and all variables have to be initialized and have a literal type. A literal type is essentially a built-in type, a reference, an array of literal types, or a class with a constexpr constructor.
Let's try it out.
// gcd.cpp

#include <iostream>

constexpr int gcd(int a, int b) {
    while (b != 0) {
        auto t = b;
        b = a % b;
        a = t;
    }
    return a;
}

int main() {
    std::cout << '\n';

    constexpr auto res1 = gcd(121, 11);   // (1)
    std::cout << "gcd(121, 11) = " << res1 << '\n';

    auto val = 121;                       // (3)
    auto res2 = gcd(val, 11);             // (2)
    std::cout << "gcd(val, 11) = " << res2 << '\n';

    std::cout << '\n';
}
Declaring gcd as a constexpr function does not mean that it has to run at compile time. It means that gcd has the potential to run at compile time. A constexpr function has to be executed at compile time if it is used in a constant expression. res1 in (1) is a constant expression because I ask for the result with a constexpr variable res1. res2 in (2) is not a constant expression because val in (3) is not a constant expression. When I change res2 to constexpr auto res2, I get an error: val is not a constant expression. Figure 9.5 shows the output of the program.
Figure 9.5 Invoking gcd at compile time and run time
Once more, here is the key observation: You can use a constexpr function at run time and at compile time. To use it at compile time, its arguments have to be constant expressions.
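A minimal sketch that forces compile-time evaluation, assuming the constexpr gcd from gcd.cpp above; both usages require constant expressions.

static_assert(gcd(121, 11) == 11, "gcd evaluated at compile time");
int table[gcd(128, 96)];            // array size: gcd(128, 96) == 32, known at compile time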
Thanks to Compiler Explorer, I can show the relevant assembler instructions for this program. The invocation of the function gcd(121, 11) in (1) boils down to its result as a constant. See Figure 9.6.
Figure 9.6 The relevant assembler instructions for the algorithm gcd
What does predictable memory access mean? When you read an int from memory, more than the size of this one int is read: an entire cache line is fetched from memory and stored in a CPU’s cache. On modern architectures, a cache line typically has 64 bytes. If you now request an additional variable from memory and this variable is already in the cache, the read uses the cache directly, and the operation is much faster.
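As a small back-of-the-envelope sketch (my example, assuming the 64-byte cache line stated above), you can relate the cache-line size to the number of ints it holds:

#include <cstddef>
#include <iostream>

int main() {
    constexpr std::size_t cacheLineSize = 64;                     // typical size, as stated in the text
    constexpr std::size_t intsPerLine   = cacheLineSize / sizeof(int);
    std::cout << "ints per cache line: " << intsPerLine << '\n';  // typically 16
}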
A data structure such as std::vector, which stores its data in a contiguous memory block, is a cache-line-friendly data structure because each element in the cache line is typically used. This cache-line friendliness also holds for std::array and std::string.
std::deque is similar to std::vector, but the elements of a std::deque are not stored in one contiguous memory block. The elements of a std::deque are typically stored in a sequence of fixed-size arrays. The fixed-size arrays are filled before new arrays are added to the std::deque. See Figure 9.7.
Figure 9.7 std::deque
In contrast, std::list and std::forward_list are doubly and singly linked containers, respectively. A std::list grows in both directions, a std::forward_list in only one direction. See Figures 9.8 and 9.9.
Figure 9.8 std::list
Figure 9.9 std::forward_list
This was the theory of cache lines. Now I’m curious: Does it make a performance difference to read and accumulate all elements of a std::vector, a std::deque, a std::list, and a std::forward_list? This small program gives an answer.
 1 // memoryAccess.cpp
 2 
 3 #include <forward_list>
 4 #include <chrono>
 5 #include <deque>
 6 #include <iomanip>
 7 #include <iostream>
 8 #include <list>
 9 #include <string>
10 #include <vector>
11 #include <numeric>
12 #include <random>
13 
14 const int SIZE = 100'000'000;
15 
16 template <typename T>
17 void sumUp(T& t, const std::string& cont) {
18 
19     std::cout << std::fixed << std::setprecision(10);
20 
21     auto begin = std::chrono::steady_clock::now();
22     std::size_t res = std::accumulate(t.begin(), t.end(), 0LL);
23     std::chrono::duration<double> last =
24         std::chrono::steady_clock::now() - begin;
25     std::cout << cont << '\n';
26     std::cout << "time: " << last.count() << '\n';
27     std::cout << "res: " << res << '\n';
28     std::cout << '\n';
29 
30     std::cout << '\n';
31 
32 }
33 
34 int main() {
35 
36     std::cout << '\n';
37 
38     std::random_device seed;
39     std::mt19937 engine(seed());
40     std::uniform_int_distribution<int> dist(0, 100);
41 
42     std::vector<int> randNum;
43     randNum.reserve(SIZE);
44     for (int i = 0; i < SIZE; ++i) {
45         randNum.push_back(dist(engine));
46     }
47 
48     {
49         std::vector<int> vec(randNum.begin(), randNum.end());
50         sumUp(vec, "std::vector<int>");
51     }
52 
53 
54     {
55         std::deque<int> deq(randNum.begin(), randNum.end());
56         sumUp(deq, "std::deque<int>");
57     }
58 
59     {
60         std::list<int> lst(randNum.begin(), randNum.end());
61         sumUp(lst, "std::list<int>");
62     }
63 
64     {
65         std::forward_list<int> forwardLst(randNum.begin(),
66                                           randNum.end());
67         sumUp(forwardLst, "std::forward_list<int>");
68     }
69 
70 }
The program memoryAccess.cpp first creates 100 million random numbers between 0 and 100 (line 38). Then it accumulates the elements using a std::vector (line 50), a std::deque (line 56), a std::list (line 61), and a std::forward_list (line 67). The actual work is done in the function sumUp (lines 16–32). My educated guess is that GCC, Clang, and the Microsoft Visual Studio compiler use quite similar implementations of std::accumulate.
template <class InputIt, class T>
T accumulate(InputIt first, InputIt last, T init) {
    for (; first != last; ++first) {
        init = init + *first;
    }
    return init;
}
Consequently, the access time of the elements is the dominant factor for the overall performance. See Figure 9.10.
Figure 9.10 Memory access for sequence containers on Windows
Here are a few observations:
std::vector is 30 times faster than std::list or std::forward_list.
std::vector is 5 times faster than std::deque.
std::deque is 6 times faster than std::list and std::forward_list.
std::list and std::forward_list are in the same ballpark.
Although I got similar relative numbers on Linux with the GCC compiler, don’t take the performance numbers too seriously. The performance numbers give a solid indication that the access time to the elements heavily depends on the cache-line friendliness of the container.
The section Metaprogramming in Chapter 13 provides an introduction to template metaprogramming, the type-traits library, and constexpr functions as a means to move computation from run time to compile time.