Chapter 10

Concurrency


Cippi’s challenges with threads

The C++ Core Guidelines list about thirty rules for concurrency, targeted at a non-expert audience. They consist of general guidelines and rules on concurrency, parallelism, message passing, and lock-free programming.

General guidelines

Although the rules of this section have a general focus, all of them are important.

CP.1

Assume that your code will run as part of a multi-threaded program

Maybe you are surprised to read this rule. Why should I optimize for a special case? To make it clear, this rule is mainly about code used in libraries. And experience shows that library code is often reused. This reuse means that the code is likely to end up being exercised in a multithreaded program.

The code snippet shows an example from the guidelines.

 1   double cached_computation(double x) {
 2      static double cached_x = 0.0;
 3      static double cached_result = COMPUTATION_OF_ZERO;
 4      double result;
 5

 6      if (cached_x == x) return cached_result;
 7      result = computation(x);
 8      cached_x = x;
 9      cached_result = result;
10      return result;
11   }

The function cached_computation is fine if it runs in a single-threaded environment. This observation does not hold for a multithreaded environment because the static variables cached_x (lines 2, 6, and 8) and cached_result (lines 3, 6, and 9) can be modified simultaneously by various threads.

Unsynchronized reading and writing of a shared non-atomic variable is a data race. Consequently, your program has undefined behavior.

What are your options to get rid of the data race?

  1. Use one lock to protect the entire critical region.

  2. Protect the call to the function cached_computation by a lock.

  3. Make both static variables thread_local. thread_local guarantees that each thread gets its own copy of the variables cached_x and cached_result. A static variable is bound to the lifetime of the main thread; a thread_local variable is bound to the lifetime of its thread.

std::mutex m;
double cached_computation(double x) {                  // (1)
   static double cached_x = 0.0;
   static double cached_result = COMPUTATION_OF_ZERO;
   double result;
   {
      std::lock_guard<std::mutex> lck(m);
      if (cached_x == x) return cached_result;
      result = computation(x);
      cached_x = x;
      cached_result = result;
   }
   return result;
}
   
std::mutex cachedComputationMutex;                  // (2)
{
   std::lock_guard<std::mutex> lck(cachedComputationMutex);
   auto cached = cached_computation(3.33);
}
   
   
double cached_computation(double x) {            // (3)
   thread_local double cached_x = 0.0;
   thread_local double cached_result = COMPUTATION_OF_ZERO;
   double result;
   
   if (cached_x == x) return cached_result;
   result = computation(x);
   cached_x = x;
   cached_result = result;
   return result;
}

First, the C++11 standard guarantees that the C++ run time initializes static variables in a thread-safe way; therefore, you do not need to protect their initialization.

  1. This version uses a coarse-grained locking approach. Usually, you should not use coarse-grained locking such as in this version, but maybe in this use case, it is acceptable.

  2. This version is the most coarse-grained solution because the entire function call is locked. The downside is that the user of the function is responsible for the synchronization. In general, that is a bad idea.

  3. Just make the static variables thread_local, and you are done.

CP.2

Avoid data races

First of all, what is a data race?

  • A data race is a situation in which at least two threads access a non-atomic shared variable without synchronization and at least one thread tries to modify the variable.

The rest is simple. If you have a data race in your program, your program has undefined behavior.

If you read the definition of data race carefully, you will notice that a shared, mutable state is necessary for having a data race. Figure 10.1 displays this critical observation. You should avoid the bottom right quadrant, in particular, in a concurrent environment.


Figure 10.1 Four categories of variables

Let me show you a simple data race.

// dataRace.cpp
   
#include <future>
   
int getUniqueId() {
   static int id = 1;
   return id++;
}
   
int main() {
   
   auto fut1 = std::async([]{ return getUniqueId(); });
   auto fut2 = std::async([]{ return getUniqueId(); }); 
   
   auto id = fut1.get();
   auto id2= fut2.get();
   
}

What can go wrong? For example, id++ is a read-modify-write operation. Even if each of the three operations read, modify, and write were atomic, the whole read-modify-write operation would not be atomic. The likely effect of this data race is that the ids are not unique.
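One way to get rid of this data race is to make the counter an atomic. Here is a minimal sketch: std::atomic's fetch_add turns the read-modify-write into a single atomic operation.

// dataRaceFixedAtomic.cpp (sketch)
   
#include <atomic>
#include <future>
   
int getUniqueId() {
   static std::atomic<int> id{1};
   return id.fetch_add(1);                       // atomic read-modify-write
}
   
int main() {
   
   auto fut1 = std::async([]{ return getUniqueId(); });
   auto fut2 = std::async([]{ return getUniqueId(); });
   
   auto id = fut1.get();                         // the ids are now unique
   auto id2 = fut2.get();
   
}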

CP.3

Minimize explicit sharing of writable data

Based on the previous rule related to data races, your shared data should be constant.

Now the only challenge left to solve is how to initialize the constant shared data in a thread-safe way. C++11 supports a few ways to achieve this.

  1. Initialize your data before you start a thread. This is not a C++11 feature, but it is often quite easy to apply.

    const std::unordered_map<std::string, int> tele = {
       {"Grimm", 1966},
       {"Smith", 1968},
       {"Black", 1930} };
    std::thread t1([&tele] { .... });
    std::thread t2([&tele] { .... });
  2. Use constant expressions because they are initialized at compile time.

    constexpr auto doub = 5.1;
  3. Use the function std::call_once in combination with a std::once_flag. You can put the important initialization stuff into the function do_once; the C++ run time guarantees that the callable registered with the same std::once_flag runs exactly once successfully.

    std::once_flag onceFlag;

    void do_once() {
       std::call_once(onceFlag, []{
          std::cout << "Important initialization" << '\n';
       });
    }
    ...
    std::thread t1(do_once);
    std::thread t2(do_once);
    std::thread t3(do_once);
    std::thread t4(do_once);
  4. Use static variables with block scope because the C++11 run time guarantees that they are initialized in a thread-safe way.

    void func() {
       ....
       static int val = 2011;
       ....
    }
    ...

    std::thread t1(func);
    std::thread t2(func);

CP.4

Think in terms of tasks, rather than threads

What is a task? “Task” is a general term for a unit of execution. Since C++11, we have used task as a special term, which stands for two components: a promise and a future. A promise produces a value that the future can asynchronously pick up. Promise and future can run in different threads and are connected by a secure data channel.

Promise exists in three variations in C++: std::async, std::packaged_task, and std::promise. To get more details on tasks, consult the blog posts I’ve written here: https://www.modernescpp.com/index.php/tag/tasks.

A std::packaged_task and a std::promise have in common that they are quite low level; therefore, I present std::async.

Here are a thread and a future/promise pair that calculate the sum of 3 + 4.

// thread
int res;
std::thread t([&]{ res = 3 + 4; });
t.join();
std::cout << res << '\n';
   
// task
auto fut = std::async([]{ return 3 + 4; });
std::cout << fut.get() << '\n';

What is the fundamental difference between a thread and a future/promise pair? A thread is about how something should be calculated; a task is about what should be calculated.

Let me be more specific.

  • The thread t uses the shared variable res to provide its result. In contrast, the promise std::async uses a secure data channel to communicate its result to the future fut. This sharing means that, for the thread t, you have to protect res.

  • In case of a thread, you explicitly create the thread. This does not hold for std::async: you specify what should be calculated, not how it should be calculated, and the C++ run time decides whether a new thread is created. The sketch below shows how to influence that decision.
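If you want to influence this decision, std::async accepts a launch policy. The following lines are only a sketch: std::launch::async forces the creation of a new thread, and std::launch::deferred executes the callable lazily in the thread that calls get().

#include <future>
#include <iostream>

int main() {
   auto eager = std::async(std::launch::async, []{ return 3 + 4; });     // runs in a new thread
   auto lazy = std::async(std::launch::deferred, []{ return 3 + 4; });   // runs lazily in get()
   std::cout << eager.get() << " " << lazy.get() << '\n';
}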

CP.8

Don’t try to use volatile for synchronization

If you want to have an atomic in Java or C#, you declare it as volatile. You may think that you could do the same in C++. That is wrong. volatile has no multithreading semantics in C++. Atomics are called std::atomic in C++11.

Now, you may be curious: What is the meaning of volatile in C++?

volatile is meant to be used for special objects, on which optimized read or write operations are not allowed. volatile is typically used in embedded programming domains to denote objects that can change independently of the regular program flow. These are, for example, objects that represent an external device (memory-mapped I/O). Because these objects can change independently of the regular program flow, their value is directly written to main memory. Consequently, there is no optimized storing of values in hardware caches.
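The following sketch contrasts the two keywords. std::atomic<int> provides the synchronization you may know from Java's volatile, whereas C++'s volatile only tells the compiler not to optimize away accesses to, for example, a memory-mapped register. The register address used here is purely an assumption for illustration.

#include <atomic>

std::atomic<int> counter{0};              // safe to modify from several threads

void increment() {
   counter.fetch_add(1);                  // atomic read-modify-write; no data race
}

// hypothetical memory-mapped status register of an external device
volatile int* const statusRegister = reinterpret_cast<volatile int*>(0x40000000);

int readStatus() {
   return *statusRegister;                // each read really accesses the hardware
}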

CP.9

Whenever feasible use tools to validate your concurrent code

This rule is probably one of the most important ones, and I fully agree.

My students write lots of bugs; in fact, even many of my own programs contain bugs! How can I be sure? Because of the dynamic code analysis tool ThreadSanitizer and the static code analysis tool CppMem. The use cases for ThreadSanitizer and CppMem are different.

ThreadSanitizer gives you the big picture and detects if the execution of your program has a data race. CppMem gives you detailed insight into small pieces of your code, most of the time including atomics. You get the answer to the question, Which interleavings are possible according to the memory model?

Let’s start with ThreadSanitizer.

ThreadSanitizer

Here is the official introduction to ThreadSanitizer from “ThreadSanitizerCppManual” (https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual): “ThreadSanitizer (aka TSan) is a data race detector for C/C++. Data races are one of the most common and hardest bugs in concurrent systems. A data race occurs when two threads access the same non-atomic variable concurrently, and at least one of the accesses is a write. C++11 standard officially bans data races as undefined behavior.”

ThreadSanitizer has been part of Clang since 3.2 and GCC since 4.8. To use it, you have to compile and link using the -fsanitize=thread option and use at least optimization level -O2 and the flag -g for producing debugging information: -fsanitize=thread -O2 -g.

The run-time overhead is significant: The memory usage may increase 5 to 10 times and the execution time 2 to 20 times. Of course, you know the outstanding law of software development: First, make your program correct; then, make it fast.

Now let's see ThreadSanitizer in action. Here is a typical exercise I have often given in my multithreading classes on condition variables:

Write a small ping-pong game.

Two threads should alternately set a bool value to true or false. One thread sets the value to true and notifies the other thread. The other thread sets the value to false and notifies the original thread. The game should end after a fixed number of iterations.

And this is the typical implementation my students come up with.

 1 // conditionVariablePingPong.cpp
 2 
 3 #include <condition_variable>
 4 #include <iostream>
 5 #include <thread>
 6 
 7 bool dataReady= false; 
 8 
 9 std::mutex mut;
10 std::condition_variable condVar1;
11 std::condition_variable condVar2;
12 
13 int counter = 0;
14 int COUNTLIMIT = 50;
15 
16 void setTrue() { 
17 
 18   while(counter <= COUNTLIMIT) {
 19      std::unique_lock<std::mutex> lck(mut);
20      condVar1.wait(lck, []{return dataReady == false;}); 
21      dataReady = true;
22      ++counter; 
23      std::cout << dataReady << '\n';
24      condVar2.notify_one(); 
25   }
26 }
27 
28 void setFalse() { 
29 
30   while(counter < COUNTLIMIT) { 
31      std::unique_lock<std::mutex> lck(mut);
32      condVar2.wait(lck, []{return dataReady == true;});
33      dataReady = false;
34      std::cout << dataReady << '\n';
35      condVar1.notify_one();
36   }
37 }
38 
39 int main() {
40 
41   std::cout << std::boolalpha << '\n';
42 
43   std::cout << "Begin: " << dataReady << '\n';
44 
45   std::thread t1(setTrue);
46   std::thread t2(setFalse);
47 
48   t1.join();
49   t2.join();
50 
51   dataReady = false;
52   std::cout << "End: " << dataReady << '\n';
53 
54   std::cout << '\n';
55 
56 }

Function setTrue (line 16) sets the boolean value dataReady (line 21) to true, and function setFalse (line 28) sets it to false. The play starts with setTrue. The condition variable in the function waits for the notification and therefore first checks the boolean dataReady (line 20). Afterward, the function increments the counter (line 22) and notifies the other thread with the help of the condition variable condVar2 (line 24). The function setFalse follows the same workflow. When the counter exceeds COUNTLIMIT (line 18), the game ends. Fine? No!

There is a data race on the counter. It is read (line 30) and written (line 22) without synchronization. ThreadSanitizer shows the data race (see Figure 10.2).


Figure 10.2 Data race detection with ThreadSanitizer
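A minimal fix, assuming the rest of conditionVariablePingPong.cpp stays unchanged, is to make the counter an atomic; the write in setTrue (line 22) and the unsynchronized read in setFalse (line 30) are then atomic operations and no longer a data race.

#include <atomic>

std::atomic<int> counter{0};   // replaces line 13: int counter = 0;

Alternatively, you can read the counter only while holding the mutex mut.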

ThreadSanitizer detects data races during run time; CppMem lets you analyze small code snippets.

CppMem

Here is a short overview of CppMem.

The online tool, which you can also install on your PC, provides very valuable services.

  1. CppMem validates small code snippets, typically including atomics.

  2. The very accurate analysis of CppMem gives you deep insight into the C++ memory model.

For more in-depth insight, read my blog posts about CppMem at http://www.modernescpp.com/index.php/tag/cppmem. For now, I refer to the first point and provide you a thousand-foot view of CppMem.

My overview uses the default configuration of the tool. This overview should give you the basis for further experiments.

For simplicity reasons, I refer to the red numbers in Figure 10.3.


Figure 10.3 Overview of CppMem

  1. Model

    • This specifies the memory model; the default setting, preferred, stands for the C++ memory model.

  2. Program

    • This is the executable program in a C- or C++-like syntax.

    • CppMem offers you a bunch of typical interleavings of atomics. To get the details of these programs, read the very well-written article “Mathematizing C++ Concurrency” by Mark Batty et al. at http://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf. Of course, you can also use your code.

    • CppMem is about multithreading, so there are a few simplifications.

      • You can easily define two threads by the symbols {{{ ... ||| ... }}}. The three dots ( ... ) stand for the work package of the thread.

  3. Display Relations

    • This describes the relations between the read, write, and read-write modifications on atomic operations, fences, and locks.

    • If the relation is enabled, it is displayed in the annotated graph (see point 6).

      • sb: sequenced-before

      • rf: read from

      • mo: modification order

      • sc: sequential consistency

      • lo: lock order

      • sw: synchronizes-with

      • dob: dependency-ordered-before

      • data_races

  4. Display Layout

    • You can choose with this switch which Doxygraph graph is used.

  5. Choose the Executions

    • Switch between the various consistent executions.

  6. Annotated Graph

    • This displays the annotated graph.

Now, let’s try it out.

The program dataRaceOnX.cpp has a data race on the int variable x. y is an atomic and, therefore, fine from a concurrency perspective.

// dataRaceOnX.cpp
   
#include <atomic>
#include <iostream>
#include <thread>

int x = 0;
std::atomic<int> y{0};
   
void writing() { 
   x = 2000; 
   y.store(11); 
}
   
void reading(){ 
   std::cout << y.load() << " "; 
   std::cout << x << '\n'; 
}
   
int main() {
   
   std::thread thread1(writing);
   std::thread thread2(reading);
   
   thread1.join();
   thread2.join();
   
}

In order to use CppMem, you must rewrite your C++ program in the dialect of C expected by CppMem’s parser. Cutting and pasting standard C++ code fails with a cryptic “Frontc.ParseError.” Here is the equivalent program written in the more concise CppMem syntax.

// dataRaceOnXCppMem.txt
   
int main(){
   int x = 0;
   atomic_int y = 0;
   
   {{{ 
     {
       x = 2000;
       y.store(11);
     }
   |||

     {
       y.load();
       x;
     }
   }}}
}

CppMem shows it immediately. The first consistent execution has a data race on x. See Figure 10.4.


Figure 10.4 A data race in CppMem

You can observe the data race in the graph. It is the yellow edge (dr) between the write operation (x=2000) and the read operation (x=0).

To get more details on CppMem, read the blog posts I’ve written at https://www.modernescpp.com/index.php/tag/cppmem.

Concurrency

Concurrency is a challenging topic. This challenge is due, in particular, to the low level of abstraction we currently have at our disposal. Knowing and applying the rules of this section is, therefore, crucial for writing well-defined multithreading programs.

Let me classify about fifteen rules into categories related to locks, threads, and condition variables, but also data sharing between threads, resource considerations, and a sometimes overlooked danger.

Locks

NNN stands for No Naked New and means that memory allocation should not be done as a standalone operation but inside a manager object (“R.12: Immediately give the result of an explicit resource allocation to a manager object”). The same observation holds for mutexes. Mutexes should immediately be given to a manager object, which is, in this case, a lock. In modern C++, we have a std::lock_guard, std::unique_lock, std::shared_lock (C++14), or std::scoped_lock (C++17). Consequently, keep the acronym NNM, which stands for No Naked Mutex, in mind. Locks implement the RAII idiom. The crucial idea behind the RAII idiom is to bind the lifetime of a resource to the lifetime of a local variable. C++ automatically manages the lifetime of locals.

CP.20

Use RAII, never plain lock()/unlock()

The small code snippet should make the value of a lock immediately clear.

std::mutex mtx;
   
void do_stuff() {
   mtx.lock();
   // ... do stuff ... (1)
   mtx.unlock();
}

It doesn’t matter if an exception occurs in (1) or you just forgot to unlock the mtx; in both cases, you will get a deadlock if another thread wants to acquire (lock) the std::mutex mtx. Locks come to the rescue.

std::mutex mtx;
   
void do_stuff() {
   std::lock_guard<std::mutex> lck {mtx};
   // ... do stuff ...
}

Put the mutex into a lock, and the mutex is automatically locked in the constructor of the std::lock_guard and unlocked when lck goes out of scope.

CP.21

Use std::lock() or std::scoped_lock to acquire multiple mutexes

If a thread needs more than one mutex at a time, you have to be extremely careful that you always lock the mutexes in the same sequence. If not, a bad interleaving of threads may cause a deadlock. The following program causes a deadlock.

// lockGuardDeadlock.cpp
   
#include <iostream>
#include <chrono>
#include <mutex>
#include <thread>
   
struct CriticalData {
   std::mutex mut;
};
   
void deadLock(CriticalData& a, CriticalData& b) {
   
   std::lock_guard<std::mutex> guard1(a.mut);
   std::cout << "Thread: " << std::this_thread::get_id() << '\n';
   
   std::this_thread::sleep_for(std::chrono::milliseconds(1));
   
   std::lock_guard<std::mutex> guard2(b.mut);
   std::cout << "Thread: " << std::this_thread::get_id() << '\n';
   
   // do something with a and b (critical region)          (2)
}
   
int main() {
   
   std::cout << '\n';
   
   CriticalData c1;
   CriticalData c2;
   
   std::thread t1([&]{deadLock(c1, c2);});                 // (1)
   std::thread t2([&]{deadLock(c2, c1);});                 // (1)

   t1.join();
   t2.join();
   
   std::cout << '\n';
   
}

Threads t1 and t2 need two CriticalData objects to perform their jobs (2). CriticalData has its own mutex mut to synchronize the access. Unfortunately, both invoke the function deadLock with the arguments c1 and c2 in a different sequence (1). Now we have a race condition that could end up in a deadlock. Thread t1 can lock the first mutex a.mut but not the second one b.mut because, in the meantime, thread t2 locked the second one. See Figure 10.5.


Figure 10.5 A deadlock due to multiple locked mutexes

The easiest way to solve the deadlock is to lock both mutexes atomically.

With C++11, you can use a std::unique_lock together with std::lock. Thanks to the tag std::defer_lock, the std::unique_lock takes the mutex without locking it. The locking finally takes place in the std::lock call. std::lock can take an arbitrary number of arguments.

void deadLock(CriticalData& a, CriticalData& b) {
   std::unique_lock<std::mutex> guard1(a.mut, std::defer_lock);
   std::unique_lock<std::mutex> guard2(b.mut, std::defer_lock);
   std::lock(guard1, guard2);
   // do something with a and b (critical region)
}

With C++17, a std::scoped_lock can lock an arbitrary number of mutexes in an atomic operation.

void deadLock(CriticalData& a, CriticalData& b) {
   std::scoped_lock scoLock(a.mut, b.mut);

   // do something with a and b (critical region)
}

CP.22

Never call unknown code while holding a lock (e.g., a callback)

Why is this code snippet bad, and why should it not pass a code review?

std::mutex m; 
{
   std::lock_guard<std::mutex> lockGuard(m);
   sharedVariable = unknownFunction();
}

I can only speculate about the unknownFunction. If unknownFunction

  • Tries to lock the mutex m, you get undefined behavior. Most of the time, the result of this undefined behavior is a deadlock.

  • Starts a new thread that tries to lock the mutex m, you get a deadlock.

  • Locks another mutex m2, you may get a deadlock because you lock the two mutexes m and m2 at the same time. Another thread may lock the same mutexes in a different sequence.

  • Does not directly or indirectly try to lock the mutex m, all seems to be fine. “Seems” because a coworker can modify the function, and you get a changed version of the function unknownFunction. Now all bets are off.

  • Works as expected, you still may have a performance issue because the unknownFunction takes quite a while. What was meant to be a multithreaded program behaves similarly to a single-threaded program.

To overcome the issues, use a local variable and invoke the unknown function outside of the critical region.

std::mutex m;
auto tempVar = unknownFunction();
{
   std::lock_guard<std::mutex> lockGuard(m);
   sharedVariable = tempVar;
}

This additional indirection solves all issues. tempVar is a local variable and can, therefore, not be a victim of a data race. No victim means that you can invoke unknownFunction without a synchronization mechanism. Additionally, the time for holding the lock is reduced to its bare minimum: assigning the value of tempVar to sharedVariable.

Threads

Threads are the basic building block for concurrent and parallel programming. With each new C++ standard, threads become more and more an implementation detail for concurrency. For example, with C++17, we got the parallel STL, which allows us to specify the execution policy; with C++20, coroutines; and with C++23, we can hope for transactional memory.

CP.23

Think of a joining thread as a scoped container

and

CP.24

Think of a thread as a global container

The slight variation of the code snippet from the C++ Core Guidelines should make both rules clear:

void f(int* p) {
   // ...
   *p = 99;
   // ...
}
   
int glob = 33;
   
void some_fct(int* p) {                  // (1)
   int x = 77;
   std::thread t0(f, &x);               // OK
   std::thread t1(f, p);                // OK
   std::thread t2(f, &glob);           // OK
   auto q = std::make_unique<int>(99);
   std::thread t3(f, q.get());         // OK
   // ...
   t0.join();

   t1.join();
   t2.join();
   t3.join();
   // ...
}
   
void some_fct2(int* p) {                 // (2)
   int x = 77;
   std::thread t0(f, &x);               // bad
   std::thread t1(f, p);                // bad
   std::thread t2(f, &glob);            // OK
   auto q = std::make_unique<int>(99);
   std::thread t3(f, q.get());         // bad
   // ...
   t0.detach();
   t1.detach();
   t2.detach();
   t3.detach();
   // ...
}

The only difference between the functions some_fct (1) and some_fct2 (2) is that the first variation joins its created threads, but the second variation detaches all created threads.

First of all, you have to join or detach the child threads. If you don't, std::terminate is called in the destructor of the std::thread object (see rule "CP.25: Prefer gsl::joining_thread over std::thread").

The difference between joining and detaching a created thread is the following: When the creator calls thr.join() on the created thread thr, it waits until the created thread is done. thr.join() is a synchronization point. Consequently, the child thread thr can use all variables (state) of the enclosing scope in which it was created, and all calls of the function f are well-defined.

On the contrary, a thr.detach() call does not wait and is, therefore, not a synchronization point. This means that the created thread can outlive its creator. Consequently, using variables of the enclosing scope may not be valid anymore. This is precisely the issue in the function some_fct2. The variable x, the pointer p, or the resource of the std::unique_ptr q may not be valid anymore.

A thread can be seen as a container that uses variables from the outside. In the case of a joining thread, the lifetime of this container is scoped; a detached thread behaves like a global container.

CP.25

Prefer std::jthread over std::thread

The original title of this rule is “CP.25: Prefer gsl::joining_thread over std::thread.” I replaced gsl::joining_thread from the Guidelines Support Library with std::jthread from C++20.

In the following program, I forgot to join the thread t.

// threadWithoutJoin.cpp
   
#include <iostream>
#include <thread>
   
int main() {
   
   std::thread t([]{ 
      std::cout << std::this_thread::get_id() << '\n';
   });
   
}

The execution of the program ends abruptly. See Figure 10.6.


Figure 10.6 Forgot to join a thread

The lifetime of the created thread t ends with its callable. The creator has two choices. First, it waits until its child is done (t.join()). Second, it detaches itself from its child (t.detach()). A thread t with a callable unit—you can create threads without callable units—is called joinable if neither a t.join() nor a t.detach() call has happened. The destructor of a joinable thread calls std::terminate, which ends in std::abort. In our case, the program terminated before the child thread had time to display its id.

In contrast to a std::thread, a std::jthread automatically joins in its destructor if it is still joinable. Replacing the std::thread with a std::jthread, therefore, solves the issue.

// threadWithJoin.cpp; C++20
   
#include <iostream>
#include <thread>

int main() {
   
   std::jthread t([]{ 
      std::cout << std::this_thread::get_id() << '\n';
   });
   
}

CP.26

Don’t detach() a thread

This rule sounds strange. The C++11 standard supports detaching a thread, but we should not do it! The reason is that detaching a thread can be quite challenging. For example, have a look at this small program that has undefined behavior. Even objects with static duration can be critical.

// threadDetach.cpp
   
#include <iostream>
#include <string>
#include <thread>
   
void func() {
   std::string s{"C++11"};
   std::thread t([&s]{ std::cout << s << '\n';});
   t.detach();
}
   
int main() {
   func();
}

The lambda takes s by reference. This is undefined behavior because the child thread t may use the variable s after it goes out of scope. Stop! This is the obvious problem, but there is a hidden issue that many programmers overlook: std::cout. std::cout has static duration, which means its lifetime ends with the end of the process. We additionally have a race condition: the detached thread t may still use std::cout at that point.

Race condition: A race condition is a situation in which the result of an operation depends on the interleaving of certain individual operations.
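A sketch of a fix for threadDetach.cpp: capture s by value so that the child thread owns its own copy, and join the thread instead of detaching it so that it cannot outlive std::cout.

// threadJoin.cpp (sketch)
   
#include <iostream>
#include <string>
#include <thread>
   
void func() {
   std::string s{"C++11"};
   std::thread t([s]{ std::cout << s << '\n'; });   // capture by value
   t.join();                                        // wait for the child thread
}
   
int main() {
   func();
}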

Condition variables

CP.42

Don’t wait without a condition

Condition variables support a quite simple concept. One thread prepares something and sends a notification to another thread that is waiting for it.

Here is the rationale for the rule: "A wait without a condition can miss a wakeup or wake up simply to find that there is no work to do." What does that mean? Condition variables are subject to two very serious issues: lost wakeups and spurious wakeups. The key concern about condition variables is that they have no memory.

Before I present you with this issue, let me first present the correct way to use condition variables.

// conditionVariable.cpp
   
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
   
std::mutex mut;
std::condition_variable condVar; 
   
bool dataReady{false};
   
void waitingForWork() {
   std::cout << "Waiting " << '\n';
   std::unique_lock<std::mutex> lck(mut);
   condVar.wait(lck, []{ return dataReady; });                               // (4) 
   std::cout << "Running " << '\n';
}
   
void setDataReady() {
   {
      std::lock_guard<std::mutex> lck(mut);
      dataReady = true;
   }
   std::cout << "Data prepared" << '\n';
   condVar.notify_one();                                                    // (3) 
}
   
int main() {
   
   std::cout << '\n';

   std::thread t1(waitingForWork);                                         // (1) 
   std::thread t2(setDataReady);                                          // (2) 
   
   t1.join();
   t2.join();
   
   std::cout << '\n';
   
}

How does the synchronization work? The program has two child threads: t1 and t2. They get their work packages waitingForWork and setDataReady in (1) and (2). setDataReady sends a notification—using the condition variable condVar—that it is done with the preparation of the work: condVar.notify_one() (3). While holding the lock, thread t1 waits for its notification: condVar.wait(lck, []{ return dataReady; }) (4). The sender and receiver need a lock. In the case of the sender, a std::lock_guard is sufficient because it calls lock and unlock only once. In the case of the receiver, a std::unique_lock is necessary because it locks and unlocks its mutex repeatedly while waiting.

Figure 10.7 shows the output of the program.


Figure 10.7 A condition variable in action

Maybe you are wondering, Why do we need a predicate for the wait call since you can invoke wait without a predicate? This workflow seems too complicated for such a simple synchronization of threads.

Now we are back to the missing memory of condition variables and the two phenomena called lost wakeup and spurious wakeup.

  • Lost wakeup: The phenomenon of the lost wakeup is that the sender sends its notification before the receiver begins waiting. The consequence is that the notification is lost.

  • Spurious wakeup: The receiver may wake up even if no notification has been sent. At least POSIX Threads and the Windows API can be victims of these phenomena.

To avoid these two issues, you have to use an additional predicate as memory or, as the rule states it, an additional condition. If you use a condition variable without a predicate, it’s possible that you will get a lost wakeup and, therefore, a deadlock because the waiting thread is waiting for something that never happens.

The following program uses a condition variable without an additional predicate. Let’s see what happens:

// conditionVariableWithoutPredicate.cpp
   
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
   
std::mutex mut;
std::condition_variable condVar;
   
void waitingForWork() {
   std::cout << "Waiting " << '\n';
   std::unique_lock<std::mutex> lck(mut);
   condVar.wait(lck); 
   std::cout << "Running " << '\n';
}
   
void setDataReady() {
   std::cout << "Data prepared" << '\n';
   condVar.notify_one(); 
}
   
int main() {
   
   std::cout << '\n';
   
   std::thread t1(waitingForWork);
   std::thread t2(setDataReady);

   t1.join();
   t2.join();
   
   std::cout << '\n';
   
}

Whenever the notifying thread t2 sends its notification before the waiting thread t1 starts to wait, the notification is lost. The second execution shows this phenomenon, which results in a deadlock. See Figure 10.8.


Figure 10.8 A condition variable without a predicate

Data sharing

The less you share data, and the more you work with local variables, the better. Sometimes, though, you have no other option but to share data. For example, the child thread wants to communicate its work to its parent thread.

CP.31

Pass small amounts of data between threads by value, rather than by reference or pointer

Passing data to a thread by value immediately gives you two benefits:

  1. There is no sharing, and therefore, no data race is possible. The requirement for a data race is a mutable, shared state.

  2. You don’t have to be concerned about the lifetime of the data. The data stays alive for the lifetime of the created thread.

Of course, the crucial question is, What does a small amount of data mean? The C++ Core Guidelines are not clear about this point. In the rule "F.16: For 'in' parameters, pass cheaply-copied types by value and others by reference to const," the C++ Core Guidelines state that 4 * sizeof(int) is a rule of thumb for functions. This means that data smaller than 4 * sizeof(int) should be passed by value and bigger data by reference or pointer.

In the end, you must measure the performance of your program if necessary.
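Here is a minimal sketch of both benefits. The thread gets its own copy of the small Point value, so there is neither sharing nor a lifetime issue, even though the original goes out of scope before the join. Point is only an illustrative type.

#include <iostream>
#include <thread>

struct Point { int x; int y; };                  // small and cheap to copy

void show(Point p) {                             // taken by value
   std::cout << p.x << " " << p.y << '\n';
}

int main() {
   std::thread t;
   {
      Point local{1, 2};
      t = std::thread(show, local);              // local is copied into the thread
   }                                             // local is gone; the copy is still valid
   t.join();
}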

CP.32

To share ownership between unrelated threads use shared_ptr

Imagine that you have an object that you want to share between unrelated threads. Unrelated means that there is no data race on that object. The key question is, Who is the owner of the object and, therefore, responsible for releasing the memory? Now you can choose between a memory leak if you don’t deallocate the memory or undefined behavior if you invoke delete more than once. Most of the time, the undefined behavior causes a run-time crash.

// threadSharesOwnership.cpp
   
#include <iostream>
#include <thread>
   
using namespace std::literals::chrono_literals;
   
struct MyInt {
   int val{2017};
   ~MyInt() {                                                                 // (4)
      std::cout << "Goodbye" << '\n'; 
   }
};
   
void showNumber(const MyInt* myInt) {
   std::cout << myInt->val << '\n';
}
   
void threadCreator() {
   MyInt* tmpInt= new MyInt;                                                  // (1)
   
   std::thread t1(showNumber, tmpInt);                                        // (2)
   std::thread t2(showNumber, tmpInt);                                        // (3)

   t1.detach();
   t2.detach();
}
   
int main() {
   
   std::cout << '\n';
   
   threadCreator();
   std::this_thread::sleep_for(1s);
   
   std::cout << '\n';
   
}

The example is intentionally simple. I let the main thread sleep for one second to be sure that it outlived the lifetime of the children threads t1 and t2. This is, of course, not an appropriate synchronization, but it helps to make my point. The vital issue of the program is, Who is responsible for the deletion of tmpInt (1)? Thread t1 (2), thread t2 (3), or the function (main thread) itself? Because I cannot forecast how long each thread runs, I decided to go with a memory leak. Consequently, the destructor of MyInt (4) is never called (see Figure 10.9).


Figure 10.9 Shared ownership using a pointer

The lifetime issues are easy to handle if I use a std::shared_ptr.

// threadSharesOwnershipSharedPtr.cpp
   
#include <iostream>
#include <memory>
#include <thread>

using namespace std::literals::chrono_literals;
   
struct MyInt {
   int val{2017};
   ~MyInt() {
      std::cout << "Goodbye" << '\n';
   }
};
   
void showNumber(std::shared_ptr<MyInt> myInt) { 
   std::cout << myInt->val << '\n';
}
   
void threadCreator() {
   auto sharedPtr = std::make_shared<MyInt>(); // (1) 
   
   std::thread t1(showNumber, sharedPtr);
   std::thread t2(showNumber, sharedPtr);
   
   t1.detach();
   t2.detach();
}
   
int main() {
   
   std::cout << '\n';
   
   threadCreator();
   std::this_thread::sleep_for(1s);
   
   std::cout << '\n';
   
}

Two small changes to the source code are necessary. First, the pointer in (1) becomes a std::shared_ptr, and second, the function showNumber takes a smart pointer instead of a raw pointer. I assume for simplicity reasons that the threads t1 and t2 are done within a second. See Figure 10.10.


Figure 10.10 Shared ownership using a smart pointer
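Instead of detaching the threads and letting the main thread sleep, you can also join them. The following sketch of threadCreator, assuming the definitions from threadSharesOwnershipSharedPtr.cpp, needs no sleep at all; the std::shared_ptr keeps MyInt alive as long as any owner uses it.

void threadCreator() {
   auto sharedPtr = std::make_shared<MyInt>();

   std::thread t1(showNumber, sharedPtr);
   std::thread t2(showNumber, sharedPtr);

   t1.join();                                   // wait instead of detach
   t2.join();                                   // MyInt dies with its last owner
}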

Resources

One of the main reasons for using concurrency is performance. Keep in mind that using threads requires resources: time and memory. The resource usage begins with creation, continues with context switching from user space to kernel space, and ends with the destruction of a thread. Additionally, a thread has its own state that has to be allocated and maintained.

CP.40

Minimize context switching

and

CP.41

Minimize thread creation and destruction

How expensive is a thread? The answer to this question is the reason for this rule. Let me first talk about the usual size of a thread and then about the costs of its creation.

Size

A std::thread is a thin wrapper around the native thread, and I’m, therefore, interested in the size of a Windows thread and a POSIX thread.

Table 10.1 Typical thread size

Architecture    Default stack size
i386            2MB
IA-64           32MB
PowerPC         4MB
S/390           2MB
Sparc-32        2MB
Sparc-64        4MB
x86_64          2MB

Creation

I didn’t find numbers on how much time it takes to create a thread. To get a gut feeling, I made a simple performance test on Linux and Windows. Don’t use the numbers to compare Linux and Windows. This is not the point of this experiment.

I used GCC 6.2.1 on a desktop and Microsoft Visual Studio 2017 on a laptop for my performance tests. I compiled the programs with maximum optimization on both platforms.

Here is the small test program.

// threadCreationPerformance.cpp
   
#include <chrono>
#include <iostream>
#include <thread>
   
constexpr long long numThreads= 1'000'000;
   
int main() {
   
   auto start = std::chrono::system_clock::now();
   
   for (long long i = 0; i < numThreads; ++i) {           // (1)
      std::thread([]{}).detach(); 
   }
   
   std::chrono::duration<double> dur = 
      std::chrono::system_clock::now() - start;
   
   std::cout << "time: " << dur.count() 
   
}

The program creates one million threads that execute an empty lambda function (1). Figures 10.11 and 10.12 show the results for Linux and Windows.


Figure 10.11 Thread creation on Linux


Figure 10.12 Thread creation on Windows

This means that the creation of a single thread took about 14.5 seconds / 1000000 = 14.5 microseconds on Linux and about 44 seconds / 1000000 = 44 microseconds on Windows.

To put it another way, in one second, you can create about 69,000 threads on Linux and 23,000 threads on Windows.

CP.43

Minimize time spent in a critical section

The less time you lock a mutex, the more time other threads can run. Let’s take the example of the notification of a condition variable.

void setDataReady() {
   std::lock_guard<std::mutex> lck(mut);
   dataReady = true;                               // (1) 
   std::cout << "Data prepared" << '\n'; 
   condVar.notify_one(); 
}

The mutex mut is locked at the beginning of the function and unlocked at the end of the function. Locking the whole function is not necessary: only the expression dataReady = true (1) has to be protected.

First, std::cout is thread safe: the C++11 standard guarantees that each character is written atomically; characters from different threads may interleave, but there is no data race. Second, the notification condVar.notify_one() is thread safe.

Here is the improved version of the function setDataReady:

void setDataReady() {
   { 
      std::lock_guard<std::mutex> lck(mut);
      dataReady = true;
   } 
   std::cout << "Data prepared" << '\n';
   condVar.notify_one(); 
}

Overlooked danger

CP.44

Remember to name your lock_guards and unique_locks

When you don't name the std::lock_guard or the std::unique_lock, you just create a temporary that is created and immediately destroyed. The std::lock_guard or std::unique_lock automatically locks its mutex in its constructor and unlocks it in its destructor. This pattern is called RAII.

My small example shows the conceptual behavior of a std::lock_guard. Its big brother std::unique_lock supports more operations.

// myGuard.cpp
   
#include <mutex>
#include <iostream>
   
template <typename T>
class MyGuard {
public:
   explicit MyGuard(T& m): myMutex(m) {
      std::cout << "lock" << '\n';
      myMutex.lock();
   }
   ~MyGuard() {
      myMutex.unlock();
      std::cout << "unlock" << '\n';
   }
private:
   T& myMutex;
};
   
int main() {
   
   std::cout << '\n';
   
   std::mutex m;
   MyGuard<std::mutex> {m};                                          // (1) oops!
   std::cout << "CRITICAL SECTION" << '\n';                         // (2)
   
   std::cout << '\n';
   
}                                                                 // (3)

MyGuard calls lock and unlock in its constructor and its destructor. Because MyGuard<std::mutex> {m}; is a temporary, both the constructor and the destructor are called in (1); the destructor is not, as usual, called at the end of main (3). As a consequence, the critical section in line (2) is executed unprotected.

This execution of the program shows that the output of the message unlock happens before the output of the message CRITICAL SECTION. See Figure 10.13.


Figure 10.13 Using a temporary lock std::lock_guard

By giving the unnamed lock MyGuard<std::mutex> {m}; in (1) a name, MyGuard<std::mutex> myGuard{m};, the critical section becomes protected. See Figure 10.14.


Figure 10.14 Using a named temporary std::lock_guard

Parallelism

Besides the title, there is no content regarding parallelism in the C++ Core Guidelines. To fill the gap, I have provided a short introduction to the parallel STL and a rule. First of all, here is my rule.

Prefer the parallel algorithms of the STL to handcrafted solutions with threads.

The idea is quite simple. The Standard Template Library has more than 100 algorithms for searching, counting, and manipulating ranges and their elements. With C++17, 69 of them are overloaded and a few new ones are added. The overloaded and new algorithms can be invoked with a so-called execution policy. By using the execution policy, you can specify whether the algorithm should run sequentially, in parallel, or in parallel and vectorized.

  • std::execution::seq: runs the algorithm sequentially

  • std::execution::par: runs the algorithm in parallel on multiple threads

  • std::execution::par_unseq: runs the algorithm in parallel on multiple threads and additionally allows the interleaving of individual loop iterations (vectorization)

Vectorization (std::execution::par_unseq) stands for the SIMD (Single Instruction, Multiple Data) extensions of the instruction set of a modern processor. SIMD enables your processor to execute one operation in parallel on multiple data.

With the execution policy tag, you can choose which variant of an algorithm should be performed. The tag is not binding but is a strong hint to the C++ run time.

std::vector<int> v = {5, -3, 10, -5, -10, 22, 0};
   
// standard sequential sort
std::sort(v.begin(), v.end());
   
// sequential execution
std::sort(std::execution::seq, v.begin(), v.end());
   
// permitting parallel execution
std::sort(std::execution::par, v.begin(), v.end());
   
// permitting parallel and vectorized execution
std::sort(std::execution::par_unseq, v.begin(), v.end());

Sixty-nine of the algorithms of the STL support a parallel or a parallel and vectorized execution. See Table 10.2.

Table 10.2 Algorithms of the STL for which parallel versions are available (the std namespace is omitted)

adjacent_difference    is_heap_until              replace_copy_if
adjacent_find          is_partitioned             replace_if
all_of                 is_sorted                  reverse
any_of                 is_sorted_until            reverse_copy
copy                   lexicographical_compare    rotate
copy_if                max_element                rotate_copy
copy_n                 merge                      search
count                  min_element                search_n
count_if               minmax_element             set_difference
equal                  mismatch                   set_intersection
fill                   move                       set_symmetric_difference
fill_n                 none_of                    set_union
find                   nth_element                sort
find_end               partial_sort               stable_partition
find_first_of          partial_sort_copy          stable_sort
find_if                partition                  swap_ranges
find_if_not            partition_copy             transform
generate               remove                     uninitialized_copy
generate_n             remove_copy                uninitialized_copy_n
includes               remove_copy_if             uninitialized_fill
inner_product          remove_if                  uninitialized_fill_n
inplace_merge          replace                    unique
is_heap                replace_copy               unique_copy

Additionally, eight new algorithms are added with C++17.

std::for_each
std::for_each_n
std::exclusive_scan
std::inclusive_scan
std::transform_exclusive_scan
std::transform_inclusive_scan
std::reduce
std::transform_reduce

The following example shows the usage of the std::transform_exclusive_scan algorithm.

// transformExclusiveScan.cpp; C++17 with MSVC
   
#include <execution>
#include <numeric>
#include <iostream>
#include <vector>
   
int main() {

   std::cout << '\n';
   
   std::vector<int> resVec{1, 2, 3, 4, 5, 6, 7, 8, 9};
   std::vector<int> resVec1(resVec.size()); 
   std::transform_exclusive_scan(std::execution::par, 
                            resVec.begin(), resVec.end(), 
                            resVec1.begin(), 0,
                            [](int fir, int sec){ return fir + sec; },
                            [](int arg){ return arg * arg; });
   
   std::cout << "transform_exclusive_scan: ";
   for (auto v: resVec1) std::cout << v << " ";
   
   std::cout << '\n';
   
}

The std::transform_exclusive_scan algorithm is quite challenging to read. Let me try to explain it. In the first step, std::transform_exclusive_scan applies the lambda expression [](int arg){ return arg * arg; } to each element of the range resVec.begin() to resVec.end(). In the second step, it performs an exclusive scan on the intermediate values with the binary operation [](int fir, int sec){ return fir + sec; } and the initial value 0: each output element is the sum of the squares of all preceding elements, excluding the current one. For the input 1 to 9, the result placed in resVec1 is therefore 0 1 5 14 30 55 91 140 204. See Figure 10.15.


Figure 10.15 Usage of std::transform_exclusive_scan

Message passing

The section on message passing in the C++ Core Guidelines has two rules, but both lack content. Therefore, I have to improvise.

A task is the C++-ish way to pass messages between threads. The message can be a value, an exception, or a notification. A task consists of the two components promise and future. Promise exists in three variations in C++: std::async, std::packaged_task, and std::promise. The promise creates the message, which the future picks up asynchronously.

I already gave a short example of std::async in the rule “CP.4: Think in terms of tasks, rather than threads” to send a value from the promise to the future. In this section, I use a std::promise as the sender.

Sending a value or an exception

In contrast to a thread, the promise and the associated future share a secure channel. In the following example, one promise sends a value, and one promise sends an exception.

// promiseFutureException.cpp
   
#include <exception>
#include <future>
#include <iostream>
#include <thread>
#include <utility>
   
struct Div {
   void operator()(std::promise<int> intPromise, int a, int b) const {
      try {                                                      // (4)
         if (b == 0) {
            std::string err = "Illegal division by zero: " +
                              std::to_string(a) + "/" + std::to_string(b);
            throw std::runtime_error(err);
         }
         intPromise.set_value(a / b);                            // (2)
      }
      catch ( ... ) {
         intPromise.set_exception(std::current_exception());     // (1)
      }
   }
};
   
void executeDivision(int nom, int denom) {
   std::promise<int> divPromise;
   std::future<int> divResult= divPromise.get_future();

   Div div;
   std::thread divThread(div, std::move(divPromise), nom, denom);
   
   // get the result or the exception                        // (5)
   try {
      std::cout << nom << "/" << denom << " = " << divResult.get() << '\n';   // (3)
   }
   catch (std::runtime_error& e){ 
      std::cout << e.what() << '\n';
   }
   
   divThread.join();
}
   
int main() {
   
   std::cout << '\n';
   
   executeDivision(20, 0);
   executeDivision(20, 10);
   
   std::cout << '\n';
   
}

If the division fails, the exception is caught in the try block (4) and stored in the shared state via intPromise.set_exception(std::current_exception()) (1); otherwise, the value is stored via intPromise.set_value(a / b) (2). When the future divResult calls divResult.get() (3), it either gets the value or the stored exception is rethrown. Therefore, like the promise (4), the future has to deal with the exception in its try-catch block (5). Dividing a number by 0 is undefined behavior. The function executeDivision displays the result of the calculation or the exception. See Figure 10.16.


Figure 10.16 A value and an exception as message

Sending a notification

If you use promises and futures (in short tasks) to synchronize threads, they have a lot in common with condition variables. Most of the time, promises and futures are the safer choice than condition variables.

Before I present you with an example, Table 10.3 shows the big picture.

Table 10.3 Condition variables versus tasks

Criteria                     Condition variables    Tasks
Multiple synchronizations    Yes                    No
Critical section             Yes                    No
Spurious wakeup              Yes                    No
Lost wakeup                  Yes                    No

The advantage of a condition variable over a promise and future is that you can use a condition variable to synchronize threads multiple times, whereas a promise can send its notification only once. For a one-time synchronization, a condition variable is a lot more challenging to use correctly: a promise-and-future pair needs no locks and is not prone to spurious or lost wakeups, and there are no critical sections or additional predicates.

// promiseFutureSynchronize.cpp
   
#include <future>
#include <iostream>
#include <thread>
#include <utility>
   
void waitingForWork(std::future<void> fut) {
   std::cout << "Waiting " << '\n';
   fut.wait();                                                                       // (5)
   std::cout << "Running " << '\n';
}
   
void setDataReady(std::promise<void> prom) {
   std::cout << "Data prepared" << '\n';
   prom.set_value();                                                                // (6)
}
   
int main() {
   
   std::cout << '\n';

   std::promise<void> sendReady;                                                    // (1)
   auto fut = sendReady.get_future();                                               // (2)
   
   std::thread t1(waitingForWork, std::move(fut));                                 // (3)
   std::thread t2(setDataReady, std::move(sendReady));                             // (4)
   
   t1.join();
   t2.join();
   
   std::cout << '\n';
   
}

Thanks to sendReady (1), you get a future fut (2). Both communication endpoints are moved into threads t1 (3) and t2 (4). The future waits using the call fut.wait() (5), and it gets the notification of the associated promise: prom.set_value() (6).

The structure and the output of the program match the corresponding program in the rule for condition variables: "CP.42: Don't wait without a condition." See Figure 10.17.


Figure 10.17 Notifications with a task

Lock-free programming

The rules on concurrency and parallelism target non-experts. Lock-free programming is an experts-only topic. Consequently, there are only a few short rules on lock-free programming.

CP.100

Don’t use lock-free programming unless you absolutely have to

This rule is the most important meta-rule of lock-free programming. If you don't believe me, here are a few quotes from talks given by worldwide recognized experts in this particular domain.

  • Herb Sutter: “Lock-free programming is like playing with knives” (CppCon 2014).

  • Anthony Williams: “Lock-free programming is about how to shoot yourself in the foot” (NDC 2016).

  • Tony Van Eerd: “Lock-free coding is the last thing you want to do” (NDC 2016).

  • Fedor Pikus: “Writing lock-free programs is hard. Writing correct lock-free programs is even harder” (NDC 2018).

CP.101

Distrust your hardware/compiler combination

What does “distrust your hardware/compiler combination” mean? Let me put it another way: When you break the sequential consistency, you also break your intuition with high probability. Let me start with a simple program.

// sequentialConsistency.cpp
   
#include <atomic>
#include <iostream>
#include <thread>
   
std::atomic<int> x{0};
std::atomic<int> y{0};
   
void writing(){ 
   x.store(2000);                   // (1)
   y.store(11);                    // (2)
}
   
void reading(){ 
   std::cout << y.load() << " ";   // (3)
   std::cout << x.load() << '\n';  // (4) 
}

int main(){
   std::thread thread1(writing);
   std::thread thread2(reading);
   thread1.join();
   thread2.join();
}

Here is the question about this short example: Which values for y and x are possible in (3) and (4)? Because y and x are atomic, no data race is possible. I further do not specify the memory ordering; therefore, sequential consistency applies. Sequential consistency means the following:

  • Each thread performs its operation in the specified sequence: (1) happens before (2), and (3) happens before (4).

  • There is a global order of all operations on all threads. To put it the other way around, each thread sees all operations in the same sequence.

If you combine these two properties of the sequential consistency, there is only one combination of x and y not possible: y == 11 and x == 0. Now let me break the sequential consistency and maybe your intuition.

The relaxed semantics is the weakest of all memory orderings. It boils down to one guarantee: operations on atomics are atomic, but there are no synchronization or ordering constraints.

// relaxedSemantic.cpp
   
#include <atomic>
#include <iostream>
#include <thread>
   
std::atomic<int> x{0};
std::atomic<int> y{0};
   
void writing(){ 
   x.store(2000, std::memory_order_relaxed); 
   y.store(11, std::memory_order_relaxed); 
}
   
void reading(){ 
   std::cout << y.load(std::memory_order_relaxed) << " "; 
   std::cout << x.load(std::memory_order_relaxed) << '\n'; 
}

int main(){
   std::thread thread1(writing);
   std::thread thread2(reading);
   thread1.join();
   thread2.join();
}

Two highly unintuitive phenomena can happen. First, thread2 can see the operations of thread1 in a different sequence. Second, thread1 can reorder its instructions because they are not performed on the same atomic. What does this mean for the possible values of x and y? y == 11 and x == 0 is now a possible result. Which results are possible depends on your hardware. See Table 10.4.

Table 10.4 Operation reordering on various platforms

Architecture         LoadLoad    LoadStore    StoreLoad    StoreStore
x86, AMD64                                    Yes
Alpha, IA64, RISC    Yes         Yes          Yes          Yes

For example, operation reordering is quite conservative on x86 or AMD64; stores can be reordered after loads, but on Alpha, IA64, or RISC (ARM) architectures, all four possible reorderings of store and load operations are allowed.

LoadLoad in the table means for one control flow that a load operation on an atomic is followed by a load operation on another atomic. The same argumentation applies to LoadStore, StoreLoad, and StoreStore.
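If you need the ordering back without paying for full sequential consistency, acquire-release semantics is typically sufficient. The following sketch is a variation of relaxedSemantic.cpp: when the acquire load of y reads 11, the store to x that is sequenced before the release store to y is guaranteed to be visible, so y == 11 and x == 0 is again impossible.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0};
std::atomic<int> y{0};

void writing(){
   x.store(2000, std::memory_order_relaxed);
   y.store(11, std::memory_order_release);                  // publishes the store to x
}

void reading(){
   while (y.load(std::memory_order_acquire) != 11);         // waits for the publication
   std::cout << x.load(std::memory_order_relaxed) << '\n';  // guaranteed to print 2000
}

int main(){
   std::thread thread1(writing);
   std::thread thread2(reading);
   thread1.join();
   thread2.join();
}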

CP.102

Carefully study the literature

Here are a few resources for outstanding literature. Study them first.

Related rules