Chapter 10

Concurrency


Cippi’s challenges with threads

The C++ Core Guidelines list about thirty rules for concurrency, targeted at a non-expert audience. They consist of general guidelines and rules on concurrency, parallelism, message passing, and lock-free programming.

General guidelines

Although the rules of this section have a general focus, all of them are important.

CP.1

Assume that your code will run as part of a multi-threaded program

Maybe you are surprised to read this rule. Why should I optimize for a special case? To make it clear, this rule is mainly about code used in libraries. And experience shows that library code is often reused. This reuse means that the code is likely to end up being exercised in a multithreaded program.

The code snippet shows an example from the guidelines.

 1   double cached_computation(double x) {
 2      static double cached_x = 0.0;
 3      static double cached_result = COMPUTATION_OF_ZERO;
 4      double result;
 5

 6      if (cached_x == x) return cached_result;
 7      result = computation(x);
 8      cached_x = x;
 9      cached_result = result;
10      return result;
11   }

The function cached_computation is fine if it runs in a single-threaded environment. This observation does not hold for a multithreaded environment because the static variables cached_x (lines 2, 6, and 8) and cached_result (lines 3, 6, and 9) can be modified simultaneously by various threads.

Unsynchronized reading and writing of a shared non-atomic variable is a data race. Consequently, your program has undefined behavior.

What are your options to get rid of the data race?

  1. Use one lock to protect the entire critical region.

  2. Protect the call to the function cached_computation by a lock.

  3. Make both static variables thread_local. thread_local guarantees that each thread gets its own copy of the variables cached_x and cached_result. A static variable is bound to the lifetime of the main thread; a thread_local variable is bound to the lifetime of its thread.

std::mutex m;
double cached_computation(double x) {                  // (1)
   static double cached_x = 0.0;
   static double cached_result = COMPUTATION_OF_ZERO;
   double result;
   {
      std::lock_guard<std::mutex> lck(m);
      if (cached_x == x) return cached_result;
      result = computation(x);
      cached_x = x;
      cached_result = result;
   }
   return result;
}
   
std::mutex cachedComputationMutex;                  // (2)
{
   std::lock_guard<std::mutex> lck(cachedComputationMutex);
   auto cached = cached_computation(3.33);
}
   
   
double cached_computation(double x) {            // (3)
   thread_local double cached_x = 0.0;
   thread_local double cached_result = COMPUTATION_OF_ZERO;
   double result;
   
   if (cached_x == x) return cached_result;
   result = computation(x);
   cached_x = x;
   cached_result = result;
   return result;
}

First, the C++11 standard guarantees that the C++ run time initializes static variables in a thread-safe way; therefore, you do not need to protect their initialization.

  1. This version uses a coarse-grained locking approach. Usually, you should not use coarse-grained locking such as in this version, but maybe in this use case, it is acceptable.

  2. This version is the most coarse-grained solution because the entire function call is locked. The downside is that the user of the function is responsible for the synchronization. In general, that is a bad idea.

  3. Just make the static variables thread_local, and you are done.

CP.2

Avoid data races

First of all, what is a data race?

  • A data race is a situation in which at least two threads access a non-atomic shared variable without synchronization and at least one thread tries to modify the variable.

The rest is simple. If you have a data race in your program, your program has undefined behavior.

If you read the definition of data race carefully, you will notice that a shared, mutable state is necessary for having a data race. Figure 10.1 displays this critical observation. You should avoid the bottom right quadrant, in particular, in a concurrent environment.


Figure 10.1 Four categories of variables

Let me show you a simple data race.

// dataRace.cpp
   
#include <future>
   
int getUniqueId() {
   static int id = 1;
   return id++;
}
   
int main() {
   
   auto fut1 = std::async([]{ return getUniqueId(); });
   auto fut2 = std::async([]{ return getUniqueId(); }); 
   
   auto id = fut1.get();
   auto id2= fut2.get();
   
}

What can go wrong? For example, id++ is a read-modify-write operation. Even if each of the three operations read, modify, and write were atomic, the whole read-modify-write operation would not be atomic. The likely effect of this data race is that the ids are not unique.
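One way to get rid of this data race is to make the counter an atomic. Here is a minimal sketch: std::atomic's fetch_add turns the read-modify-write into a single atomic operation.

// dataRaceFixedAtomic.cpp (sketch)
   
#include <atomic>
#include <future>
   
int getUniqueId() {
   static std::atomic<int> id{1};
   return id.fetch_add(1);                       // atomic read-modify-write
}
   
int main() {
   
   auto fut1 = std::async([]{ return getUniqueId(); });
   auto fut2 = std::async([]{ return getUniqueId(); });
   
   auto id = fut1.get();                         // the ids are now unique
   auto id2 = fut2.get();
   
}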

CP.3

Minimize explicit sharing of writable data

Based on the previous rule related to data races, your shared data should be constant.

Now the only challenge left to solve is how to initialize the constant shared data in a thread-safe way. C++11 supports a few ways to achieve this.

  1. Initialize your data before you start a thread. This is not a C++11 feature, but it is often quite easy to apply.

    const std::unordered_map<std::string, int> tele = {
       {"Grimm", 1966},
       {"Smith", 1968},
       {"Black", 1930} };
    std::thread t1([&tele] { .... });
    std::thread t2([&tele] { .... });
  2. Use constant expressions because they are initialized at compile time.

    constexpr auto doub = 5.1;
  3. Use the function std::call_once in combination with a std::once_flag. You can put the important initialization stuff into the function do_once; the C++ run time guarantees that the callable registered with the same std::once_flag runs exactly once successfully.

    std::once_flag onceFlag;

    void do_once() {
       std::call_once(onceFlag, []{
          std::cout << "Important initialization" << '\n';
       });
    }
    ...
    std::thread t1(do_once);
    std::thread t2(do_once);
    std::thread t3(do_once);
    std::thread t4(do_once);
  4. Use static variables with block scope because the C++11 run time guarantees that they are initialized in a thread-safe way.

    void func() {
       ....
       static int val = 2011;
       ....
    }
    ...

    std::thread t1(func);
    std::thread t2(func);

CP.4

Think in terms of tasks, rather than threads

What is a task? “Task” is a general term for a unit of execution. Since C++11, we have used task as a special term, which stands for two components: a promise and a future. A promise produces a value that the future can asynchronously pick up. Promise and future can run in different threads and are connected by a secure data channel.

Promise exists in three variations in C++: std::async, std::packaged_task, and std::promise. To get more details on tasks, consult the blog posts I’ve written here: https://www.modernescpp.com/index.php/tag/tasks.

A std::packaged_task and a std::promise have in common that they are quite low level; therefore, I present std::async.

Here are a thread and a future/promise pair that calculate the sum of 3 + 4.

// thread
int res;
std::thread t([&]{ res = 3 + 4; });
t.join();
std::cout << res << '\n';
   
// task
auto fut = std::async([]{ return 3 + 4; });
std::cout << fut.get() << '\n';

What is the fundamental difference between a thread and a future/promise pair? A thread is about how something should be calculated; a task is about what should be calculated.

Let me be more specific.

  • The thread t uses the shared variable res to provide its result. In contrast, the promise std::async uses a secure data channel to communicate its result to the future fut. This sharing means that, for the thread t, you have to protect res.

  • In case of a thread, you explicitly create the thread. This does not hold for std::async: you specify what should be calculated, not how it should be calculated, and the C++ run time decides whether a new thread is created. The sketch below shows how to influence that decision.
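If you want to influence this decision, std::async accepts a launch policy. The following lines are only a sketch: std::launch::async forces the creation of a new thread, and std::launch::deferred executes the callable lazily in the thread that calls get().

#include <future>
#include <iostream>

int main() {
   auto eager = std::async(std::launch::async, []{ return 3 + 4; });     // runs in a new thread
   auto lazy = std::async(std::launch::deferred, []{ return 3 + 4; });   // runs lazily in get()
   std::cout << eager.get() << " " << lazy.get() << '\n';
}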

CP.8

Don’t try to use volatile for synchronization

If you want to have an atomic in Java or C#, you declare it as volatile. You may think that you could do the same in C++. That is wrong. volatile has no multithreading semantics in C++. Atomics are called std::atomic in C++11.

Now, you may be curious: What is the meaning of volatile in C++?

volatile is meant to be used for special objects, on which optimized read or write operations are not allowed. volatile is typically used in embedded programming domains to denote objects that can change independently of the regular program flow. These are, for example, objects that represent an external device (memory-mapped I/O). Because these objects can change independently of the regular program flow, their value is directly written to main memory. Consequently, there is no optimized storing of values in hardware caches.
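The following sketch contrasts the two keywords. std::atomic<int> provides the synchronization you may know from Java's volatile, whereas C++'s volatile only tells the compiler not to optimize away accesses to, for example, a memory-mapped register. The register address used here is purely an assumption for illustration.

#include <atomic>

std::atomic<int> counter{0};              // safe to modify from several threads

void increment() {
   counter.fetch_add(1);                  // atomic read-modify-write; no data race
}

// hypothetical memory-mapped status register of an external device
volatile int* const statusRegister = reinterpret_cast<volatile int*>(0x40000000);

int readStatus() {
   return *statusRegister;                // each read really accesses the hardware
}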

CP.9

Whenever feasible use tools to validate your concurrent code

This rule is probably one of the most important ones, and I fully agree.

My students write lots of bugs; in fact, even many of my own programs contain bugs! How can I be sure? Because of the dynamic code analysis tool ThreadSanitizer and the static code analysis tool CppMem. The use cases for ThreadSanitizer and CppMem are different.

ThreadSanitizer gives you the big picture and detects if the execution of your program has a data race. CppMem gives you detailed insight into small pieces of your code, most of the time including atomics. You get the answer to the question, Which interleavings are possible according to the memory model?

Let’s start with ThreadSanitizer.

ThreadSanitizer

Here is the official introduction to ThreadSanitizer from “ThreadSanitizerCppManual” (https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual): “ThreadSanitizer (aka TSan) is a data race detector for C/C++. Data races are one of the most common and hardest bugs in concurrent systems. A data race occurs when two threads access the same non-atomic variable concurrently, and at least one of the accesses is a write. C++11 standard officially bans data races as undefined behavior.”

ThreadSanitizer has been part of Clang since 3.2 and GCC since 4.8. To use it, you have to compile and link using the -fsanitize=thread option and use at least optimization level -O2 and the flag -g for producing debugging information: -fsanitize=thread -O2 -g.

The run-time overhead is significant: The memory usage may increase 5 to 10 times and the execution time 2 to 20 times. Of course, you know the outstanding law of software development: First, make your program correct; then, make it fast.

Now let's see ThreadSanitizer in action. Here is a typical exercise I have often given in my multithreading classes on condition variables:

Write a small ping-pong game.

Two threads should alternately set a bool value to true or false. One thread sets the value to true and notifies the other thread. The other thread sets the value to false and notifies the original thread. The game should end after a fixed number of iterations.

And this is the typical implementation my students come up with.

 1 // conditionVariablePingPong.cpp
 2 
 3 #include <condition_variable>
 4 #include <iostream>
 5 #include <thread>
 6 
 7 bool dataReady= false; 
 8 
 9 std::mutex mut;
10 std::condition_variable condVar1;
11 std::condition_variable condVar2;
12 
13 int counter = 0;
14 int COUNTLIMIT = 50;
15 
16 void setTrue() { 
17 
 18   while(counter <= COUNTLIMIT) {
 19      std::unique_lock<std::mutex> lck(mut);
20      condVar1.wait(lck, []{return dataReady == false;}); 
21      dataReady = true;
22      ++counter; 
23      std::cout << dataReady << '\n';
24      condVar2.notify_one(); 
25   }
26 }
27 
28 void setFalse() { 
29 
30   while(counter < COUNTLIMIT) { 
31      std::unique_lock<std::mutex> lck(mut);
32      condVar2.wait(lck, []{return dataReady == true;});
33      dataReady = false;
34      std::cout << dataReady << '\n';
35      condVar1.notify_one();
36   }
37 }
38 
39 int main() {
40 
41   std::cout << std::boolalpha << '\n';
42 
43   std::cout << "Begin: " << dataReady << '\n';
44 
45   std::thread t1(setTrue);
46   std::thread t2(setFalse);
47 
48   t1.join();
49   t2.join();
50 
51   dataReady = false;
52   std::cout << "End: " << dataReady << '\n';
53 
54   std::cout << '\n';
55 
56 }

Function setTrue (line 16) sets the boolean value dataReady (line 21) to true, and function setFalse (line 28) sets it to false. The play starts with setTrue. The condition variable in the function waits for the notification and therefore first checks the boolean dataReady (line 20). Afterward, the function increments the counter (line 22) and notifies the other thread with the help of the condition variable condVar2 (line 24). The function setFalse follows the same workflow. When the counter exceeds COUNTLIMIT (line 18), the game ends. Fine? No!

There is a data race on the counter. It is read (line 30) and written (line 22) without synchronization. ThreadSanitizer shows the data race (see Figure 10.2).


Figure 10.2 Data race detection with ThreadSanitizer
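A minimal fix, assuming the rest of conditionVariablePingPong.cpp stays unchanged, is to make the counter an atomic; the write in setTrue (line 22) and the unsynchronized read in setFalse (line 30) are then atomic operations and no longer a data race.

#include <atomic>

std::atomic<int> counter{0};   // replaces line 13: int counter = 0;

Alternatively, you can read the counter only while holding the mutex mut.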

ThreadSanitizer detects data races during run time; CppMem lets you analyze small code snippets.

CppMem

Here is a short overview of CppMem.

The online tool, which you can also install on your PC, provides very valuable services.

  1. CppMem validates small code snippets, typically including atomics.

  2. The very accurate analysis of CppMem gives you deep insight into the C++ memory model.

For more in-depth insight, read my blog posts about CppMem at http://www.modernescpp.com/index.php/tag/cppmem. For now, I refer to the first point and provide you a thousand-foot view of CppMem.

My overview uses the default configuration of the tool. This overview should give you the basis for further experiments.

For simplicity reasons, I refer to the red numbers in Figure 10.3.


Figure 10.3 Overview of CppMem

  1. Model

    • This specifies the memory model; the default setting, preferred, stands for the C++ memory model.

  2. Program

    • This is the executable program in a C- or C++-like syntax.

    • CppMem offers you a bunch of typical interleavings of atomics. To get the details of these programs, read the very well-written article “Mathematizing C++ Concurrency” by Mark Batty et al. at http://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf. Of course, you can also use your code.

    • CppMem is about multithreading, so there are a few simplifications.

      • You can easily define two threads by the symbols {{{ ... ||| ... }}}. The three dots ( ... ) stand for the work package of the thread.

  3. Display Relations

    • This describes the relations between the read, write, and read-write modifications on atomic operations, fences, and locks.

    • If the relation is enabled, it is displayed in the annotated graph (see point 6).

      • sb: sequenced-before

      • rf: read from

      • mo: modification order

      • sc: sequential consistency

      • lo: lock order

      • sw: synchronizes-with

      • dob: dependency-ordered-before

      • data_races

  4. Display Layout

    • You can choose with this switch which Doxygraph graph is used.

  5. Choose the Executions

    • Switch between the various consistent executions.

  6. Annotated Graph

    • This displays the annotated graph.

Now, let’s try it out.

The program dataRaceOnX.cpp has a data race on the int variable x. y is an atomic and, therefore, fine from a concurrency perspective.

// dataRaceOnX.cpp
   
#include <atomic>
#include <iostream>
#include <thread>

int x = 0;
std::atomic<int> y{0};
   
void writing() { 
   x = 2000; 
   y.store(11); 
}
   
void reading(){ 
   std::cout << y.load() << " "; 
   std::cout << x << '\n'; 
}
   
int main() {
   
   std::thread thread1(writing);
   std::thread thread2(reading);
   
   thread1.join();
   thread2.join();
   
}

In order to use CppMem, you must rewrite your C++ program in the dialect of C expected by CppMem’s parser. Cutting and pasting standard C++ code fails with a cryptic “Frontc.ParseError.” Here is the equivalent program written in the more concise CppMem syntax.

// dataRaceOnXCppMem.txt
   
int main(){
   int x = 0;
   atomic_int y = 0;
   
   {{{ 
     {
       x = 2000;
       y.store(11);
     }
   |||

     {
       y.load();
       x;
     }
   }}}
}

CppMem shows it immediately. The first consistent execution has a data race on x. See Figure 10.4.


Figure 10.4 A data race in CppMem

You can observe the data race in the graph. It is the yellow edge (dr) between the write operation (x=2000) and the read operation (x=0).

To get more details on CppMem, read the blog posts I’ve written at https://www.modernescpp.com/index.php/tag/cppmem.

Concurrency

Concurrency is a challenging topic. This challenge is due, in particular, to the low level of abstraction we currently have at our disposal. Knowing and applying the rules of this section is, therefore, crucial for writing well-defined multithreading programs.

Let me classify about fifteen rules into categories related to locks, threads, and condition variables, but also data sharing between threads, resource considerations, and a sometimes overlooked danger.

Locks

NNN stands for No Naked New and means that memory allocation should not be done as a standalone operation but inside a manager object (“R.12: Immediately give the result of an explicit resource allocation to a manager object”). The same observation holds for mutexes. Mutexes should immediately be given to a manager object, which is, in this case, a lock. In modern C++, we have a std::lock_guard, std::unique_lock, std::shared_lock (C++14), or std::scoped_lock (C++17). Consequently, keep the acronym NNM, which stands for No Naked Mutex, in mind. Locks implement the RAII idiom. The crucial idea behind the RAII idiom is to bind the lifetime of a resource to the lifetime of a local variable. C++ automatically manages the lifetime of locals.

CP.20

Use RAII, never plain lock()/unlock()

The small code snippet should make the value of a lock immediately clear.

std::mutex mtx;
   
void do_stuff() {
   mtx.lock();
   // ... do stuff ... (1)
   mtx.unlock();
}

It doesn’t matter if an exception occurs in (1) or you just forgot to unlock the mtx; in both cases, you will get a deadlock if another thread wants to acquire (lock) the std::mutex mtx. Locks come to the rescue.

std::mutex mtx;
   
void do_stuff() {
   std::lock_guard<std::mutex> lck {mtx};
   // ... do stuff ...
}

Put the mutex into a lock, and the mutex is automatically locked in the constructor of the std::lock_guard and unlocked when lck goes out of scope.

CP.21

Use std::lock() or std::scoped_lock to acquire multiple mutexes

If a thread needs more than one mutex at a time, you have to be extremely careful that you always lock the mutexes in the same sequence. If not, a bad interleaving of threads may cause a deadlock. The following program causes a deadlock.

// lockGuardDeadlock.cpp
   
#include <iostream>
#include <chrono>
#include <mutex>
#include <thread>
   
struct CriticalData {
   std::mutex mut;
};
   
void deadLock(CriticalData& a, CriticalData& b) {
   
   std::lock_guard<std::mutex> guard1(a.mut);
   std::cout << "Thread: " << std::this_thread::get_id() << '\n';
   
   std::this_thread::sleep_for(std::chrono::milliseconds(1));
   
   std::lock_guard<std::mutex> guard2(b.mut);
   std::cout << "Thread: " << std::this_thread::get_id() << '\n';
   
   // do something with a and b (critical region)          (2)
}
   
int main() {
   
   std::cout << '\n';
   
   CriticalData c1;
   CriticalData c2;
   
   std::thread t1([&]{deadLock(c1, c2);});                 // (1)
   std::thread t2([&]{deadLock(c2, c1);});                 // (1)

   t1.join();
   t2.join();
   
   std::cout << '\n';
   
}

Threads t1 and t2 need two CriticalData objects to perform their jobs (2). CriticalData has its own mutex mut to synchronize the access. Unfortunately, both invoke the function deadLock with the arguments c1 and c2 in a different sequence (1). Now we have a race condition that could end up in a deadlock. Thread t1 can lock the first mutex a.mut but not the second one b.mut because, in the meantime, thread t2 locked the second one. See Figure 10.5.


Figure 10.5 A deadlock due to multiple locked mutexes

The easiest way to solve the deadlock is to lock both mutexes atomically.

With C++11, you can use a std::unique_lock together with std::lock. Thanks to the tag std::defer_lock, the std::unique_lock takes the mutex without locking it. The locking finally takes place in the std::lock call. std::lock can take an arbitrary number of arguments.

void deadLock(CriticalData& a, CriticalData& b) {
   std::unique_lock<std::mutex> guard1(a.mut, std::defer_lock);
   std::unique_lock<std::mutex> guard2(b.mut, std::defer_lock);
   std::lock(guard1, guard2);
   // do something with a and b (critical region)
}

With C++17, a std::scoped_lock can lock an arbitrary number of mutexes in an atomic operation.

void deadLock(CriticalData& a, CriticalData& b) {
   std::scoped_lock scoLock(a.mut, b.mut);

   // do something with a and b (critical region)
}

CP.22

Never call unknown code while holding a lock (e.g., a callback)

Why is this code snippet bad, and why should it not pass a code review?

std::mutex m; 
{
   std::lock_guard<std::mutex> lockGuard(m);
   sharedVariable = unknownFunction();
}

I can only speculate about the unknownFunction. If unknownFunction

  • Tries to lock the mutex m, you get undefined behavior. Most of the time, the result of this undefined behavior is a deadlock.

  • Starts a new thread that tries to lock the mutex m, you get a deadlock.

  • Locks another mutex m2, you may get a deadlock because you lock the two mutexes m and m2 at the same time. Another thread may lock the same mutexes in a different sequence.

  • Does not directly or indirectly try to lock the mutex m, all seems to be fine. “Seems” because a coworker can modify the function, and you get a changed version of the function unknownFunction. Now all bets are off.

  • Works as expected, you still may have a performance issue because the unknownFunction takes quite a while. What was meant to be a multithreaded program behaves similarly to a single-threaded program.

To overcome the issues, use a local variable and invoke the unknown function outside of the critical region.

std::mutex m;
auto tempVar = unknownFunction();
{
   std::lock_guard<std::mutex> lockGuard(m);
   sharedVariable = tempVar;
}

This additional indirection solves all issues. tempVar is a local variable and can, therefore, not be a victim of a data race. No victim means that you can invoke unknownFunction without a synchronization mechanism. Additionally, the time for holding the lock is reduced to its bare minimum: assigning the value of tempVar to sharedVariable.

Threads

Threads are the basic building block for concurrent and parallel programming. With each new C++ standard, threads become more and more an implementation detail for concurrency. For example, with C++17, we got the parallel STL, which allows us to specify the execution policy; with C++20, coroutines; and with C++23, we can hope for transactional memory.

CP.23

Think of a joining thread as a scoped container

and

CP.24

Think of a thread as a global container

The slight variation of the code snippet from the C++ Core Guidelines should make both rules clear:

void f(int* p) {
   // ...
   *p = 99;
   // ...
}
   
int glob = 33;
   
void some_fct(int* p) {                  // (1)
   int x = 77;
   std::thread t0(f, &x);               // OK
   std::thread t1(f, p);                // OK
   std::thread t2(f, &glob);           // OK
   auto q = std::make_unique<int>(99);
   std::thread t3(f, q.get());         // OK
   // ...
   t0.join();

   t1.join();
   t2.join();
   t3.join();
   // ...
}
   
void some_fct2(int* p) {                 // (2)
   int x = 77;
   std::thread t0(f, &x);               // bad
   std::thread t1(f, p);                // bad
   std::thread t2(f, &glob);            // OK
   auto q = std::make_unique<int>(99);
   std::thread t3(f, q.get());         // bad
   // ...
   t0.detach();
   t1.detach();
   t2.detach();
   t3.detach();
   // ...
}

The only difference between the functions some_fct (1) and some_fct2 (2) is that the first variation joins its created threads, but the second variation detaches all created threads.

First of all, you have to join or detach the child threads. If you don't, std::terminate is called in the destructor of the std::thread object (see rule "CP.25: Prefer gsl::joining_thread over std::thread").

The difference between joining and detaching a created thread is the following: When the creator calls thr.join() on the created thread thr, it waits until the created thread is done. thr.join() is a synchronization point. Consequently, the child thread thr can use all variables (state) of the enclosing scope in which it was created, and all calls of the function f are well-defined.

On the contrary, a thr.detach() call does not wait and is, therefore, not a synchronization point. This means that the created thread can outlive its creator. Consequently, using variables of the enclosing scope may not be valid anymore. This is precisely the issue in the function some_fct2. The variable x, the pointer p, or the resource of the std::unique_ptr q may not be valid anymore.

A thread can be seen as a container that uses variables from the outside. In the case of a joining thread, the lifetime of this container is scoped; a detached thread behaves like a global container.

CP.25

Prefer std::jthread over std::thread

The original title of this rule is “CP.25: Prefer gsl::joining_thread over std::thread.” I replaced gsl::joining_thread from the Guidelines Support Library with std::jthread from C++20.

In the following program, I forgot to join the thread t.

// threadWithoutJoin.cpp
   
#include <iostream>
#include <thread>
   
int main() {
   
   std::thread t([]{ 
      std::cout << std::this_thread::get_id() << '\n';
   });
   
}

The execution of the program ends abruptly. See Figure 10.6.


Figure 10.6 Forgot to join a thread

The lifetime of the created thread t ends with its callable. The creator has two choices. First, it waits until its child is done (t.join()). Second, it detaches itself from its child (t.detach()). A thread t with a callable unit—you can create threads without callable units—is called joinable if neither a t.join() nor a t.detach() call has happened. The destructor of a joinable thread calls std::terminate, which ends in std::abort. In our case, the program terminated before the child thread had time to display its id.

In contrast to a std::thread, a std::jthread automatically joins in its destructor if it is still joinable. Replacing the std::thread with a std::jthread, therefore, solves the issue.

// threadWithJoin.cpp; C++20
   
#include <iostream>
#include <thread>

int main() {
   
   std::jthread t([]{ 
      std::cout << std::this_thread::get_id() << '\n';
   });
   
}

CP.26

Don’t detach() a thread

This rule sounds strange. The C++11 standard supports detaching a thread, but we should not do it! The reason is that detaching a thread can be quite challenging. For example, have a look at this small program that has undefined behavior. Even objects with static duration can be critical.

// threadDetach.cpp
   
#include <iostream>
#include <string>
#include <thread>
   
void func() {
   std::string s{"C++11"};
   std::thread t([&s]{ std::cout << s << '\n';});
   t.detach();
}
   
int main() {
   func();
}

The lambda takes s by reference. This is undefined behavior because the child thread t may use the variable s after it goes out of scope. Stop! This is the obvious problem, but there is a hidden issue that many programmers overlook: std::cout. std::cout has static duration, which means its lifetime ends with the end of the process. We additionally have a race condition: the detached thread t may still use std::cout at that point.

Race condition: A race condition is a situation in which the result of an operation depends on the interleaving of certain individual operations.
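A sketch of a fix for threadDetach.cpp: capture s by value so that the child thread owns its own copy, and join the thread instead of detaching it so that it cannot outlive std::cout.

// threadJoin.cpp (sketch)
   
#include <iostream>
#include <string>
#include <thread>
   
void func() {
   std::string s{"C++11"};
   std::thread t([s]{ std::cout << s << '\n'; });   // capture by value
   t.join();                                        // wait for the child thread
}
   
int main() {
   func();
}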

Condition variables

CP.42

Don’t wait without a condition

Condition variables support a quite simple concept. One thread prepares something and sends a notification to another thread that is waiting for it.

Here is the rationale for the rule: "A wait without a condition can miss a wakeup or wake up simply to find that there is no work to do." What does that mean? Condition variables are subject to two very serious issues: lost wakeups and spurious wakeups. The key concern about condition variables is that they have no memory.

Before I present you with this issue, let me first present the correct way to use condition variables.

// conditionVariable.cpp
   
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
   
std::mutex mut;
std::condition_variable condVar; 
   
bool dataReady{false};
   
void waitingForWork() {
   std::cout << "Waiting " << '\n';
   std::unique_lock<std::mutex> lck(mut);
   condVar.wait(lck, []{ return dataReady; });                               // (4) 
   std::cout << "Running " << '\n';
}
   
void setDataReady() {
   {
      std::lock_guard<std::mutex> lck(mut);
      dataReady = true;
   }
   std::cout << "Data prepared" << '\n';
   condVar.notify_one();                                                    // (3) 
}
   
int main() {
   
   std::cout << '\n';

   std::thread t1(waitingForWork);                                         // (1) 
   std::thread t2(setDataReady);                                          // (2) 
   
   t1.join();
   t2.join();
   
   std::cout << '\n';
   
}

How does the synchronization work? The program has two child threads: t1 and t2. They get their work packages waitingForWork and setDataReady in (1) and (2). setDataReady sends a notification—using the condition variable condVar—that it is done with the preparation of the work: condVar.notify_one() (3). While holding the lock, thread t1 waits for its notification: condVar.wait(lck, []{ return dataReady; }) (4). The sender and receiver need a lock. In the case of the sender, a std::lock_guard is sufficient because it calls lock and unlock only once. In the case of the receiver, a std::unique_lock is necessary because it locks and unlocks its mutex repeatedly while waiting.

Figure 10.7 shows the output of the program.


Figure 10.7 A condition variable in action

Maybe you are wondering, Why do we need a predicate for the wait call since you can invoke wait without a predicate? This workflow seems too complicated for such a simple synchronization of threads.

Now we are back to the missing memory of condition variables and the two phenomena called lost wakeup and spurious wakeup.

  • Lost wakeup: The phenomenon of the lost wakeup is that the sender sends its notification before the receiver begins waiting. The consequence is that the notification is lost.

  • Spurious wakeup: The receiver may wake up even if no notification has been sent. At least POSIX Threads and the Windows API can be victims of these phenomena.

To avoid these two issues, you have to use an additional predicate as memory or, as the rule states it, an additional condition. If you use a condition variable without a predicate, it’s possible that you will get a lost wakeup and, therefore, a deadlock because the waiting thread is waiting for something that never happens.

The following program uses a condition variable without an additional predicate. Let’s see what happens:

// conditionVariableWithoutPredicate.cpp
   
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
   
std::mutex mut;
std::condition_variable condVar;
   
void waitingForWork() {
   std::cout << "Waiting " << '\n';
   std::unique_lock<std::mutex> lck(mut);
   condVar.wait(lck); 
   std::cout << "Running " << '\n';
}
   
void setDataReady() {
   std::cout << "Data prepared" << '\n';
   condVar.notify_one(); 
}
   
int main() {
   
   std::cout << '\n';
   
   std::thread t1(waitingForWork);
   std::thread t2(setDataReady);

   t1.join();
   t2.join();
   
   std::cout << '\n';
   
}

Whenever the notifying thread t2 sends its notification before the waiting thread t1 starts to wait, the notification is lost. The second execution shows this phenomenon, which results in a deadlock. See Figure 10.8.


Figure 10.8 A condition variable without a predicate

Data sharing

The less you share data, and the more you work with local variables, the better. Sometimes, though, you have no other option but to share data. For example, the child thread wants to communicate its work to its parent thread.

CP.31

Pass small amounts of data between threads by value, rather than by reference or pointer

Passing data to a thread by value immediately gives you two benefits:

  1. There is no sharing, and therefore, no data race is possible. The requirement for a data race is a mutable, shared state.

  2. You don’t have to be concerned about the lifetime of the data. The data stays alive for the lifetime of the created thread.

Of course, the crucial question is, What does a small amount of data mean? The C++ Core Guidelines are not clear about this point. In the rule "F.16: For 'in' parameters, pass cheaply-copied types by value and others by reference to const," the C++ Core Guidelines state that 4 * sizeof(int) is a rule of thumb for functions. This means that data smaller than 4 * sizeof(int) should be passed by value and bigger data by reference or pointer.

In the end, you must measure the performance of your program if necessary.
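Here is a minimal sketch of both benefits. The thread gets its own copy of the small Point value, so there is neither sharing nor a lifetime issue, even though the original goes out of scope before the join. Point is only an illustrative type.

#include <iostream>
#include <thread>

struct Point { int x; int y; };                  // small and cheap to copy

void show(Point p) {                             // taken by value
   std::cout << p.x << " " << p.y << '\n';
}

int main() {
   std::thread t;
   {
      Point local{1, 2};
      t = std::thread(show, local);              // local is copied into the thread
   }                                             // local is gone; the copy is still valid
   t.join();
}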

CP.32

To share ownership between unrelated threads use shared_ptr

Imagine that you have an object that you want to share between unrelated threads. Unrelated means that there is no data race on that object. The key question is, Who is the owner of the object and, therefore, responsible for releasing the memory? Now you can choose between a memory leak if you don’t deallocate the memory or undefined behavior if you invoke delete more than once. Most of the time, the undefined behavior causes a run-time crash.

// threadSharesOwnership.cpp
   
#include <iostream>
#include <thread>
   
using namespace std::literals::chrono_literals;
   
struct MyInt {
   int val{2017};
   ~MyInt() {                                                                 // (4)
      std::cout << "Goodbye" << '\n'; 
   }
};
   
void showNumber(const MyInt* myInt) {
   std::cout << myInt->val << '\n';
}
   
void threadCreator() {
   MyInt* tmpInt= new MyInt;                                                  // (1)
   
   std::thread t1(showNumber, tmpInt);                                        // (2)
   std::thread t2(showNumber, tmpInt);                                        // (3)

   t1.detach();
   t2.detach();
}
   
int main() {
   
   std::cout << '\n';
   
   threadCreator();
   std::this_thread::sleep_for(1s);
   
   std::cout << '\n';
   
}

The example is intentionally simple. I let the main thread sleep for one second to be sure that it outlived the lifetime of the children threads t1 and t2. This is, of course, not an appropriate synchronization, but it helps to make my point. The vital issue of the program is, Who is responsible for the deletion of tmpInt (1)? Thread t1 (2), thread t2 (3), or the function (main thread) itself? Because I cannot forecast how long each thread runs, I decided to go with a memory leak. Consequently, the destructor of MyInt (4) is never called (see Figure 10.9).


Figure 10.9 Shared ownership using a pointer

The lifetime issues are easy to handle if I use a std::shared_ptr.

// threadSharesOwnershipSharedPtr.cpp
   
#include <iostream>
#include <memory>
#include <thread>

using namespace std::literals::chrono_literals;
   
struct MyInt {
   int val{2017};
   ~MyInt() {
      std::cout << "Goodbye" << '\n';
   }
};
   
void showNumber(std::shared_ptr<MyInt> myInt) { 
   std::cout << myInt->val << '\n';
}
   
void threadCreator() {
   auto sharedPtr = std::make_shared<MyInt>(); // (1) 
   
   std::thread t1(showNumber, sharedPtr);
   std::thread t2(showNumber, sharedPtr);
   
   t1.detach();
   t2.detach();
}
   
int main() {
   
   std::cout << '\n';
   
   threadCreator();
   std::this_thread::sleep_for(1s);
   
   std::cout << '\n';
   
}

Two small changes to the source code are necessary. First, the pointer in (1) becomes a std::shared_ptr, and second, the function showNumber takes a smart pointer instead of a raw pointer. I assume for simplicity reasons that the threads t1 and t2 are done within a second. See Figure 10.10.


Figure 10.10 Shared ownership using a smart pointer
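Instead of detaching the threads and letting the main thread sleep, you can also join them. The following sketch of threadCreator, assuming the definitions from threadSharesOwnershipSharedPtr.cpp, needs no sleep at all; the std::shared_ptr keeps MyInt alive as long as any owner uses it.

void threadCreator() {
   auto sharedPtr = std::make_shared<MyInt>();

   std::thread t1(showNumber, sharedPtr);
   std::thread t2(showNumber, sharedPtr);

   t1.join();                                   // wait instead of detach
   t2.join();                                   // MyInt dies with its last owner
}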

Resources

One of the main reasons for using concurrency is performance. Keep in mind that using threads requires resources: time and memory. The resource usage begins with creation, continues with context switching from user space to kernel space, and ends with the destruction of a thread. Additionally, a thread has its own state that has to be allocated and maintained.

CP.40

Minimize context switching

and

CP.41

Minimize thread creation and destruction

How expensive is a thread? The answer to this question is the reason for this rule. Let me first talk about the usual size of a thread and then about the costs of its creation.

Size

A std::thread is a thin wrapper around the native thread, and I’m, therefore, interested in the size of a Windows thread and a POSIX thread.

Table 10.1 Typical thread size

Architecture    Default stack size
i386            2MB
IA-64           32MB
PowerPC         4MB
S/390           2MB
Sparc-32        2MB
Sparc-64        4MB
x86_64          2MB

Creation

I didn’t find numbers on how much time it takes to create a thread. To get a gut feeling, I made a simple performance test on Linux and Windows. Don’t use the numbers to compare Linux and Windows. This is not the point of this experiment.

I used GCC 6.2.1 on a desktop and Microsoft Visual Studio 2017 on a laptop for my performance tests. I compiled the programs with maximum optimization on both platforms.

Here is the small test program.

// threadCreationPerformance.cpp
   
#include <chrono>
#include <iostream>
#include <thread>
   
constexpr long long numThreads= 1'000'000;
   
int main() {
   
   auto start = std::chrono::system_clock::now();
   
   for (long long i = 0; i < numThreads; ++i) {           // (1)
      std::thread([]{}).detach(); 
   }
   
   std::chrono::duration<double> dur = 
      std::chrono::system_clock::now() - start;
   
   std::cout << "time: " << dur.count() 
   
}

The program creates one million threads that execute an empty lambda function (1). Figures 10.11 and 10.12 show the results for Linux and Windows.


Figure 10.11 Thread creation on Linux


Figure 10.12 Thread creation on Windows

This means that the creation of a single thread took about 14.5 seconds / 1000000 = 14.5 microseconds on Linux and about 44 seconds / 1000000 = 44 microseconds on Windows.

To put it another way, in one second, you can create about 69,000 threads on Linux and 23,000 threads on Windows.

CP.43

Minimize time spent in a critical section

The less time you lock a mutex, the more time other threads can run. Let’s take the example of the notification of a condition variable.

void setDataReady() {
   std::lock_guard<std::mutex> lck(mut);
   dataReady = true;                               // (1) 
   std::cout << "Data prepared" << '\n'; 
   condVar.notify_one(); 
}

The mutex mut is locked at the beginning of the function and unlocked at the end of the function. Locking the whole function is not necessary: only the expression dataReady = true (1) has to be protected.

First, std::cout is thread safe: the C++11 standard guarantees that each character is written atomically; characters from different threads may interleave, but there is no data race. Second, the notification condVar.notify_one() is thread safe.

Here is the improved version of the function setDataReady:

void setDataReady() {
   { 
      std::lock_guard<std::mutex> lck(mut);
      dataReady = true;
   } 
   std::cout << "Data prepared" << '\n';
   condVar.notify_one(); 
}

Overlooked danger

CP.44

Remember to name your lock_guards and unique_locks

When you don't name the std::lock_guard or the std::unique_lock, you just create a temporary that is created and immediately destroyed. The std::lock_guard or std::unique_lock automatically locks its mutex in its constructor and unlocks it in its destructor. This pattern is called RAII.

My small example shows the conceptual behavior of a std::lock_guard. Its big brother std::unique_lock supports more operations.

// myGuard.cpp
   
#include <mutex>
#include <iostream>
   
template <typename T>
class MyGuard {
public:
   explicit MyGuard(T& m): myMutex(m) {
      std::cout << "lock" << '\n';
      myMutex.lock();
   }
   ~MyGuard() {
      myMutex.unlock();
      std::cout << "unlock" << '\n';
   }
private:
   T& myMutex;
};
   
int main() {
   
   std::cout << '\n';
   
   std::mutex m;
   MyGuard<std::mutex> {m};                                          // (1) oops!
   std::cout << "CRITICAL SECTION" << '\n';                         // (2)
   
   std::cout << '\n';
   
}                                                                 // (3)

MyGuard calls lock and unlock in its constructor and its destructor. Because MyGuard<std::mutex> {m}; is a temporary, both the constructor and the destructor are called in (1); the destructor is not, as usual, called at the end of main (3). As a consequence, the critical section in line (2) is executed unprotected.

This execution of the program shows that the output of the message unlock happens before the output of the message CRITICAL SECTION. See Figure 10.13.


Figure 10.13 Using a temporary lock std::lock_guard

By giving the unnamed lock MyGuard<std::mutex> {m}; in (1) a name, MyGuard<std::mutex> myGuard{m};, the critical section becomes protected. See Figure 10.14.


Figure 10.14 Using a named temporary std::lock_guard

Parallelism

Besides the title, there is no content regarding parallelism in the C++ Core Guidelines. To fill the gap, I have provided a short introduction to the parallel STL and a rule. First of all, here is my rule.

Prefer the parallel algorithms of the STL to handcrafted solutions with threads.

The idea is quite simple. The Standard Template Library has more than 100 algorithms for searching, counting, and manipulating ranges and their elements. With C++17, 69 of them are overloaded and a few new ones are added. The overloaded and new algorithms can be invoked with a so-called execution policy. By using the execution policy, you can specify whether the algorithm should run sequentially, in parallel, or in parallel and vectorized.

  • std::execution::seq: runs the algorithm sequentially

  • std::execution::par: runs the algorithm in parallel on multiple threads

  • std::execution::par_unseq: runs the algorithm in parallel on multiple threads and additionally allows the interleaving of individual loop iterations (vectorization)

Vectorization (std::execution::par_unseq) stands for the SIMD (Single Instruction, Multiple Data) extensions of the instruction set of a modern processor. SIMD enables your processor to execute one operation in parallel on multiple data.

With the execution policy tag, you can choose which variant of an algorithm should be performed. The tag is not binding but is a strong hint to the C++ run time.

std::vector<int> v = {5, -3, 10, -5, -10, 22, 0};
   
// standard sequential sort
std::sort(v.begin(), v.end());
   
// sequential execution
std::sort(std::execution::seq, v.begin(), v.end());
   
// permitting parallel execution
std::sort(std::execution::par, v.begin(), v.end());
   
// permitting parallel and vectorized execution
std::sort(std::execution::par_unseq, v.begin(), v.end());

Sixty-nine of the algorithms of the STL support a parallel or a parallel and vectorized execution. See Table 10.2.

Table 10.2 Algorithms of the STL for which parallel versions are available (the std namespace is omitted)

adjacent_difference    is_heap_until              replace_copy_if
adjacent_find          is_partitioned             replace_if
all_of                 is_sorted                  reverse
any_of                 is_sorted_until            reverse_copy
copy                   lexicographical_compare    rotate
copy_if                max_element                rotate_copy
copy_n                 merge                      search
count                  min_element                search_n
count_if               minmax_element             set_difference
equal                  mismatch                   set_intersection
fill                   move                       set_symmetric_difference
fill_n                 none_of                    set_union
find                   nth_element                sort
find_end               partial_sort               stable_partition
find_first_of          partial_sort_copy          stable_sort
find_if                partition                  swap_ranges
find_if_not            partition_copy             transform
generate               remove                     uninitialized_copy
generate_n             remove_copy                uninitialized_copy_n
includes               remove_copy_if             uninitialized_fill
inner_product          remove_if                  uninitialized_fill_n
inplace_merge          replace                    unique
is_heap                replace_copy               unique_copy

Additionally, eight new algorithms are added with C++17.

std::for_each
std::for_each_n
std::exclusive_scan
std::inclusive_scan
std::transform_exclusive_scan
std::transform_inclusive_scan
std::reduce
std::transform_reduce

The following example shows the usage of the std::transform_exclusive_scan algorithm.

// transformExclusiveScan.cpp; C++17 with MSVC
   
#include <execution>
#include <numeric>
#include <iostream>
#include <vector>
   
int main() {

   std::cout << '\n';
   
   std::vector<int> resVec{1, 2, 3, 4, 5, 6, 7, 8, 9};
   std::vector<int> resVec1(resVec.size()); 
   std::transform_exclusive_scan(std::execution::par, 
                            resVec.begin(), resVec.end(), 
                            resVec1.begin(), 0,
                            [](int fir, int sec){ return fir + sec; },
                            [](int arg){ return arg * arg; });
   
   std::cout << "transform_exclusive_scan: ";
   for (auto v: resVec1) std::cout << v << " ";
   
   std::cout << '\n';
   
}

The std::transform_exclusive_scan algorithm is quite challenging to read. Let me try to explain it. In the first step, std::transform_exclusive_scan applies the lambda expression [](int arg){ return arg * arg; } to each element of the range resVec.begin() to resVec.end(). In the second step, it performs an exclusive scan on the intermediate values with the binary operation [](int fir, int sec){ return fir + sec; } and the initial value 0: each output element is the sum of the squares of all preceding elements, excluding the current one. For the input 1 to 9, the result placed in resVec1 is therefore 0 1 5 14 30 55 91 140 204. See Figure 10.15.


Figure 10.15 Usage of std::transform_exclusive_scan

Message passing

The section on message passing in the C++ Core Guidelines has two rules, but both lack content. Therefore, I have to improvise.

A task is the C++-ish way to pass messages between threads. The message can be a value, an exception, or a notification. A task consists of the two components promise and future. Promise exists in three variations in C++: std::async, std::packaged_task, and std::promise. The promise creates the message, which the future picks up asynchronously.

I already gave a short example of std::async in the rule “CP.4: Think in terms of tasks, rather than threads” to send a value from the promise to the future. In this section, I use a std::promise as the sender.

Sending a value or an exception

In contrast to a thread, the promise and the associated future share a secure channel. In the following example, one promise sends a value, and one promise sends an exception.

// promiseFutureException.cpp
   
#include <exception>
#include <future>
#include <iostream>
#include <thread>
#include <utility>
   
struct Div {
   void operator()(std::promise<int> intPromise, int a, int b) const {
      try {                                                      // (4)
         if (b == 0) {
            std::string err = "Illegal division by zero: " +
                              std::to_string(a) + "/" + std::to_string(b);
            throw std::runtime_error(err);
         }
         intPromise.set_value(a / b);                            // (2)
      }
      catch ( ... ) {
         intPromise.set_exception(std::current_exception());     // (1)
      }
   }
};
   
void executeDivision(int nom, int denom) {
   std::promise<int> divPromise;
   std::future<int> divResult= divPromise.get_future();

   Div div;
   std::thread divThread(div, std::move(divPromise), nom, denom);
   
   // get the result or the exception                        // (5)
   try {
      std::cout << nom << "/" << denom << " = " << divResult.get() << '\n';   // (3)
   }
   catch (std::runtime_error& e){ 
      std::cout << e.what() << '\n';
   }
   
   divThread.join();
}
   
int main() {
   
   std::cout << '\n';
   
   executeDivision(20, 0);
   executeDivision(20, 10);
   
   std::cout << '\n';
   
}

If the division fails, the exception is caught in the try block (4) and stored in the shared state via intPromise.set_exception(std::current_exception()) (1); otherwise, the value is stored via intPromise.set_value(a / b) (2). When the future divResult calls divResult.get() (3), it either gets the value or the stored exception is rethrown. Therefore, like the promise (4), the future has to deal with the exception in its try-catch block (5). Dividing a number by 0 is undefined behavior. The function executeDivision displays the result of the calculation or the exception. See Figure 10.16.


Figure 10.16 A value and an exception as message

Sending a notification

If you use promises and futures (in short tasks) to synchronize threads, they have a lot in common with condition variables. Most of the time, promises and futures are the safer choice than condition variables.

Before I present you with an example, Table 10.3 shows the big picture.

Table 10.3 Condition variables versus tasks

Criteria                     Condition variables    Tasks
Multiple synchronizations    Yes                    No
Critical section             Yes                    No
Spurious wakeup              Yes                    No
Lost wakeup                  Yes                    No

The advantage of a condition variable over a promise and future is that you can use a condition variable to synchronize threads multiple times, whereas a promise can send its notification only once. For a one-time synchronization, a condition variable is a lot more challenging to use correctly: a promise-and-future pair needs no locks and is not prone to spurious or lost wakeups, and there are no critical sections or additional predicates.

// promiseFutureSynchronize.cpp
   
#include <future>
#include <iostream>
#include <thread>
#include <utility>
   
void waitingForWork(std::future<void> fut) {
   std::cout << "Waiting " << '\n';
   fut.wait();                                                                       // (5)
   std::cout << "Running " << '\n';
}
   
void setDataReady(std::promise<void> prom) {
   std::cout << "Data prepared" << '\n';
   prom.set_value();                                                                // (6)
}
   
int main() {
   
   std::cout << '\n';

   std::promise<void> sendReady;                                                    // (1)
   auto fut = sendReady.get_future();                                               // (2)
   
   std::thread t1(waitingForWork, std::move(fut));                                 // (3)
   std::thread t2(setDataReady, std::move(sendReady));                             // (4)
   
   t1.join();
   t2.join();
   
   std::cout << '\n';
   
}

Thanks to sendReady (1), you get a future fut (2). Both communication endpoints are moved into threads t1 (3) and t2 (4). The future waits using the call fut.wait() (5), and it gets the notification of the associated promise: prom.set_value() (6).

The structure and the output of the program match the corresponding program in the rule for condition variables: "CP.42: Don't wait without a condition." See Figure 10.17.


Figure 10.17 Notifications with a task

Lock-free programming

The rules on concurrency and parallelism target non-experts. Lock-free programming is an experts-only topic. Consequently, there are only a few short rules on lock-free programming.

CP.100

Don’t use lock-free programming unless you absolutely have to

This rule is the most important meta-rule of lock-free programming. If you don't believe me, here are a few quotes from talks given by worldwide recognized experts in this particular domain.

  • Herb Sutter: “Lock-free programming is like playing with knives” (CppCon 2014).

  • Anthony Williams: “Lock-free programming is about how to shoot yourself in the foot” (NDC 2016).

  • Tony Van Eerd: “Lock-free coding is the last thing you want to do” (NDC 2016).

  • Fedor Pikus: “Writing lock-free programs is hard. Writing correct lock-free programs is even harder” (NDC 2018).

CP.101

Distrust your hardware/compiler combination

What does “distrust your hardware/compiler combination” mean? Let me put it another way: When you break the sequential consistency, you also break your intuition with high probability. Let me start with a simple program.

// sequentialConsistency.cpp
   
#include <atomic>
#include <iostream>
#include <thread>
   
std::atomic<int> x{0};
std::atomic<int> y{0};
   
void writing(){ 
   x.store(2000);                   // (1)
   y.store(11);                    // (2)
}
   
void reading(){ 
   std::cout << y.load() << " ";   // (3)
   std::cout << x.load() << '\n';  // (4) 
}

int main(){
   std::thread thread1(writing);
   std::thread thread2(reading);
   thread1.join();
   thread2.join();
}

Here is the question about this short example: Which values for y and x are possible in (3) and (4)? Because y and x are atomic, no data race is possible. I further do not specify the memory ordering; therefore, sequential consistency applies. Sequential consistency means the following:

  • Each thread performs its operation in the specified sequence: (1) happens before (2), and (3) happens before (4).

  • There is a global order of all operations on all threads. To put it the other way around, each thread sees all operations in the same sequence.

If you combine these two properties of the sequential consistency, there is only one combination of x and y not possible: y == 11 and x == 0. Now let me break the sequential consistency and maybe your intuition.

The relaxed semantics is the weakest of all memory orderings. It boils down to one guarantee: operations on atomics are atomic, but there are no synchronization or ordering constraints.

// relaxedSemantic.cpp
   
#include <atomic>
#include <iostream>
#include <thread>
   
std::atomic<int> x{0};
std::atomic<int> y{0};
   
void writing(){ 
   x.store(2000, std::memory_order_relaxed); 
   y.store(11, std::memory_order_relaxed); 
}
   
void reading(){ 
   std::cout << y.load(std::memory_order_relaxed) << " "; 
   std::cout << x.load(std::memory_order_relaxed) << '\n'; 
}

int main(){
   std::thread thread1(writing);
   std::thread thread2(reading);
   thread1.join();
   thread2.join();
}

Two highly unintuitive phenomena can happen. First, thread2 can see the operations of thread1 in a different sequence. Second, thread1 can reorder its instructions because they are not performed on the same atomic. What does this mean for the possible values of x and y? y == 11 and x == 0 is now a possible result. Which results are possible depends on your hardware. See Table 10.4.

Table 10.4 Operation reordering on various platforms

Architecture         LoadLoad    LoadStore    StoreLoad    StoreStore
x86, AMD64                                    Yes
Alpha, IA64, RISC    Yes         Yes          Yes          Yes

For example, operation reordering is quite conservative on x86 or AMD64; stores can be reordered after loads, but on Alpha, IA64, or RISC (ARM) architectures, all four possible reorderings of store and load operations are allowed.

LoadLoad in the table means for one control flow that a load operation on an atomic is followed by a load operation on another atomic. The same argumentation applies to LoadStore, StoreLoad, and StoreStore.
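If you need the ordering back without paying for full sequential consistency, acquire-release semantics is typically sufficient. The following sketch is a variation of relaxedSemantic.cpp: when the acquire load of y reads 11, the store to x that is sequenced before the release store to y is guaranteed to be visible, so y == 11 and x == 0 is again impossible.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0};
std::atomic<int> y{0};

void writing(){
   x.store(2000, std::memory_order_relaxed);
   y.store(11, std::memory_order_release);                  // publishes the store to x
}

void reading(){
   while (y.load(std::memory_order_acquire) != 11);         // waits for the publication
   std::cout << x.load(std::memory_order_relaxed) << '\n';  // guaranteed to print 2000
}

int main(){
   std::thread thread1(writing);
   std::thread thread2(reading);
   thread1.join();
   thread2.join();
}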

CP.102

Carefully study the literature

Here are a few resources for outstanding literature. Study them first.

Related rules