Over 10 years after its first release, the Threading Building Blocks (TBB) library has become one of the most widely used C++ libraries for parallel programming. While it has retained its core philosophy and features, it continues to expand to address new opportunities and challenges as they arise.
In this chapter, we discuss the motivations for TBB, provide a brief overview of its main components, describe how to get the library, and then jump right into a few simple examples.
Why Threading Building Blocks?
Parallel programming has a long history, stretching back to the 1950s and beyond. For decades, scientists have been developing large-scale parallel simulations for supercomputers, and businesses have been developing enterprise applications for large multiprocessor mainframes. But a little over 10 years ago, the first multicore chips intended for desktops and laptops started to enter the marketplace. This was a game changer.
The number of processors in these first multicore desktop and laptop systems was small – only two cores – but the number of developers that had to become parallel programmers was huge. If multicore processors were to become mainstream, parallel programming had to become mainstream too, especially for developers who care about performance.
The TBB library was first released in September of 2006 to address the unique challenges of mainstream parallel programming. Its goal now, and when it was first introduced over 10 years ago, is to provide an easy and powerful way for developers to build applications that continue to scale as new platforms, with different architectures and more cores, are introduced. This “future-proofing” has paid off as the number of cores in mainstream processors has grown from two in 2006 to more than 64 in 2018!
To achieve this goal of future-proofing parallel applications against changes in the number and capabilities of processing cores, the key philosophy behind TBB is to make it easy for developers to express the parallelism in their applications, while limiting the control they have over the mapping of this parallelism to the underlying hardware. This philosophy can seem counterintuitive to some experienced parallel programmers. If we believe parallel programming must get maximum performance at all costs by programming to the bare metal of a system, and hand-tuning and optimizing applications to squeeze out every last bit of performance, then TBB may not be for us. Instead, the TBB library is for developers that want to write applications that get great performance on today’s platforms but are willing to give up a little performance to ensure that their applications continue to perform well on future systems.
To achieve this end, the interfaces in TBB let us express the parallelism in our applications but provide flexibility to the library so it can effectively map this parallelism to current and future platforms, and to adapt it to dynamic changes in system resources at runtime.
Performance: Small Overhead, Big Benefits for C++
We do not mean to make too big a deal about performance loss, nor do we wish to deny it. For simple C++ code written in a “Fortran” style, with a single layer of well-balanced parallel loops, the dynamic nature of TBB may not be needed at all. However, the limitations of such a coding style are an important factor in why TBB exists. TBB was designed to efficiently support nested, concurrent, and sequential composition of parallelism and to dynamically map this parallelism on to a target platform. Using a composable library like TBB, developers can build applications by combining components and libraries that contain parallelism without worrying that they will negatively interfere with each other. Importantly, TBB does not require us to restrict the parallelism we express to avoid performance problems. For large, complicated applications using C++, TBB is therefore easy to recommend without disclaimers.
The TBB library has evolved over the years to not only adjust to new platforms but also to demands from developers that want a bit more control over the choices the library makes in mapping parallelism to the hardware. While TBB 1.0 had very few performance controls for users, TBB 2019 has quite a few more – such as affinity controls, constructs for work isolation, hooks that can be used to pin threads to cores, and so on. The developers of TBB worked hard to design these controls to provide just the right level of control without sacrificing composability.
The interfaces provided by the library are nicely layered – TBB provides high-level templates that suit the needs of most programmers, focusing on common cases. But it also provides low-level interfaces so we can drill down and create tailored solutions for our specific applications if needed. TBB has the best of both worlds. We typically rely on the default choices of the library to get great performance but can delve into the details if we need to.
Evolving Support for Parallelism in TBB and C++
Both the TBB library and the C++ language have evolved significantly since the introduction of the original TBB. In 2006, C++ had no language support for parallel programming, and many libraries, including the Standard Template Library (STL), were not easily used in parallel programs because they were not thread-safe.
Even though we are big fans of TBB, we would in fact prefer that all of the fundamental support needed for parallelism were in the C++ language itself. That would allow TBB to utilize a consistent foundation on which to build higher-level parallelism abstractions. The original versions of TBB had to address a lack of C++ language support, and this is an area where the C++ standard has grown significantly to fill the foundational voids that TBB originally had no choice but to fill with features such as portable locks and atomics. Unfortunately, for C++ developers, the standard still lacks features needed for full support of parallel programming. Fortunately, for readers of this book, this means that TBB is still relevant and essential for effective threading in C++ and will likely stay relevant for many years to come.
It is very important to understand that we are not complaining about the C++ standard process. Adding features to a language standard is best done very carefully, with careful review. The C++11 standard committee, for instance, spent huge energy on a memory model. The significance of this for parallel programming is critical for every library that builds upon the standard. There are also limits to what a language standard should include, and what it should support. We believe that the tasking system and the flow graph system in TBB are not something that will directly become part of a language standard. Even if we are wrong, it is not something that will happen anytime soon.
Recent C++ Additions for Parallelism
As shown in Figure 1-1, the C++11 standard introduced some low-level, basic building blocks for threading, including std::async, std::future, and std::thread. It also introduced atomic variables, mutual exclusion objects, and condition variables. These extensions require programmers to do a lot of coding to build up higher-level abstractions – but they do allow us to express basic parallelism directly in C++. The C++11 standard was a clear improvement when it comes to threading, but it doesn’t provide us with the high-level features that make it easy to write portable, efficient parallel code. It also does not provide us with tasks or an underlying work-stealing task scheduler.
The C++17 standard introduced features that raise the level of abstraction above these low-level building blocks, making it easier for us to express parallelism without having to worry about every low-level detail. As we discuss later in this book, there are still some significant limitations, and so these features are not yet sufficiently expressive or performant – there’s still a lot of work to do in the C++ standard.
The most pertinent of these C++17 additions are the execution policies that can be used with the Standard Template Library (STL) algorithms. These policies let us choose whether an algorithm can be safely parallelized, vectorized, parallelized and vectorized, or if it needs to retain its original sequenced semantics. We call an STL implementation that supports these policies a Parallel STL.
Looking into the future, there are proposals that might be included in a future C++ standard with even more parallelism features, such as resumable functions, executors, task blocks, parallel for loops, SIMD vector types, and additional execution policies for the STL algorithms.
The Threading Building Blocks (TBB) Library
The features of the TBB library, shown in Figure 1-2, can be categorized into two large groups: interfaces for expressing parallel computations and interfaces that are independent of the execution model.
Parallel Execution Interfaces
When we use TBB to create parallel programs, we express the parallelism in the application using one of the high-level interfaces or directly with tasks. We discuss tasks in more detail later in this book, but for now we can think of a TBB task as a lightweight object that defines a small computation and its associated data. As TBB developers, we express our application using tasks, either directly or indirectly through the prepackaged TBB algorithms, and the library schedules these tasks on to the platform’s hardware resources for us.
It’s important to note that as developers, we may want to express different kinds of parallelism. The three most common layers of parallelism that are expressed in parallel applications are shown in Figure 1-3. We should note that some applications may contain all three layers and others may contain only one or two of them. One of the most powerful aspects of TBB is that it provides high-level interfaces for each of these different parallelism layers, allowing us to exploit all of the layers using the same library.
The message-driven layer shown in Figure 1-3 captures parallelism that is structured as relatively large computations that communicate to each other through explicit messages. Common patterns in this layer include streaming graphs, data flow graphs, and dependency graphs. In TBB, these patterns are supported through the Flow Graph interfaces (described in Chapter 3).
Finally, the Single Instruction, Multiple Data (SIMD) layer is where data parallelism is exploited by applying the same operation to multiple data elements simultaneously. This type of parallelism is often implemented using vector extensions such as AVX, AVX2, and AVX-512 that use the vector units available in each processor core. There is a Parallel STL implementation (described in Chapter 4) included with all of the TBB distributions that provides vector implementations, among others, that take advantage of these extensions.
TBB provides high-level interfaces for many common parallel patterns, but there may still be cases where none of the high-level interfaces matches a problem. If that’s the case, we can use TBB tasks directly to build our own algorithms.
The true power of the TBB parallel execution interfaces comes from the ability to mix them together, something usually called “composability.” We can create applications that have a Flow Graph at the top level with nodes that use nested Generic Parallel Algorithms. These nested Generic Parallel Algorithms can, in turn, use Parallel STL algorithms in their bodies. Since the parallelism expressed by all of these layers is exposed to the TBB library, this one library can schedule the corresponding tasks in an efficient and composable way, making best use of the platform’s resources.
One of the key properties of TBB that makes it composable is that it supports relaxed sequential semantics. Relaxed sequential semantics means that the parallelism we express using TBB tasks is in fact only a hint to the library; there is no guarantee that any of the tasks actually execute in parallel with each other. This gives tremendous flexibility to the TBB library to schedule tasks as necessary to improve performance. This flexibility lets the library provide scalable performance on systems, whether they have one core, eight cores, or 80 cores. It also allows the library to adapt to the dynamic load on the platform; for example, if one core is oversubscribed with work, TBB can schedule more work on the other cores or even choose to execute a parallel algorithm using only a single core. We describe in more detail why TBB is considered a composable library in Chapter 9.
Interfaces That Are Independent of the Execution Model
Unlike the parallel execution interfaces, the second large group of features in Figure 1-2 are completely independent of the execution model and of TBB tasks. These features are as useful in applications that use native threads, such as pthreads or WinThreads, as they are in applications that use TBB tasks.
These features include concurrent containers that provide thread-friendly interfaces to common data structures like hash tables, queues, and vectors. They also include features for memory allocation like the TBB scalable memory allocator and the cache aligned allocator (both described in Chapter 7). They also include lower-level features such as synchronization primitives and thread-local storage.
Using the Building Blocks in TBB
As developers, we can pick and choose the parts of TBB that are useful for our applications. We can, for example, use just the scalable memory allocator (described in Chapter 7) and nothing else. Or, we can use concurrent containers (described in Chapter 6) and a few Generic Parallel Algorithms (Chapter 2). And of course, we can also choose to go all in and build an application that combines all three high-level execution interfaces and makes use of the TBB scalable memory allocator and concurrent containers, as well as the many other features in the library.
Let’s Get Started Already!
Getting the Threading Building Blocks (TBB) Library
Follow links at www.threadingbuildingblocks.org or https://software.intel.com/intel-tbb to get a free version of the TBB library directly from Intel. There are precompiled versions available for Windows, Linux, and macOS. The latest packages include both the TBB library and an implementation of the Parallel STL algorithms that uses TBB for threading.
Visit https://github.com/intel/tbb to get the free, open-source version of the TBB library. The open-source version of TBB is in no way a lite version of the library; it contains all of the features of the commercially supported version. You can choose to check out and build from source, or you can click “releases” to download a version that has been built and tested by Intel. At GitHub, pre-built and tested versions are available for Windows, Linux, macOS, and Android. Again, the latest packages for the pre-built versions of TBB include both the TBB library and an implementation of Parallel STL that uses TBB for threading. If you want the source code for Parallel STL, you will need to download that separately from https://github.com/intel/parallelstl.
You can download a copy of the Intel Parallel Studio XE tool suite from https://software.intel.com/intel-parallel-studio-xe. TBB and a Parallel STL that uses TBB are currently included in all editions of this tool suite, including the smallest Composer Edition. If you have a recent version of the Intel C++ compiler installed, you likely already have TBB installed on your system.
We leave it to readers to select the most appropriate route for getting TBB and to follow the directions for installing the package that are provided at the corresponding site.
Getting a Copy of the Examples
All of the code examples used in this book are available at https://github.com/Apress/pro-TBB. In this repository, there is a directory for each chapter. Many of the source files are named after the figure they appear in; for example, ch01/fig_1_04.cpp contains code that matches Figure 1-4 in this chapter.
Writing a First “Hello, TBB!” Example
In both Figures 1-4 and 1-5, we use C++ lambda expressions to specify the functions. Lambda expressions are very useful when using libraries like TBB to specify the user code to execute as a task. To help review C++ lambda expressions, we offer a callout box “A Primer on C++ Lambda Expressions” with an overview of this important modern C++ feature.
A Primer on C++ Lambda Expressions
[ capture-list ] ( params ) -> ret { body }
capture-list is a comma-separated list of captures. We capture a variable by value by listing the variable name in the capture-list. We capture a variable by reference by prefixing it with an ampersand, for example, &v. And we can use this to capture the current object by reference. There are also defaults: [=] is used to capture all automatic variables used in the body by value and the current object by reference, [&] is used to capture all automatic variables used in the body as well as the current object by reference, and [] captures nothing.
params is the list of function parameters, just like for a named function.
ret is the return type. If ->ret is not specified, it is inferred from the return statements.
body is the function body.
Wherever we use a C++ lambda expression, we can substitute it with an instance of a function object like the preceding one. In fact, the TBB library predates the C++11 standard and all of its interfaces used to require passing in instances of objects of user-defined classes. C++ lambda expressions simplify the use of TBB by eliminating the extra step of defining a class for each use of a TBB algorithm.
Building the Simple Examples
Once we have written the examples in Figures 1-4 and 1-5, we need to build executable files from them. The instructions for building an application that uses TBB are OS and compiler dependent. However, in general, there are two necessary steps to properly configure an environment.
Steps to Set Up an Environment
1. We must inform the compiler about the location of the TBB headers and libraries. If we use Parallel STL interfaces, we must also inform the compiler about the location of the Parallel STL headers.
2. We must configure our environment so that the application can locate the TBB libraries when it is run. TBB is shipped as a dynamically linked library, which means that it is not directly embedded into our application; instead, the application locates and loads it at runtime. The Parallel STL interfaces do not require their own dynamically linked library but do depend on the TBB library.
We will now briefly discuss some of the most common ways to accomplish these steps on Windows and Linux. The instructions for macOS are similar to those for Linux. There are additional cases and more detailed directions in the documentation that ships with the TBB library.
Building on Windows Using Microsoft Visual Studio
If we download either the commercially supported version of TBB or a version of Intel Parallel Studio XE, we can integrate the TBB library with Microsoft Visual Studio when we install it, and then it is very simple to use TBB from Visual Studio.
On Windows systems, the TBB libraries that are dynamically loaded by the application executable at runtime are the “.dll” files. To complete step 2 in setting up our environment, we need to add the location of these files to our PATH environment variable. We can do this by adding the path to either our Users or System PATH variable. One place to find these settings is in the Windows Control Panel by traversing System and Security ➤ System ➤ Advanced System Settings ➤ Environment Variables. We can refer to the documentation for our installation of TBB for the exact locations of the “.dll” files.
Note
Changes to the PATH variable in an environment only take effect in Microsoft Visual Studio after it is restarted.
Once we have the source code entered, have Use TBB set to Yes, and have the path to the TBB “.dll”s in our PATH variable, we can build and execute the program by entering Ctrl-F5.
Building on a Linux Platform from a Terminal
Using the Intel Compiler
tbbvars and pstlvars Scripts
If we are not using the Intel C++ Compiler, we can use scripts that are included with the TBB and Parallel STL distributions to set up our environment. These scripts modify the CPATH, LIBRARY_PATH and LD_LIBRARY_PATH environment variables to include the directories needed to build and run TBB and Parallel STL applications. The CPATH variable adds additional directories to the list of directories the compiler searches when it looks for #include files. The LIBRARY_PATH adds additional directories to the list of directories the compiler searches when it looks for libraries to link against at compile time. And the LD_LIBRARY_PATH adds additional directories to the list of directories the executable will search when it loads dynamic libraries at runtime.
Let us assume that the root directory of our TBB installation is TBB_ROOT. The TBB library comes with a set of scripts in the ${TBB_ROOT}/bin directory that we can execute to properly set up the environment. We need to pass our architecture type [ia32|intel64|mic] to this script. We also need to add a flag at compile time to enable the use of C++11 features, such as our use of lambda expressions.
Even though the Parallel STL headers are included with all of the recent TBB library packages, we need to take an extra step to add them to our environment. Just like with TBB, Parallel STL comes with a set of scripts in the ${PSTL_ROOT}/bin directory. The PSTL_ROOT directory is typically a sibling of the TBB_ROOT directory. We also need to pass in our architecture type and enable the use of C++11 features to use Parallel STL.
Note
Increasingly, Linux distributions include a copy of the TBB library. On these platforms, the GCC compiler may link against the platform’s version of the TBB library instead of the version of the TBB library that is added to the LIBRARY_PATH by the tbbvars script. If we see linking problems when using TBB, this might be the issue. If this is the case, we can add an explicit library path to the compiler’s command line to choose a specific version of the TBB library.
For example:
g++ -L${TBB_ROOT}/lib/intel64/gcc4.7 -ltbb ...
We can add -Wl,--verbose to the g++ command line to generate a report of all of the libraries that are being linked against during compilation to help diagnose this issue.
Although we show commands for g++, except for the compiler name used, the command lines are the same for the Intel compiler (icpc) or LLVM (clang++).
Setting Up Variables Manually Without Using the tbbvars Script or the Intel Compiler
Sometimes we may not want to use the tbbvars scripts, either because we want to know exactly what variables are being set or because we need to integrate with a build system. If that’s not the case for you, skip over this section unless you really feel the urge to do things manually.
Since you’re still reading this section, let’s look at how we can build and execute on the command line without using the tbbvars script. When compiling with a non-Intel compiler, we don’t have the -tbb flag available to us, so we need to specify the paths to both the TBB headers and the shared libraries.
If the root directory of our TBB installation is TBB_ROOT, the headers are in ${TBB_ROOT}/include and the shared library files are stored in ${TBB_ROOT}/lib/${ARCH}/${GCC_LIB_VERSION}, where ARCH is the system architecture [ia32|intel64|mic] and the GCC_LIB_VERSION is the version of the TBB library that is compatible with your GCC or clang installation.
The underlying difference between the TBB library versions is their dependencies on features in the C++ runtime libraries (such as libstdc++ or libc++).
Typically, to find an appropriate TBB version to use, we can execute the command gcc --version in our terminal. We then select the closest GCC version available in ${TBB_ROOT}/lib/${ARCH} that is not newer than our GCC version (this usually works even when we are using clang++). But because installations can vary from machine to machine, and we can choose different combinations of compilers and C++ runtimes, this simple approach may not always work. If it does not, refer to the TBB documentation for additional guidance.
A More Complete Example
The previous sections provide the steps to write, build, and execute a simple TBB application and a simple Parallel STL application that each print a couple of lines of text. In this section, we write a bigger example that can benefit from parallel execution using all three of the high-level execution interfaces shown in Figure 1-2. We do not explain all of the details of the algorithms and features used to create this example, but instead we use this example to see the different layers of parallelism that can be expressed with TBB. This example is admittedly contrived. It is simple enough to explain in a few paragraphs but complicated enough to exhibit all of the parallelism layers described in Figure 1-3. The final multilevel parallel version we create here should be viewed as a syntactic demonstration, not a how-to guide on how to write an optimal TBB application. In subsequent chapters, we cover all of the features used in this section in more detail and provide guidance on how to use them to get great performance in more realistic applications.
Starting with a Serial Implementation
A Note on Smart Pointers
One of the most challenging parts of programming in C/C++ can be dynamic memory management. When we use new/delete or malloc/free, we have to be sure that we match them up correctly to avoid memory leaks and double frees. Smart pointers including unique_ptr, shared_ptr, and weak_ptr were introduced in C++11 to provide automatic, exception-safe memory management. For example, if we allocate an object by using make_shared, we receive a smart pointer to the object. As we assign this shared pointer to other shared pointers, the C++ library takes care of reference counting for us. When there are no outstanding references to our object through any smart pointers, then the object is automatically freed. In most of the examples in this book, including in Figure 1-7, we use smart pointers instead of raw pointers. Using smart pointers, we don’t have to worry about finding all of the points where we need to insert a free or delete – we can just rely on the smart pointers to do the right thing.
Adding a Message-Driven Layer Using a Flow Graph
As we will see in Chapter 3, several steps are needed to build and execute a TBB Flow Graph. First, a graph object, g, is constructed. Next, we construct the nodes that represent the computations in our data flow graph. The node that streams the images to the rest of the graph is a source_node named src. The computations are performed by the function_node objects named gamma, tint, and write. We can think of a source_node as a node that has no input and continues to send data until it runs out of data to send. We can think of a function_node as a wrapper around a function that receives an input and generates an output.
After the nodes are created, we connect them to each other using edges. Edges represent the dependencies or communication channels between nodes. Since, in our example in Figure 1-10, we want the src node to send the initial images to the gamma node, we make an edge from the src node to the gamma node. We then make an edge from the gamma node to the tint node. And likewise, we make an edge from the tint node to the write node. Once we complete construction of the graph’s structure, we call src.activate() to start the source_node and call g.wait_for_all() to wait until the graph completes.
When the application in Figure 1-10 executes, each image generated by the src node passes through the pipeline of nodes as described previously. When an image is sent to the gamma node, the TBB library creates and schedules a task to apply the gamma node’s body to the image. When that processing is done, the output is fed to the tint node. Likewise, TBB will create and schedule a task to execute the tint node’s body on that output of the gamma node. Finally, when that processing is done, the output of the tint node is sent to the write node. Again, a task is created and scheduled to execute the body of the node, in this case writing the image to a file. Each time an execution of the src node finishes and returns true, a new task is spawned to execute the src node’s body again. Only after the src node stops generating new images and all of the images it has already generated have completed processing in the write node will the wait_for_all call return.
Adding a Fork-Join Layer Using a parallel_for
Adding a SIMD Layer Using a Parallel STL Transform
We can further optimize our two computational kernels by replacing the inner j-loops with calls to the Parallel STL function transform. The transform algorithm applies a function to each element in an input range, storing the results into an output range. The arguments to transform are (1) the execution policy, (2 and 3) the input range of elements, (4) the beginning of the output range, and (5) the lambda expression that is applied to each element in the input range and whose result is stored to the output elements.
In Figure 1-12, each Image::Pixel object contains an array with four single byte elements, representing the blue, green, red, and alpha values for that pixel. By using the unseq execution policy, a vectorized loop is used to apply the function across the row of elements. This level of parallelization corresponds to the SIMD layer in Figure 1-3 and takes advantage of the vector units in the CPU core that the code executes on but does not spread the computation across different cores.
Note
Passing an execution policy to a Parallel STL algorithm does not guarantee parallel execution. It is legal for the library to choose a more restrictive execution policy than the one requested. It is therefore important to check the impact of using an execution policy – especially one that depends on compiler implementations!
While the examples we created in Figure 1-7 through Figure 1-12 are a bit contrived, they demonstrate the breadth and power of the TBB library’s parallel execution interfaces. Using a single library, we expressed message-driven, fork-join, and SIMD parallelism, composing them together into a single application.
Summary
In this chapter, we started by explaining why a library such as TBB is even more relevant today than it was when it was first introduced over 10 years ago. We then briefly looked at the major features in the library, including the parallel execution interfaces and the other features that are independent of the execution interfaces. We saw that the high-level execution interfaces map to the common message-driven, fork-join, and SIMD layers that are found in many parallel applications. We then discussed how to get a copy of TBB and verify that our environment is correctly set up by writing, compiling, and executing very simple examples. We concluded the chapter by building a more complete example that uses all three high-level execution interfaces.
We are now ready to walk through the key support for parallel programming in the next few chapters: Generic Parallel Algorithms (Chapter 2), Flow Graphs (Chapter 3), Parallel STL (Chapter 4), Synchronization (Chapter 5), Concurrent Containers (Chapter 6), and Scalable Memory Allocation (Chapter 7).