Glossary
Abstraction:
In the case of TBB, abstraction serves to separate the work into work appropriate for a programmer and work best left to a runtime. The goal of such an abstraction is to deliver scalable high performance on a variety of multicore and many-core systems, and even heterogeneous platforms, without requiring rewriting of the code. This careful division of responsibilities leaves the programmer to expose opportunities for parallelism, and the runtime responsible for mapping the opportunities to the hardware. Code written to the abstraction will be free of parameterization for cache sizes, number of cores, and even consistency of performance from processing unit to processing unit.
Affinity:
The specification of methods to associate a particular software thread with a particular hardware thread, usually with the objective of getting better or more predictable performance. Affinity specifications include the concept of spreading threads maximally apart to reduce contention (scatter), or packing them tightly together (compact) to minimize communication distances. OpenMP supports a rich set of affinity controls at various levels, from abstract to full manual control. Fortran 2008 does not specify controls, but Intel reuses the OpenMP controls for “do concurrent.” Threading Building Blocks (TBB) provides an abstract loop-to-loop affinity biasing capability.
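As a sketch of TBB’s loop-to-loop affinity biasing (the function name, array, and loop body here are illustrative, not from the text), a single tbb::affinity_partitioner object can be kept alive across repeated parallel loops so the runtime can replay iterations onto the cores that touched the same data previously:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>
    #include <vector>

    void relax(std::vector<float>& a) {
      // One partitioner object, kept alive across calls, remembers where
      // each chunk of the range ran last time and biases it there again.
      static tbb::affinity_partitioner ap;
      tbb::parallel_for(
          tbb::blocked_range<size_t>(0, a.size()),
          [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
              a[i] *= 0.5f;
          },
          ap);
    }

Calling relax repeatedly on the same data lets cache reuse between loop invocations benefit from the biasing.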
Algorithm
is a term that TBB has used in association with general, reusable solutions to common parallel programming problems. TBB, and this book, therefore use the term ‘parallel algorithm’ when ‘parallel pattern’ would also be an appropriate description.
Amdahl’s Law:
Speedup is limited by the nonparallelizable serial portion of the work. For example, if two thirds of a program can be run in parallel but one third cannot be sped up by parallelism, then speedup can only approach 3× and never exceed it, assuming the same work is done. If scaling the problem size places more demands on the parallel portions of the program, then Amdahl’s Law is not as bad as it may seem. See
Gustafson-Barsis’ law
.
Atom
is touted as a hackable text editor for the twenty-first century, and it is open source. Rafa says “I also love emacs, but now Atom is winning this space on my Mac.” Compare to the vi and emacs editors.
Atomic operation
is an operation that is guaranteed to appear as if it occurred indivisibly, without interference from other threads. For example, a processor might provide a memory increment operation. This operation needs to read a value from memory, increment it, and write it back to memory. An atomic increment guarantees that the final memory value is the same as would have occurred if no other operations on that memory location were allowed to happen between the read and the write.
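A minimal sketch in standard C++ (the counter name and thread count are arbitrary): fetch_add performs the read, increment, and write as one indivisible operation, so no updates are lost even when several threads increment concurrently.

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
      std::atomic<int> counter{0};
      std::vector<std::thread> workers;
      for (int t = 0; t < 4; ++t)
        workers.emplace_back([&] {
          for (int i = 0; i < 100000; ++i)
            counter.fetch_add(1);     // indivisible read-increment-write
        });
      for (auto& w : workers) w.join();
      std::cout << counter << "\n";   // always prints 400000
      return 0;
    }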
Barrier:
When a computation is broken into phases, it is often necessary to ensure that all threads complete all the work in one phase before any thread moves onto another phase. A barrier is a form of synchronization that ensures this: threads arriving at a barrier wait there until the last thread arrives, then all threads continue. A barrier can be implemented using atomic operations. For example, all threads might try to increment a shared variable, then block if the value of that variable does not equal the number of threads that need to synchronize at the barrier. The last thread to arrive can then reset the barrier to zero and release all the blocked threads.
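A minimal sketch (not taken from any library) of a single-use barrier built from an atomic counter, along the lines just described; a reusable barrier needs additional care (e.g., sense reversal), and C++20 offers std::barrier directly.

    #include <atomic>

    class SpinBarrier {
      const int n_threads;
      std::atomic<int> arrived{0};
    public:
      explicit SpinBarrier(int n) : n_threads(n) {}
      void wait() {
        arrived.fetch_add(1);                // announce arrival
        while (arrived.load() < n_threads) {
          // spin until the last thread has arrived
        }
      }
    };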
Block
can be used in two senses: (1) a state in which a thread is unable to proceed while it waits for some synchronization event and (2) a region of memory. The second meaning is also used in the sense of dividing a loop into a set of parallel tasks of a suitable granularity.
Cache
is a part of the memory system that temporarily stores copies of data in fast memory so that future uses of that data can be handled more quickly than if it had to be fetched again from more distant storage. Caches are generally automatic and are designed to enhance programs with temporal locality and/or spatial locality. Caching systems in modern computers are usually multileveled.
Cache friendly
is a characteristic of an application in which performance increases as problem size increases but then levels off as the bandwidth limit is reached.
Cache lines
are the units in which data is retrieved and held by a cache; in order to exploit spatial locality, they are generally larger than a word. The general trend is toward increasing cache line sizes, which are generally large enough to hold at least two double-precision floating-point numbers, but unlikely to hold more than eight on any current design. Larger cache lines allow for more efficient bulk transfers from main memory but worsen certain issues, including false sharing, which generally degrades performance.
Cache oblivious algorithm
is any algorithm that performs well, without modification, on machines with different memory organizations, such as different levels of cache with different sizes. Since such algorithms are carefully designed to exhibit compact memory reuse, it seems like it would have made more sense to call such algorithms cache agnostic. The term
oblivious
is a reference to the fact that such algorithms are not aware of the parameters of the memory subsystem, such as the cache sizes or relative speeds. This is in contrast with earlier efforts to carefully block algorithms for specific cache hardware.
Cache unfriendly
is a characteristic of an application whose performance depends on the workload’s memory footprint staying near an optimal size. In this case, performance stays constant or increases as the problem size approaches the optimal size and then decreases as the problem size grows further. In these workloads, there is a definite “sweet spot.”
Clusters
are a set of computers with distributed memory communicating over a high-speed interconnect. The individual computers are often called
nodes
. TBB is used at the node level within a cluster, although multiple nodes are commonly programmed with TBB and then connected (usually with MPI).
Communication:
Any exchange of data or
synchronization
between software tasks or threads. Understanding that communication costs are often a limiting factor in scaling is a critical concept for parallel programming.
Composability:
The ability to use two components in concert with each other without causing failure or unreasonable conflict (ideally no conflict). Limitations on composability, if they exist, are best diagnosed completely at build time rather than requiring any testing; composability problems that manifest only at runtime are the biggest problem with non-composable systems. Composability can refer to system features, programming models, or software components.
Concurrent
means that things are logically happening simultaneously. Two tasks that are both logically active at some point in time are considered to be concurrent. This is in contrast with
parallel
.
Core:
A separate subprocessor on a multicore processor. A core should be able to support (at least one) separate and divergent flow of control from other cores on the same processor.
Data parallelism
is an approach to parallelism that attempts to be oriented around data rather than tasks. However, in reality, successful strategies in parallel algorithm development tend to focus on exploiting the parallelism in data, because data decomposition (generating tasks for different units of data) scales, but functional decomposition (generation of heterogeneous tasks for different functions) does not. See Amdahl’s Law and Gustafson-Barsis’ law.
Deadlock
is a programming error. Deadlock occurs when at least two tasks wait for each other and neither will resume until the other proceeds. This happens easily when code requires locking multiple mutexes: for example, each task can be holding a mutex required by the other task.
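A hedged sketch of the classic lock-ordering deadlock: each task acquires one mutex and then waits for the mutex the other task holds. Acquiring the mutexes in a consistent order everywhere (or locking both at once with std::scoped_lock) avoids the problem.

    #include <mutex>
    #include <thread>

    std::mutex a, b;

    void task1() {
      std::lock_guard<std::mutex> hold_a(a);  // holds a ...
      std::lock_guard<std::mutex> hold_b(b);  // ... then waits for b
    }

    void task2() {
      std::lock_guard<std::mutex> hold_b(b);  // holds b ...
      std::lock_guard<std::mutex> hold_a(a);  // ... then waits for a: deadlock possible
    }

    int main() {
      std::thread t1(task1), t2(task2);
      t1.join();
      t2.join();    // may never be reached if the two tasks deadlock
      return 0;
    }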
Deterministic
refers to an algorithm that behaves predictably: given a particular input, a deterministic algorithm will always produce the same output. The definition of what is the “same” may be important due to limited precision in mathematical operations and the likelihood that optimizations, including parallelization, will rearrange the order of operations. These are often referred to as “rounding” differences, which result when the order of mathematical operations to compute the answer differs between the original program and the final concurrent program. Concurrency is not the only factor that can lead to
nondeterministic
algorithms, but in practice it is often the cause. Using programming models with sequential semantics and eliminating data races with proper access controls will generally eliminate the major effects of concurrency other than the “rounding” differences.
Distributed memory
is memory that is physically located in separate computers. An indirect interface, such as message passing, is required to access memory on remote computers, while local memory can be accessed directly. Distributed memory is typically supported by clusters which, for purposes of this definition, we are considering to be a collection of computers. Since the memory on attached coprocessors also cannot typically be addressed directly from the host, it can be considered, for functional purposes, to be a form of distributed memory.
DSP
(Digital Signal Processor) is a computing device designed specifically for digital signal processing tasks such as those associated with radio communications, including filters, FFTs, and analog-to-digital conversion. The computational capabilities of DSPs alongside CPUs gave rise to some of the earliest examples of heterogeneous platforms and to various programming language extensions to control and interact with a DSP. OpenCL is a programming model that can help harness the compute aspects of DSPs. See also, heterogeneous platforms.
EFLOP/s
(ExaFLOP/s) = 10^18 Floating-Point Operations per second.
EFLOPs
(ExaFLOPs) = 10^18 Floating-Point Operations.
emacs
is the best text editor in the world (according to James), and it is open source. Compare to the vi editor. “emacs” is the first package James installs on any computer that he uses.
Embarrassing parallelism
is a description of an algorithm if it can be decomposed into a large number of independent tasks with little or no synchronization or communication required.
False sharing:
Two separate tasks in two separate cores may write to separate locations in memory, but if those memory locations happened to be allocated in the same cache line, the cache coherence hardware will attempt to keep the cache lines coherent, resulting in extra interprocessor communication and reduced performance, even though the tasks are not actually sharing data.
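A brief sketch of the problem and one common fix; the 64-byte cache line size assumed here is typical but hardware-dependent. In the first struct, the two per-thread counters will usually land in one cache line, so writes by different cores ping-pong the line even though no data is logically shared; in the second, each counter gets its own line.

    #include <atomic>

    struct Counters {                    // both fields likely share one cache line
      std::atomic<long> hits_thread0{0};
      std::atomic<long> hits_thread1{0};
    };

    struct PaddedCounters {              // each field aligned to its own cache line
      alignas(64) std::atomic<long> hits_thread0{0};
      alignas(64) std::atomic<long> hits_thread1{0};
    };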
Far memory:
In a NUMA system, memory that has longer access times than the near memory. The view of which parts of memory are near vs. far depends on the process from which code is running. We also refer to this memory as nonlocal memory (in contrast to local memory) in Chapter
20
.
Floating-point number
is a format for numbers in computers that trades a wider range for reduced precision by splitting the available bits between a number (mantissa) and a shift count (exponent) that places the point to the left or right of a fixed position. In contrast, fixed-point representations lack an explicit exponent, thereby allowing all bits to be used for the number (mantissa).
Floating-point operation
includes add, multiply, subtract, and more, done to floating-point numbers.
FLOP/s
= Floating-Point Operations per second.
FLOPs
= Floating-Point Operations.
Flow Graph
is a way to describe an algorithm using graph notation. Use of graph notation means that a flow graph consists of computational nodes connected by edges denoting possible control flow.
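A minimal sketch of a TBB flow graph with two nodes connected by an edge; the doubling and printing bodies are illustrative only.

    #include <tbb/flow_graph.h>
    #include <iostream>

    int main() {
      tbb::flow::graph g;
      tbb::flow::function_node<int, int> doubler(
          g, tbb::flow::unlimited, [](int x) { return 2 * x; });
      tbb::flow::function_node<int, int> printer(
          g, tbb::flow::serial, [](int x) { std::cout << x << "\n"; return x; });
      tbb::flow::make_edge(doubler, printer);   // edge: doubler -> printer
      for (int i = 0; i < 4; ++i)
        doubler.try_put(i);                     // feed messages into the graph
      g.wait_for_all();                         // wait until the graph drains
      return 0;
    }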
Forward scaling
is the concept of having a program or algorithm scalable already in threads and/or vectors so as to be ready to take advantage of growth of parallelism in future hardware with a simple recompile with a new compiler or relink to a new library. Using the right abstractions to express parallelism is normally a key to enabling forward scaling when writing a parallel program.
FPGA
(Field Programmable Gate Array) is a device that integrates a large number of gates (and often higher-level constructs such as DSPs, floating-point units, or network controllers) that remain unconnected to each other until the device is programmed. Programming was originally a whole-chip process intended to be done once when a system was started, but modern FPGAs support partial reconfigurability and are often dynamically loaded with new programs as a matter of course. Traditionally, FPGAs were viewed as a way to consolidate a large number of discrete chips in a design into a single FPGA – usually saving board space, power, and overall cost. As such, FPGAs were programmed using tools similar to those used to design circuitry at a board or chip level – called hardware description languages (e.g., VHDL or Verilog). More recent use of FPGAs as compute engines has relied on the OpenCL programming model.
future-proofed:
A computer program written in a manner so it will survive future computer architecture changes without requiring significant changes to the program itself. Generally, the more abstract a programming method is, the more
future-proof
that program is. Lower level programming methods that in some way mirror computer architectural details will be less able to survive the future without change. Writing in an abstract, more
future-proof
fashion may involve tradeoffs in efficiency, however.
GFLOP/s
(GigaFLOP/s) = 10^9 Floating-Point Operations per second.
GFLOPs
(GigaFLOPs) = 10^9 Floating-Point Operations.
GPU
(Graphics Processing Unit) is a computing device designed to perform calculations associated with graphics such as lighting, transformations, clipping, and rendering. The computational capabilities of GPUs were originally designed solely for use in a “graphical pipeline” sitting between a general-purpose compute device (CPU) and displays. The emergence of programming support for using the computation without sending results to the display, and subsequent extensions to GPU designs, led to a more generalized compute capability being associated with many GPUs. OpenCL and CUDA are two popular programming models utilized to harness the compute aspects of GPUs. See also, heterogeneous platforms.
Grain
, as in
coarse-grained parallelism
or
fine-grained parallelism, or grain size
, all refer to the concept of “how much work” gets done before moving to a new task and/or potentially synchronizing. Programs scale best when grains are as large as possible (so threads can run independently) but small enough to keep every compute resource fully busy (load balancing). These two factors are somewhat at odds with each other, which creates the need to consider grain size. TBB works to automate partitioning, but a programmer can still help tune for the best performance based on knowledge of their algorithms.
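A hedged sketch of controlling grain size explicitly with TBB: the blocked_range is given a grain size of 1024 iterations (an arbitrary illustrative value), and simple_partitioner guarantees chunks are not subdivided below that size.

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>
    #include <vector>

    void scale(std::vector<double>& v) {
      tbb::parallel_for(
          tbb::blocked_range<size_t>(0, v.size(), /*grainsize=*/1024),
          [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
              v[i] *= 3.0;
          },
          tbb::simple_partitioner());  // never split chunks below the grain size
    }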
Gustafson-Barsis’ law
is a different view on
Amdahl’s Law
that factors in the fact that as problem sizes grow, the serial portion of computations tend to shrink as a percentage of the total work to be done.
Hardware thread
is a hardware implementation of a task with a separate flow of control. Multiple hardware threads can be implemented using multiple cores, or can run concurrently or simultaneously on one core in order to hide latency using methods such as hyper-threading of a processor core. In the latter case (hyperthreading or simultaneous multithreading, SMT), it is said that a physical core features several logical cores (or hardware threads).
Heterogeneous platforms
consist of a mixture of compute devices instead of a homogeneous collection of only CPUs. Heterogeneous computing is usually employed to provide specific acceleration via an attached device, such as a GPU, DSP, FPGA, and so on. See also
OpenCL
.
High-Performance Computing (HPC)
refers to the highest performance computing available at a point in time, which today generally means at least a petaFLOP/s of computational capability. The term HPC is occasionally used as a synonym for supercomputing, although supercomputing is probably more specific to even higher performance systems. While the use of HPC is spreading to more industries, it is generally associated with helping solve the most challenging problems in science and engineering. High-performance data analytics workloads, often using Artificial Intelligence (AI) and Machine Learning (ML) techniques, qualify as HPC workloads in their larger instantiations and often combine well with long standing (traditional) HPC workloads.
Hyper-threading
refers to multithreading on a single processor core with the purpose of more fully utilizing the functional units in an out-of-order core by bringing together more instructions than one software thread can supply. With hyper-threading, multiple hardware threads may run on one core and share resources, but some benefit is still obtained from parallelism or concurrency. Typically, each hyper-thread has, at least, its own register file and program counter, so that switching between hyper-threads is relatively lightweight. Hyper-threading is associated with Intel; see also simultaneous multithreading.
Latency
is the time it takes to complete a task; that is, the time between when the task begins and when it ends. Latency has units of time. The scale can be anywhere from nanoseconds to days. Lower latency is better in general.
Latency hiding
schedules computations on a processing element while other tasks using that core are waiting for long-latency operations to complete, such as memory or disk transfers. The latency is not actually hidden, since each task still takes the same time to complete, but more tasks can be completed in a given time since resources are shared more efficiently, so throughput is improved.
Load balancing
assigns tasks to resources while handling uneven sizes of tasks. Optimally, the goal of load balancing is to keep all compute devices busy with minimal waste due to overhead.
Locality
refers to utilizing memory locations that are closer, rather than further, apart. This will maximize reuse of cache lines, memory pages, and so on. Maintaining a high degree of locality of reference is a key to scaling.
Lock
is a mechanism for implementing
mutual exclusion
. Before entering a mutual exclusion region, a thread must first try to acquire a lock on that region. If the lock has already been acquired by another thread, the current thread must
block
, which it may do by either suspending operation or spinning. When the lock is released, then the current thread is free to acquire it. Locks can be implemented using
atomic operations
, which are themselves a form of mutual exclusion on basic operations, implemented in hardware.
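A minimal sketch of a lock protecting a mutual exclusion region in standard C++ (the variable names are arbitrary): std::scoped_lock acquires the mutex on entry, blocks if another thread holds it, and releases it automatically when the scope exits.

    #include <mutex>

    std::mutex m;
    long total = 0;

    void add(long x) {
      std::scoped_lock lock(m);  // blocks here if another thread holds m
      total += x;                // mutual exclusion region
    }                            // lock released when 'lock' is destroyed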
Loop-carried dependence
exists if the same data item (e.g., element [3] of an array) is written in one iteration of a loop and read in a different iteration of the loop. If there are no loop-carried dependencies, a loop can be vectorized or parallelized. If there is a loop-carried dependence, the direction (prior iteration vs. future iteration, also known as backward or forward) and the distance (the number of iterations separating the read and write) must be considered.
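A brief sketch contrasting the two cases (the function and array names are illustrative): the first loop has no loop-carried dependence and can be vectorized or parallelized; the second reads a value written by the previous iteration, a backward dependence of distance 1, so its iterations cannot simply run in parallel.

    // No loop-carried dependence: every iteration is independent.
    void add(int n, const float* a, const float* b, float* c) {
      for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
    }

    // Loop-carried dependence: iteration i reads a[i-1], which iteration
    // i-1 wrote (backward dependence, distance 1).
    void running_sum(int n, float* a, const float* b) {
      for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
    }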
Many-core processor
is a
multicore
processor with so many cores that in practice we do not enumerate them; there are just “lots.” The term has been generally used with processors with 32 or more cores, but there is no precise definition.
Megahertz era
is a historical period of time during which processors doubled clock rates at a rate similar to the doubling of transistors in a design, roughly every 2 years. Such rapid rise in processor clock speeds ceased at just under 4 GHz (four thousand megahertz) in 2004. Designs shifted toward adding more cores marking the shift to the
multicore era
.
Moore’s Law
is an observation that, over the history of semiconductors, the number of transistors in a dense integrated circuit has doubled approximately every 2 years.
Message Passing Interface (MPI)
is an industry-standard message-passing system designed to exchange data on a wide variety of parallel computers.
Multicore
is a processor with multiple subprocessors, each subprocessor (known as a
core
) supporting at least one hardware thread.
Multicore era
is the time after which processor designs shifted away from rapidly rising clock rates and shifted toward adding more cores. This era began roughly in 2005.
Node (in a cluster)
refers to a shared memory computer, often on a single board with multiple processors, that is connected with other nodes to form a
cluster
computer or supercomputer.
Nondeterministic:
Exhibiting a lack of deterministic behavior, so results can vary from run to run of an algorithm. Concurrency is not the only factor that can lead to nondeterministic algorithms, but in practice it is often the cause. See more in the definition for
deterministic
.
Non-Uniform Memory Access (NUMA):
Used to categorize memory design characteristics in a distributed memory architecture. NUMA = memory access latency is different for different memories. UMA = memory access latency is the same for all memory. Compare with UMA. See Chapter
20
.
Offload:
Placing part of a computation on an attached device such as an FPGA, GPU, or other accelerator.
OpenCL (Open Computing Language)
is a framework for writing programs that execute across heterogeneous platforms. OpenCL provides host APIs for controlling offloading and attached devices, and extensions to C/C++ to express code to run on the attached accelerator (GPUs, DSPs, FPGAs, etc.) with the ability to use the CPU as a fallback if the attached device is not present or available at runtime.
OpenMP
is an API that supports multiplatform shared memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems. It is made up of a set of compiler directives, library routines, and environment variables that influence runtime behavior. OpenMP is managed by the nonprofit technology consortium OpenMP Architecture Review Board and is jointly defined by a group of major computer hardware and software vendors (
http://openmp.org
).
Parallel
means actually happening simultaneously. Two tasks that are both actually doing work at some point in time are considered to be operating in parallel. When a distinction is made between concurrent and parallel, the key is whether work can ever be done simultaneously. Multiplexing of a single processor core, by multitasking operating systems, has allowed concurrency for decades even when simultaneous execution was impossible because there was only one processing core.
Parallelism
is doing more than one thing at a time. Attempts to classify types of parallelism are numerous.
Parallelization
is the act of transforming code to enable simultaneous activities. The parallelization of a program allows at least parts of it to execute in parallel.
Pattern
is a general, reusable solution to a common problem. Historically, TBB has used the term parallel algorithm when parallel pattern would also be an appropriate description.
PFLOP/s
(PetaFLOP/s) = 10^15 Floating-Point Operations per second.
PFLOPs
(PetaFLOPs) = 10^15 Floating-Point Operations.
Race conditions
are nondeterministic behaviors in a parallel program and are generally programming errors. A race condition occurs when concurrent tasks perform operations on the same memory location without proper synchronization, and one of the memory operations is a write. Code with a race may operate correctly sometimes and fail other times.
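A hedged sketch of a race: two threads increment the same plain int with no synchronization, so the read and write from different threads can interleave, updates can be lost, and the printed value varies from run to run. Declaring the counter std::atomic<int>, or guarding it with a mutex, removes the race.

    #include <iostream>
    #include <thread>

    int main() {
      int counter = 0;                   // shared and unsynchronized
      auto work = [&] {
        for (int i = 0; i < 100000; ++i)
          ++counter;                     // racy read-modify-write
      };
      std::thread t1(work), t2(work);
      t1.join();
      t2.join();
      std::cout << counter << "\n";      // often less than 200000
      return 0;
    }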
Recursion
is the act of a function being reentered while an instance of the function is still active in the same thread of execution. In the simplest and most common case, a function directly calls itself, although recursion can also occur between multiple functions. Recursion is supported by storing the state for the continuations of partially completed functions in dynamically allocated memory, such as on a stack, although if higher-order functions are supported a more complex memory allocation scheme may be required. Bounding the amount of recursion can be important to prevent excessive use of memory.
Scalability
is a measure of the increase in performance as a function of the availability of more hardware to use in parallel.
Scalable:
An application is
scalable
if its performance increases when additional parallel hardware resources are added. The term
strong-scaling
refers to scaling achieved on a fixed problem size as more compute resources are added.
Weak-scaling
refers to scaling achieved only when the problem size is scaled up as additional compute resources are added. See
scalability
.
Serial
means neither concurrent nor parallel.
Serialization
occurs when the tasks in a potentially parallel algorithm are executed in a specific serial order, typically due to resource constraints. The opposite of parallelization.
Shared memory:
When two units of parallel work can access data in the same location. Normally doing this safely requires synchronization. The units of parallel work, processes, threads, and tasks can all share data this way, if the physical memory system allows it. However, processes do not share memory by default and special calls to the operating system are required to set it up.
SIMD:
Single-instruction-multiple-data referring to the ability to process multiple pieces of data (such as elements of an array) with all the same operation. SIMD is a computer architecture within a widely used classification system known as Flynn’s taxonomy, first proposed in 1966.
Simultaneous multithreading
refers to multithreading on a single processor core. See also, hyper-threading.
Software thread
is a virtual hardware thread; in other words, a single flow of execution in software intended to map one for one to a hardware thread. An operating system typically enables many more software threads to exist than there are actual hardware threads, by mapping software threads to hardware threads as necessary. Having more software threads than hardware threads is known as
oversubscription
.
Spatial locality:
Nearby when measured in terms of distance (in memory address). Compare with temporal locality. Spatial locality refers to a program behavior where the use of one data element indicates that data nearby, often the next data element, will probably be used soon. Algorithms exhibiting good spatial locality in data usage can benefit from cache line structures and prefetching hardware, both common components in modern computers.
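A brief sketch using a row-major two-dimensional array (the array size is arbitrary): the first routine walks consecutive addresses and uses every element of each cache line it loads, while the second strides across rows and gets only one element per cache-line visit.

    const int N = 1024;
    float a[N][N];

    double sum_rowwise() {
      double sum = 0;
      // Good spatial locality: the inner loop touches consecutive addresses.
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          sum += a[i][j];
      return sum;
    }

    double sum_columnwise() {
      double sum = 0;
      // Poor spatial locality: consecutive accesses are N floats apart.
      for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
          sum += a[i][j];
      return sum;
    }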
Speedup
is the ratio between the latency for solving a problem with one processing unit vs. the latency for solving the same problem with multiple processing units in parallel.
SPMD:
Single-program-multiple-data refers to the ability to process multiple pieces of data (such as elements of an array) with the same program, in contrast with a more restrictive SIMD architecture. SPMD most often refers to message passing programming on distributed memory computer architectures. SPMD is a subcategory of MIMD computer architectures within a widely used classification system known as Flynn’s taxonomy, first proposed in 1966.
STL
(Standard Template Library) is a part of the C++ standard
.
Strangled scaling
refers to a programming error in which the performance of parallel code is poor due to high contention or overhead, so much so that it may underperform the nonparallel (serial) code.
Symmetric Multiprocessor (SMP)
is a multiprocessor system with shared memory and running a single operating system.
Synchronization:
The coordination of tasks or threads in order to obtain the desired runtime order. Commonly used to avoid undesired race conditions.
Task:
A lightweight unit of potential parallelism with its own control flow, generally implemented at a user-level as opposed to OS-managed threads. Unlike threads, tasks are usually serialized on a single core and run to completion. When contrasted with “thread” the distinction is made that tasks are pieces of work without any assumptions about where they will run, while threads have a one-to-one mapping of software threads to hardware threads. Threads are a mechanism for executing tasks in parallel, while tasks are units of work that merely provide the
opportunity
for parallel execution; tasks are not themselves a mechanism of parallel execution.
Task parallelism:
An attempt to classify parallelism as more oriented around tasks than data. We deliberately avoid this term, task parallelism, because its meaning varies. In particular, elsewhere “task parallelism” can refer to tasks generated by functional decomposition
or
to irregular tasks that are still generated by data decomposition. In this book, any parallelism generated by data decomposition, regular or irregular, is considered data parallelism.
TBB:
See Threading Building Blocks (TBB).
Temporal locality
means nearby when measured in terms of time. Compare with spatial locality. Temporal locality refers to a program behavior in which data is likely to be reused relatively soon. Algorithms exhibiting good temporal locality in data usage can benefit from data caching, which is common in modern computers. It is not unusual to be able to achieve both temporal and spatial locality in data usage. Computer systems are generally more likely to achieve optimal performance when both are achieved, hence the interest in designing algorithms to do so.
Thread
could refer to a
software
or
hardware
thread. In general, a “software thread” is any software unit of parallel work with an independent flow of control, and a “hardware thread” is any hardware unit capable of executing a single flow of control (in particular, a hardware unit that maintains a single program counter). When “thread” is compared with “task,” the distinction is made that tasks are pieces of work without any assumptions about where they will run, while threads have a one-to-one mapping of software threads to hardware threads. Threads are a mechanism for implementing tasks. A multitasking or multithreading operating system will multiplex multiple software threads onto a single hardware thread by interleaving execution via software-created time slices. A multicore or many-core processor consists of multiple cores, each able to execute at least one independent software thread through duplication of hardware. A multithreaded or hyper-threaded processor core will multiplex a single core to execute multiple software threads through interleaving of software threads via hardware mechanisms.
Thread parallelism
is a mechanism for implementing parallelism in hardware using a separate flow of control for each task.
Thread local storage
refers to data that is purposefully allocated with the intent that it be accessed from only a single thread, at least during concurrent computations. The goal is to avoid the need for synchronization during the most computationally intense parts of an algorithm. A classic example of thread local storage is creating partial sums when adding all the numbers in a large array: subregions are first summed in parallel into local partial sums (also known as privatized variables) that, by virtue of being local/private, require no global synchronization to accumulate into.
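A hedged sketch of the partial-sums example using tbb::combinable, one of the ways TBB exposes thread local storage (tbb::parallel_reduce would be the more direct tool for a sum, but the point here is the private per-thread copies): each thread accumulates into its own local() value with no locking, and the copies are merged once at the end.

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/combinable.h>
    #include <vector>

    double sum(const std::vector<double>& v) {
      tbb::combinable<double> partial([] { return 0.0; });  // one private sum per thread
      tbb::parallel_for(
          tbb::blocked_range<size_t>(0, v.size()),
          [&](const tbb::blocked_range<size_t>& r) {
            double& local = partial.local();                // this thread's copy
            for (size_t i = r.begin(); i != r.end(); ++i)
              local += v[i];                                // no synchronization needed
          });
      return partial.combine([](double x, double y) { return x + y; });
    }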
Threading Building Blocks (TBB)
is the most popular abstract solution for parallel programming in C++. TBB is an open source project created by Intel that has been ported to a very wide range of operating systems and processors from many vendors. Although TBB is more popular than OpenMP in terms of the number of developers using it, the two seldom compete for developers in reality: TBB is popular with C++ programmers, whereas OpenMP is most used by Fortran and C programmers.
Throughput
is the rate at which tasks are completed, given a set of tasks to be performed. Throughput measures the rate of computation, and it is given in units of tasks per unit time. See
bandwidth
and
latency
.
TFLOP/s
(TeraFLOP/s) = 10^12 Floating-Point Operations per second.
TFLOPs
(TeraFLOPs) = 10^12 Floating-Point Operations.
Tiling
is when you divide a loop into a set of parallel tasks of a suitable granularity. In general, tiling consists of applying multiple steps on a smaller part of a problem instead of running each step on the whole problem one after the other. The purpose of tiling is to increase reuse of data in caches. Tiling can lead to dramatic performance increases when a whole problem does not fit in cache. We prefer the term “tiling” instead of “blocking” and “tile” instead of “block”; tiling and tile have become the more common terms in recent times.
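A hedged sketch of tiling applied to a matrix transpose; the tile size of 64 is an arbitrary illustrative choice. Instead of striding across the whole matrix, the loops finish one tile at a time so the source and destination tiles both stay in cache.

    #include <algorithm>

    // Transpose an n x n row-major matrix 'a' into 'b', one tile at a time.
    void transpose_tiled(const float* a, float* b, int n) {
      const int TILE = 64;                                // illustrative tile size
      for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
          for (int i = ii; i < std::min(ii + TILE, n); ++i)
            for (int j = jj; j < std::min(jj + TILE, n); ++j)
              b[j * n + i] = a[i * n + j];
    }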
TLB
is an abbreviation for Translation Lookaside Buffer. A TLB is a specialized cache that is used to hold translations of virtual to physical page addresses. The number of elements in the TLB determines how many pages of memory can be accessed simultaneously with good efficiency. Accessing a page not in the TLB will cause a TLB miss. A TLB miss typically causes a trap to the operating system so that the page table can be referenced and the TLB updated.
Trip count
is the number of times a given loop will execute (“trip”); same thing as
iteration count
.
Uniform Memory Access (UMA):
Used to categorize memory design characteristics in a distributed memory architecture. UMA = memory access latency is the same for all memory. NUMA = memory access latency is different for different memories. Compare with NUMA. See Chapter
20
.
Vector operations
are low-level operations that can act on multiple data elements at once in SIMD fashion.
Vector parallelism
is a mechanism for implementing parallelism in hardware using the same flow of control on multiple data elements.
Vectorization
is the act of transforming code to enable simultaneous computations using vector hardware. SIMD instruction sets such as MMX, SSE, AVX, AVX2, and AVX-512 utilize vector hardware, but vector hardware outside of CPUs may come in other forms that are also targeted by vectorization. The vectorization of code tends to enhance performance because more data is processed per instruction than would be done otherwise. See also
vectorize
.
Vectorize
refers to converting a program from a scalar implementation to a vectorized implementation to utilize vector hardware such as SIMD instructions (e.g., MMX, SSE, AVX, AVX2, AVX-512). Vectorization is a specialized form of parallelism.
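A brief sketch of a loop that compilers commonly vectorize (the function is the familiar saxpy kernel; the OpenMP simd pragma is optional and simply asserts that iterations are independent so the compiler may use SIMD instructions such as AVX, which can process eight floats at a time).

    void saxpy(int n, float a, const float* x, float* y) {
      #pragma omp simd
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];    // each iteration is independent
    }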
vi
is a text-based editor that was shipped with most UNIX and BSD systems written by Bill Joy, popular only to those who have yet to discover emacs (according to James). Yes, it is open source. Compares unfavorably to emacs and Atom. Yes, Ron, James did look at the “vi” nutshell book you gave to him… he still insists on using vi just long enough to get emacs downloaded and installed.
Virtual memory
decouples the address used by software from the physical addresses of real memory. The translation from virtual addresses to physical addresses is done in hardware that is initialized and controlled by the operating system.