What is the best method for scheduling parallel loops? How do we process data structures in parallel that do not support random-access iterators? What’s the best way to add parallelism to applications that look like pipelines? If the TBB library only provided tasks and a task scheduler, we would need to answer these questions ourselves. Luckily, we don’t need to plow through the many master’s theses and doctoral dissertations written on these topics. The TBB library developers have already done this dirty work for us! They provide the best-known methods for addressing these scenarios as template functions and template classes, a group of features known as the TBB generic parallel algorithms. These algorithms capture many of the processing patterns that are the cornerstones of multithreaded programming.
Note
The TBB library developers have historically used the term generic parallel algorithms to describe this set of features. By algorithm, they do not mean a specific computation like matrix multiplication, LU decomposition, or even something like std::find, but instead they mean common execution patterns. It has been argued by some reviewers of this book that these features would therefore be more accurately referred to as patterns and not algorithms. However, to align with the terminology that the TBB library has been using for many years, we refer to these features as generic parallel algorithms in this book.
We should have a strong preference for using these prewritten algorithms, whenever they apply, instead of writing our own implementations. The developers of TBB have spent years testing and improving their performance! The set of algorithms included in the TBB library do not, of course, exhaustively cover every possible scenario, but if one of them does match our processing pattern, we should use it. The algorithms provided by TBB capture the majority of the scalable parallelism in applications. In Chapter 8, we discuss design patterns for parallel programming, such as those described in Patterns for Parallel Programming by Mattson, Sanders and Massingill (Addison-Wesley), and how we can implement them using the TBB generic parallel algorithms.
TBB algorithms provide a powerful but relatively easy parallel model to apply because they can often be added incrementally and with a fairly local view of the code under consideration. We can look for the most time-consuming region of a program, add a TBB parallel algorithm to speed up that region, then look for the next most time-consuming region, add parallelism there, and so on.
It must be understood, however, that TBB algorithms do not guarantee parallel execution! Instead, they only communicate to the library that parallel execution is allowed. If we look at Figure 2-1 from a TBB perspective, this means that all of the worker threads may participate in executing parts of the computation, only a subset of threads may participate, or just the master thread may participate. Programs and libraries that treat parallelism as optional, as TBB does, are referred to as having relaxed sequential semantics.
A parallel program has sequential semantics if executing it using only a single thread does not change the program’s semantics. As we will note several times in this book though, the results of a sequential and parallel execution of a program may not always match exactly due to rounding issues and other sources of inexactness. We acknowledge these potential, nonsemantic differences by using the term relaxed sequential semantics. While models like TBB and the OpenMP API offer relaxed sequential semantics, some other models, such as MPI, let us write applications that have cyclic relationships that require parallel execution. The relaxed sequential semantics of TBB are an important part of what makes it useful for writing composable applications, as introduced in Chapter 1 and described in more detail in Chapter 9. For now, we should just remember that any of the algorithms described in this chapter will spread the work across one or more threads, but not necessarily all of the threads available in the system.
Lambda Expressions vs. User-Defined Classes
Since the first release of TBB predates the C++11 standard that introduced lambda expressions into the language, the TBB generic algorithms do not require the use of lambda expressions. Sometimes we can use the same interface with lambda expressions or with function objects (functors). In other cases, there are two sets of interfaces for an algorithm: a set that is more convenient with lambda expressions and a set that is more convenient with user-defined objects.
Often the choice between using a lambda expression or a user-defined object is simply a matter of preference.
Functional / Task Parallelism
Perhaps the simplest algorithm provided by the TBB library is parallel_invoke, a function that allows us to execute as few as two functions in parallel, or as many as we wish to specify:
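Roughly, its interface looks like the following (a sketch; recent versions of the library accept any number of callables through a variadic interface):

```cpp
template <typename Func0, typename Func1, typename... Funcs>
void parallel_invoke(const Func0& f0, const Func1& f1, const Funcs&... fs);
```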
The pattern name for this concept is map – which we will discuss more in Chapter 8 when we discuss patterns explicitly. The independence expressed by this algorithm/pattern allows it to scale very well, making it the preferred parallelism to use when we can apply it. We will also see that parallel_for, because the loop bodies must be independent, can be used to similar effect.
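For example, we could use parallel_invoke to run two independent sorts at the same time. The sketch below assumes a serialQuicksort function with a (begin pointer, end pointer) interface, as used later in this chapter:

```cpp
#include <tbb/parallel_invoke.h>
#include <vector>

void serialQuicksort(double* b, double* e);   // assumed serial implementation

void sortBoth(std::vector<double>& v1, std::vector<double>& v2) {
  tbb::parallel_invoke(
    [&]() { serialQuicksort(v1.data(), v1.data() + v1.size()); },
    [&]() { serialQuicksort(v2.data(), v2.data() + v2.size()); });
}
```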
If the two invocations of serialQuicksort execute for roughly the same amount of time and there are no resource constraints, this parallel implementation can be completed in half the time it takes to sequentially invoke the functions one after the other.
Note
We as developers are responsible for invoking functions in parallel only when they can be safely executed in parallel. That is, TBB will not automatically identify dependencies and apply synchronization, privatization, or other parallelization strategies to make the code safe. This responsibility is ours when we use parallel_invoke or any of the parallel algorithms we discuss in this chapter.
Using parallel_invoke is straightforward, but a single invocation of parallel_invoke is not very scalable. A scalable algorithm makes effective use of additional cores and hardware resources as they become available.
An algorithm shows strong scaling if it takes less time to solve a problem with a fixed size as additional cores are added. For example, an algorithm that shows good strong scaling may complete the processing of a given data set two times faster than the sequential algorithm when two cores are available but complete the processing of the same data set 100 times faster than the sequential algorithm when 100 cores are available.
An algorithm shows weak scaling if it takes the same amount of time to solve a problem as more processors are added, provided the data set size per processor stays fixed. For example, an algorithm that shows good weak scaling may be able to process twice as much data as its sequential version in a fixed period of time using two processors, and 100 times as much data in that same fixed period of time when using 100 processors.
Using a parallel_invoke to execute two sorts in parallel will demonstrate neither strong nor weak scaling, since the algorithm can at most make use of two processors. If we have 100 processors available, 98 of them will be idle because we have not given them anything to do. Instead of writing code like our example, we should develop scalable applications that allow us to implement parallelism once without the need to revisit the implementation each time new architectures containing more cores become available.
Luckily, TBB can handle nested parallelism efficiently (described in detail in Chapter 9), and so we can create scalable parallelism by using parallel_invoke in recursive divide-and-conquer algorithms (a pattern we discuss in Chapter 8). TBB also includes additional generic parallel algorithms, covered later in this chapter, to target patterns that have proven effective for achieving scalable parallelism, such as loops.
A Slightly More Complicated Example: A Parallel Implementation of Quicksort
We can develop a parallel implementation of quicksort as shown in Figure 2-5 by replacing the two recursive calls to serialQuicksort with a parallel_invoke; a sketch follows. In addition to the use of parallel_invoke, we also introduce a cutoff value: the original serial quicksort recursively partitions all the way down to arrays of a single element, but the parallel version stops recursing in parallel once a subarray becomes small and falls back to the serial sort.
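A sketch in the spirit of Figure 2-5 follows (not the exact listing: the element type, the cutoff value, and the use of std::partition for the serial shuffle are assumptions):

```cpp
#include <tbb/parallel_invoke.h>
#include <algorithm>
#include <cstddef>
#include <utility>

void serialQuicksort(double* b, double* e);            // assumed, as before

void parallelQuicksort(double* b, double* e) {
  const std::ptrdiff_t cutoff = 100;                   // below this size, sort serially
  if (e - b < cutoff) {
    serialQuicksort(b, e);
  } else {
    // serial shuffle: partition the elements around a pivot
    double pivot = *b;
    double* mid = std::partition(b + 1, e, [pivot](double v) { return v < pivot; });
    std::swap(*b, *(mid - 1));                         // put the pivot in its final place
    // sort the two partitions in parallel
    tbb::parallel_invoke(
      [=]() { parallelQuicksort(b, mid - 1); },
      [=]() { parallelQuicksort(mid, e); });
  }
}
```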
Note
Spawning and scheduling a TBB task is not free – a rule of thumb is that a task should execute for at least 1 microsecond or 10,000 processor cycles in order to mitigate the overheads associated with task creation and scheduling. We provide experiments that demonstrate and justify this rule of thumb in more detail in Chapter 16.
You may notice that this parallel implementation of quicksort has a big limitation: the shuffle is done completely serially. At the top level, this means we have an O(n) operation that runs on a single thread before any of the parallel work can begin, which can limit the speedup. We leave it to those who are interested to see how this limitation might be addressed by known parallel partitioning implementations (see the “For More Information” section at the end of this chapter).
Loops: parallel_for, parallel_reduce, and parallel_scan
For many applications, the execution time is dominated by time spent in loops. There are several TBB algorithms that express parallel loops, letting us quickly add scalable parallelism to the important loops in an application. The algorithms labeled as “Simple Loops” in Figure 2-2 are ones where the beginning and end of the iteration space can easily be determined by the time that the loop starts.
For example, we know there will be exactly N iterations in the following loop, so we classify it as a simple loop:
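(The loop below is a sketch; a, b, and c are placeholder arrays of length N.)

```cpp
for (int i = 0; i < N; ++i) {
  c[i] = a[i] + b[i];   // the iteration space [0, N) is known before the loop starts
}
```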
All of the simple loop algorithms in TBB are based on two important concepts, a Range and a Body. A Range represents a recursively divisible set of values. For a loop, a Range is typically the indices in the iteration space or the values that an iterator will take on as it iterates through a container. The Body is the function we apply to each value in the Range; in TBB, the Body is typically provided as a C++ lambda expression but can also be provided as a function object (see Lambda Expressions vs. User-Defined Classes).
parallel_for: Applying a Body to Each Element in a Range
Let’s start with a small serial for loop that applies a function to an element of an array in each iteration:
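(A sketch; the array a and the function f are placeholders.)

```cpp
float f(float x);       // assumed: some function applied to each element

void serialApply(float* a, int N) {
  for (int i = 0; i < N; ++i)
    a[i] = f(a[i]);
}
```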
We can create a parallel version of this loop by using a parallel_for:
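(A sketch using the compact index form of parallel_for; a and f are as in the serial version above.)

```cpp
#include <tbb/parallel_for.h>

void parallelApply(float* a, int N) {
  tbb::parallel_for(0, N, [=](int i) {
    a[i] = f(a[i]);     // each index may be executed by a different worker thread
  });
}
```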
When TBB executes a parallel_for, the Range is divided up into chunks of iterations. Each chunk, paired with a Body, becomes a task that is scheduled onto one of the threads that participate in executing the algorithm. The TBB library handles the scheduling of tasks for us, so all we need to do is to use the parallel_for function to express that the iterations of the loop should be executed in parallel. In later chapters, we discuss tuning the behavior of TBB parallel loops. For now, let us assume that TBB generates a good number of tasks for the range size and number of available cores. In most cases, this is a good assumption to make.
It is important to understand that by using a parallel_for, we are asserting that it’s safe to execute the iterations of the loop in any order and in parallel with each other. The TBB library does nothing to check that executing the iterations of a parallel_for (or in fact any of the generic algorithms) in parallel will generate the same results as a serial execution of the algorithm; it is our job as developers to be sure that this is the case when we choose to use a parallel algorithm. In Chapter 5, we discuss synchronization mechanisms in TBB that can be used to make some unsafe code safe. In Chapter 6, we discuss concurrent containers that provide thread-safe data structures that can also sometimes help us make code thread-safe. But ultimately, we need to ensure when we use a parallel algorithm that any potential changes in read and write access patterns do not change the validity of the results. We also need to ensure that we are using only thread-safe libraries and functions from within our parallel code.
For example, the following loop is not safe to execute as a parallel_for since each iteration depends on the result of the previous iteration. Changing the order of execution of this loop will alter the final values stored in the elements of array a:
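(A sketch of such a loop, consistent with the discussion that follows.)

```cpp
for (int i = 1; i < N; ++i) {
  a[i] = a[i-1] + 1;    // iteration i reads the value written by iteration i-1
}
```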
Imagine if the array a={1,0,0,0,...,0}. After executing this loop sequentially, it will hold {1,2,3,4,...,N}. But if the loop executes out-of-order, the results will be different. A mental exercise, when looking for loops that are safe to execute in parallel, is to ask yourself whether the results will be the same if the loop iterations are executed all at once, or in random order, or in reverse order. In this case, if a={1,0,0,0,...,0} and the iterations of the loop are executed in reverse order, a will hold {1,2,1,1,...,1} when the loop is complete. Obviously, execution order matters for this loop!
Formal descriptions of data dependence analysis are beyond the scope of this book but can be found in many compiler and parallel programming books, including High-Performance Compilers for Parallel Computing by Michael Wolfe (Pearson) and Optimizing Compilers for Modern Architectures by Allen and Kennedy (Morgan Kaufmann). Tools like Intel Inspector in Intel Parallel Studio XE can also be used to find and debug threading errors, including in applications that use TBB.
A Slightly More Complicated Example: Parallel Matrix Multiplication
The code in Figure 2-8 quickly gets us a basic parallel version of matrix multiply. While this is a correct parallel implementation, it will leave a lot of performance on the table because of the way it is traversing the arrays. In Chapter 16, we will talk about the advanced features of parallel_for that can be used to tune performance.
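A sketch along the lines of Figure 2-8 follows (not the exact listing: square N x N row-major matrices and the loop ordering are assumptions):

```cpp
#include <tbb/parallel_for.h>

void parallelMatMul(const double* A, const double* B, double* C, int N) {
  tbb::parallel_for(0, N, [=](int i) {        // parallelize over the rows of C
    for (int j = 0; j < N; ++j) {
      double sum = 0.0;
      for (int k = 0; k < N; ++k)
        sum += A[i*N + k] * B[k*N + j];       // strided access to B hurts locality
      C[i*N + j] = sum;
    }
  });
}
```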
parallel_reduce: Calculating a Single Result Across a Range
Another very common pattern found in applications is a reduction, commonly known as the “reduce pattern” or “map-reduce” because it tends to be used with a map pattern (see more about pattern terminology in Chapter 8).
A reduction computes a single value from a collection of values. Example applications include calculating a sum, a minimum value, or a maximum value.
Let’s consider a loop that finds the maximum value in an array:
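(A sketch of the serial loop; the element type is an assumption.)

```cpp
#include <algorithm>
#include <limits>

int serialMax(const int* a, int N) {
  int max_value = std::numeric_limits<int>::min();
  for (int i = 0; i < N; ++i)
    max_value = std::max(max_value, a[i]);
  return max_value;
}
```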
Computing a maximum from a set of values is an associative operation; that is, it’s legal to perform this operation on groups of values and then combine those partial results, in order, later. Computing a maximum is also commutative, so we do not even need to combine the partial results in any particular order.
For loops that perform associative operations, TBB provides the function parallel_reduce:
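In its lambda-friendly functional form, the interface looks roughly like this:

```cpp
template <typename Range, typename Value, typename Func, typename Reduction>
Value parallel_reduce(const Range& range, const Value& identity,
                      const Func& func, const Reduction& reduction);
```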
A complete description of the parallel_reduce interfaces is provided in Appendix B.
Many common mathematical operations are associative, such as addition, multiplication, computing a maximum, and computing a minimum. Some operations are associative in theory but are not associative when implemented on real systems due to limitations in numerical representations. We should be aware of the implications of depending on associativity for parallelism (see Associativity and floating-point types).
Associativity and Floating-Point Types
In computer arithmetic, it is not always practical to represent real numbers with exact precision. Instead, floating-point types such as float, double, and long double are used as approximations. The consequence of these approximations is that mathematical properties that hold for operations on real numbers do not necessarily hold for their floating-point counterparts. For example, while addition on real numbers is both associative and commutative, floating-point addition is not associative.
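As an illustration (a reconstruction of the kind of loop this sidebar discusses), consider repeatedly adding 1.0 to a float accumulator; with real-number arithmetic the result would be exactly N:

```cpp
#include <iostream>

int main() {
  const int N = 20000000;            // 20 million; also try 10 million
  float r = 0.0f;
  for (int i = 0; i < N; ++i)
    r += 1.0f;                       // once r reaches 16777216 (2^24), adding 1.0f rounds back down
  std::cout << (long)r << "\n";      // prints 10000000 for N == 10e6, but 16777216 for N == 20e6
}
```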
But there is a limited number of significant digits in the float representation and so not all integer values can be represented exactly. So, for example, if we run this loop with N == 10e6 (10 million), we will get an output of 10000000. But if we execute this loop with N == 20e6, we get an output of 16777216. The variable r simply cannot represent 16777217 since the standard float representation has a 24-bit mantissa (significand) and 16777217 requires 25 bits. When we add 1.0, the result rounds down to 16777216, and each subsequent addition of 1.0 also rounds down to 16777216. To be fair, at each step, the result of 16777216 is a good approximation of 16777217. It is the accumulation of these rounding errors that makes the final result so bad.
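If we instead accumulate the same values into two partial sums, tmp1 and tmp2, and add them together only once at the end, the output is very close to 20 million (a reconstruction of the sidebar's second example):

```cpp
#include <iostream>

int main() {
  const int N = 20000000;                       // 20 million
  float tmp1 = 0.0f, tmp2 = 0.0f;
  for (int i = 0; i < N/2; ++i) tmp1 += 1.0f;   // 10 million: exactly representable
  for (int i = N/2; i < N; ++i) tmp2 += 1.0f;   // 10 million: exactly representable
  float r = tmp1 + tmp2;                        // a single addition of two large values
  std::cout << (long)r << "\n";                 // prints 20000000
}
```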
Why? Because r can represent larger numbers, just not always exactly. The values in tmp1 and tmp2 are of similar magnitude, and therefore the addition impacts the available significant digits in the representation, and we get a result that is a good approximation of 20 million. This example is an extreme case of how associativity can change the results of a computation using floating-point numbers.
The take-away of this discussion is that when we use a parallel_reduce, it uses associativity to compute and combine partial results in parallel. So, we may get different results when compared to a serial implementation when using floating-point numbers. And in fact, depending on the number of participating threads, the implementation of parallel_reduce may choose to create a different number of partial results from run to run. Therefore, we may also get different results from run to run in the parallel implementation, even on the same input.
Before we panic and conclude that we should never use a parallel_reduce, we should keep in mind that implementations that use floating-point numbers generally result in an approximation. Getting different results on the same input does not necessarily mean that at least one of the results is wrong. It just means that the rounding errors accumulated differently for two different runs. It is up to us as developers to decide whether or not the differences matter for an application.
If we want to ensure that we at least get the same results on each run on the same input data, we can choose to use a parallel_deterministic_reduce as described in Chapter 16. This deterministic implementation always creates the same number of partial results and combines them in the same order for the same input, so the approximation will be the same from run to run.
As with all of the simple loop algorithms, to use a TBB parallel_reduce, we need to provide a Range (range) and Body (func). But we also need to provide an Identity Value (identity) and a Reduction Body (reduction).
To create parallelism for a parallel_reduce, the TBB library divides the range into chunks and creates tasks that apply func to each chunk. In Chapter 16, we discuss how to use Partitioners to control the size of the chunks that are created, but for now, we can assume that TBB creates chunks of an appropriate size to minimize overheads and balance load. Each task that executes func starts with a value init that is initialized with identity and then computes and returns a partial result for its chunk. The TBB library combines these partial results by calling the reduction function to create a single final result for the whole loop.
The identity argument is a value that leaves other values unchanged when they are combined with it using the operation that is being parallelized. It is well known that the identity element with respect to addition (additive identity) is “0” (since x + 0 = x) and that the identity element with respect to multiplication (multiplicative identity) is “1” (since x * 1 = x). The reduction function takes two partial results and combines them.
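Putting these pieces together, a parallel version of the max loop in the spirit of Figure 2-10 might look like the following sketch:

```cpp
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <algorithm>
#include <limits>

int parallelMax(const int* a, int N) {
  return tbb::parallel_reduce(
    /* range:     */ tbb::blocked_range<int>(0, N),
    /* identity:  */ std::numeric_limits<int>::min(),
    /* func:      */ [=](const tbb::blocked_range<int>& r, int init) -> int {
      for (int i = r.begin(); i != r.end(); ++i)    // i-loop over this chunk
        init = std::max(init, a[i]);
      return init;                                  // partial result for this chunk
    },
    /* reduction: */ [](int x, int y) { return std::max(x, y); });
}
```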
You may notice in Figure 2-10 that we use a blocked_range object for the Range, instead of just providing the beginning and ending of the range as we did with parallel_for. The parallel_for algorithm provides a simplified syntax that is not available with parallel_reduce. For parallel_reduce, we must pass a Range object directly, but luckily we can use one of the predefined ranges provided by the library, which include blocked_range, blocked_range2d, and blocked_range3d among others. These other range objects will be described in more detail in Chapter 16, and their complete interfaces are provided in Appendix B.
A blocked_range, used in Figure 2-10, represents a 1D iteration space. To construct one, we provide the beginning and ending value. In the Body, we use its begin() and end() functions to get the beginning and ending values of the chunk of values that this body execution has been assigned and then iterate over that subrange. In Figure 2-8, each individual value in the Range was sent to the parallel_for Body, and so there is no need for an i-loop to iterate over a range. In Figure 2-10, the Body receives a blocked_range object that represents a chunk of iterations, and therefore we still have an i-loop that iterates over the entire chunk assigned to it.
A Slightly More Complicated Example: Calculating π by Numerical Integration
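One common way to compute π by numerical integration is to integrate 4/(1 + x*x) over [0, 1]. The sketch below (not necessarily the book's exact listing) uses the midpoint rule, with parallel_reduce summing the contributions of each chunk of intervals:

```cpp
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <cstddef>

double parallelPi(std::size_t num_intervals) {
  const double dx = 1.0 / num_intervals;
  double sum = tbb::parallel_reduce(
    tbb::blocked_range<std::size_t>(0, num_intervals),
    0.0,
    [=](const tbb::blocked_range<std::size_t>& r, double init) -> double {
      for (std::size_t i = r.begin(); i != r.end(); ++i) {
        double x = (i + 0.5) * dx;                 // midpoint of interval i
        init += 4.0 / (1.0 + x * x);
      }
      return init;
    },
    [](double x, double y) { return x + y; });
  return sum * dx;
}
```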
As with parallel_for, there are advanced features and options that can be used with parallel_reduce to tune performance and to manage rounding errors (see Associativity and floating-point types). These advanced options are covered in Chapter 16.
parallel_scan: A Reduction with Intermediate Values
A scan (also called a prefix sum) computes, for an input sequence x0, x1, ..., xN, all of the running results:

y0 = x0
y1 = x0 + x1
...
yN = x0 + x1 + ... + xN
A serial loop that computes a running sum from a vector v follows:
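(A sketch of such a loop.)

```cpp
#include <cstddef>
#include <vector>

std::vector<int> runningSum(const std::vector<int>& v) {
  std::vector<int> y(v.size());
  int sum = 0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    sum += v[i];          // y[i] depends on every earlier element of v
    y[i] = sum;
  }
  return y;
}
```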
On the surface, a scan looks like an inherently serial algorithm: each prefix depends on the results computed in all of the previous iterations. Perhaps surprisingly, however, there are efficient parallel implementations of this seemingly serial algorithm, and the TBB parallel_scan algorithm provides one. Its interface requires that we provide a range, an identity value, a scan body, and a combine body:
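Roughly, the lambda-friendly form looks like this (parameter names follow the text):

```cpp
template <typename Range, typename Value, typename Scan, typename Combine>
Value parallel_scan(const Range& range, const Value& identity,
                    const Scan& scan, const Combine& combine);
```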
The range, identity value, and combine body are analogous to the range, identity value, and reduction body of parallel_reduce. And, as with the other simple loop algorithms, the range is divided by the TBB library into chunks and TBB tasks are created to apply the body (scan) to these chunks. A complete description of the parallel_scan interfaces is provided in Appendix B.
What is different about parallel_scan is that the scan body may be executed more than once on the same chunk of iterations – first in a pre-scan mode and then later in a final-scan mode.
In final-scan mode, the body is passed an accurate prefix result for the iteration that immediately precedes its subrange. Using this value, the body computes and stores the prefixes for each iteration in its subrange and returns the accurate prefix for the last element in its subrange.
However, when the scan body is executed in pre-scan mode, it receives a starting prefix value that is not the final value for the element that precedes its given range. Just like with parallel_reduce, a parallel_scan depends on associativity. In pre-scan mode, the starting prefix value may represent a subrange that precedes it, but not the complete range that precedes it. Using this value, it returns a (not yet final) prefix for the last element in its subrange. The returned value represents a partial result for the starting prefix combined with its subrange. By using these pre-scan and final-scan modes, it is possible to exploit useful parallelism in a scan algorithm.
How Does This Work?
Suppose the range has been divided into three consecutive chunks, A, B, and C. First, we compute the scan of A in final-scan mode, since it is the first set of values and so its prefix values will be accurate if it is passed an initial value of identity. At the same time that we start A, we start B in pre-scan mode. Once these two scans are done, we can calculate accurate starting prefixes for both B and C: to B we provide the final result from A (92), and to C we provide the final-scan result of A combined with the pre-scan result of B (92+136 = 228).
The combine operation takes constant time, so it is much less expensive than the scan operations. Unlike the sequential implementation, which takes three large steps applied one after the other, the parallel implementation executes the final-scan of A and the pre-scan of B in parallel, then performs a constant-time combine step, and then finally computes the final-scans of B and C in parallel. If we have at least two cores and N is sufficiently large, a parallel prefix sum that uses three chunks can therefore be computed in about two thirds of the time of the sequential implementation. And parallel_scan can of course use more than three chunks to take advantage of more cores.
A Slightly More Complicated Example: Line of Sight
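One way to express a line-of-sight computation with parallel_scan, following the formulation described by Blelloch (cited at the end of this chapter), is sketched below: a point along a ray from an observer is visible if the angle from the observer to it exceeds the running maximum of the angles to all closer points, and that running maximum is exactly a scan. The names and the precomputed angle array are assumptions, not the book's listing:

```cpp
#include <tbb/parallel_scan.h>
#include <tbb/blocked_range.h>
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

void lineOfSight(const std::vector<double>& angle,   // angle[i]: angle from observer to point i
                 std::vector<bool>& is_visible) {
  is_visible.assign(angle.size(), false);
  tbb::parallel_scan(
    tbb::blocked_range<std::size_t>(0, angle.size()),
    -std::numeric_limits<double>::infinity(),         // identity for a max-scan
    // scan body: computes the running maximum; marks visibility only in final-scan mode
    [&](const tbb::blocked_range<std::size_t>& r, double max_so_far, bool is_final_scan) -> double {
      for (std::size_t i = r.begin(); i != r.end(); ++i) {
        if (is_final_scan)
          is_visible[i] = angle[i] > max_so_far;       // visible if steeper than all closer points
        max_so_far = std::max(max_so_far, angle[i]);
      }
      return max_so_far;
    },
    // combine body: combines the maxima of two consecutive subranges
    [](double left, double right) { return std::max(left, right); });
}
```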
Cook Until Done: parallel_do and parallel_pipeline
For some applications, simple loops get us full coverage of the useful parallelism. But for others, we need to express parallelism in loops where the range cannot be fully computed before the loop starts. For example, consider a while loop:
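(A sketch; Image, getNextImage, and f stand in for the application's own types and functions.)

```cpp
Image* img;
while ((img = getNextImage()) != nullptr) {   // we cannot know the trip count in advance
  f(img);
}
```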
This loop keeps reading in images until there are no more images to read. After each image is read, it is processed by the function f. We cannot use a parallel_for because we don’t know how many images there will be and so cannot provide a range.
A more subtle case is when we have a container that does not provide random-access iterators:
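(A sketch; Image and f are as before, and getImages is a placeholder.)

```cpp
std::list<Image> my_images = getImages();     // std::list provides only bidirectional iterators
for (Image& img : my_images) {
  f(img);
}
```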
Note
In C++, an iterator is an object that points to an element in a range of elements and defines operators that provide the ability to iterate through the elements of the range. There are different categories of iterators including forward, bidirectional, and random-access iterators. A random-access iterator can be moved to point to any element in the range in constant time.
Because a std::list does not support random access to its elements, we can obtain the delimiters of the range my_images.begin() and my_images.end(), but we cannot get to elements in between these points without sequentially traversing the list. The TBB library therefore cannot quickly (in constant time) create chunks of iterations to hand out as tasks since it cannot quickly point to the beginning and ending points of these chunks.
To handle complex loops like these, the TBB library provides two generic algorithms: parallel_do and parallel_pipeline.
parallel_do: Apply a Body Until There Are No More Items Left
A TBB parallel_do applies a Body to work items until there are no more items to process. Some work items can be provided up front when the loop begins, and others can be added by Body executions as they are processing other items.
The parallel_do function has two interfaces, one that accepts a first and last iterator and another that accepts a container. A complete description of the parallel_do interfaces is provided in Appendix B. In this section, we will look at the version that receives a container:
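Roughly, the container form looks like this:

```cpp
template <typename Container, typename Body>
void parallel_do(Container& c, const Body& body);
```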
As a simple example, let us start with a std::list of std::pair<int, bool> elements, each of which contains a random integer value and false. For each element, we will calculate whether or not the int value is a prime number; if so, we store true to the bool value. We will assume that we are given functions that populate the container and determine if a number is prime. A serial implementation follows:
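(A sketch; isPrime uses simple trial division and stands in for the functions described above.)

```cpp
#include <list>
#include <utility>

bool isPrime(int v) {
  if (v < 2) return false;
  for (int d = 2; d * d <= v; ++d)
    if (v % d == 0) return false;
  return true;
}

void markPrimes(std::list<std::pair<int, bool>>& values) {
  for (auto& element : values)
    element.second = isPrime(element.first);
}
```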
The TBB parallel_do algorithm will safely traverse the container sequentially, while creating tasks to apply the body to each element. Because the container has to be traversed sequentially, a parallel_do is not as scalable as a parallel_for, but as long as the body is relatively large (> 100,000 clock cycles), the traversal overhead will be negligible compared to the parallel executions of the body on the elements.
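A parallel version in the spirit of Figure 2-17 might look like the following sketch:

```cpp
#include <tbb/parallel_do.h>
#include <list>
#include <utility>

void parallelMarkPrimes(std::list<std::pair<int, bool>>& values) {
  tbb::parallel_do(values,
    [](std::pair<int, bool>& element) {
      element.second = isPrime(element.first);   // isPrime as in the serial sketch
    });
}
```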
In addition to handling containers that do not provide random access, parallel_do also allows us to add additional work items from within the body executions. If bodies executing in parallel add new items, those items can be spawned in parallel too, sidestepping the scalability limitation of spawning all of the work from a single sequential traversal.
We can also use a parallel_do to process a tree of values in parallel, marking which values are prime, as shown in Figure 2-19; a sketch follows. To highlight the different ways to provide work items, in this implementation we use a container that holds only a single item: the root of the tree. The parallel_do therefore starts with only a single work item, but two items are added in each body execution, one to process the left subtree and the other to process the right subtree. We use the parallel_do_feeder.add method to add new work items to the iteration space. The class parallel_do_feeder is defined by the TBB library and is passed as the second argument to the body.
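A sketch in the spirit of Figure 2-19 follows (the tree node type is an assumption, and isPrime is the same helper as before):

```cpp
#include <tbb/parallel_do.h>
#include <vector>

struct TreeNode {
  int value;
  bool is_prime = false;
  TreeNode* left = nullptr;
  TreeNode* right = nullptr;
};

void parallelMarkPrimesInTree(TreeNode* root) {
  std::vector<TreeNode*> roots{root};            // a container holding a single work item
  tbb::parallel_do(roots,
    [](TreeNode* n, tbb::parallel_do_feeder<TreeNode*>& feeder) {
      n->is_prime = isPrime(n->value);
      if (n->left)  feeder.add(n->left);         // new work items discovered on the fly
      if (n->right) feeder.add(n->right);
    });
}
```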
We should note that the two uses we considered of parallel_do have the potential to scale for different reasons. The first implementation, without the feeder in Figure 2-17, can show good performance if each body execution has enough work to do to mitigate the overheads of traversing the list sequentially. In the second implementation, with the feeder in Figure 2-19, we start with only a single work item, but the number of available work items grows quickly as the bodies execute and add new items.
A Slightly More Complicated Example: Forward Substitution
A parallel implementation that uses parallel_do is shown in Figure 2-23. Here, we use the interface of parallel_do that allows us to specify a beginning and ending iterator, instead of an entire container. You can see the details of this interface in Appendix B.
In the call to parallel_do in Figure 2-23, we initially provide only the top-left block, [&top_left, &top_left+1). But in each body execution, the if-statements at the bottom decrement the atomic counters of the blocks that are dependent on the block that was just processed. If a counter reaches zero, that block has all of its dependencies satisfied and is provided to the feeder.
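The full forward substitution code is in Figure 2-23; the following is a much simplified sketch of just the counter-and-feeder mechanism, applied to a generic 2D wavefront over blocks (doBlock and the dependence structure are stand-ins, not the book's code):

```cpp
#include <tbb/parallel_do.h>
#include <atomic>
#include <vector>

struct BlockIndex { int r, c; };
void doBlock(int r, int c);                            // placeholder for the real block work

void wavefront(int num_blocks) {
  // each block (r, c) waits on the block above it and the block to its left
  std::vector<std::atomic<int>> counts(num_blocks * num_blocks);
  for (int r = 0; r < num_blocks; ++r)
    for (int c = 0; c < num_blocks; ++c)
      counts[r * num_blocks + c] = (r > 0) + (c > 0);  // number of unmet dependencies

  BlockIndex top_left{0, 0};
  tbb::parallel_do(&top_left, &top_left + 1,           // initially, just one item
    [&](const BlockIndex& b, tbb::parallel_do_feeder<BlockIndex>& feeder) {
      doBlock(b.r, b.c);
      // satisfy the dependencies of the block below and the block to the right
      if (b.r + 1 < num_blocks && --counts[(b.r + 1) * num_blocks + b.c] == 0)
        feeder.add(BlockIndex{b.r + 1, b.c});
      if (b.c + 1 < num_blocks && --counts[b.r * num_blocks + (b.c + 1)] == 0)
        feeder.add(BlockIndex{b.r, b.c + 1});
    });
}
```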
Like the previous prime number examples, this example demonstrates the hallmark of applications that use parallel_do: the parallelism is constrained by the need to sequentially access a container or by the need to dynamically find and feed work items to the algorithm.
parallel_pipeline: Streaming Items Through a Series of Filters
The second generic parallel algorithm in TBB used to handle complex loops is parallel_pipeline. A pipeline is a linear sequence of filters that transform items as they pass through them. Pipelines are often used to process data that stream into an application such as video or audio frames, or financial data. In Chapter 3, we will discuss the Flow Graph interfaces that let us build more complex graphs that include fan-in and fan-out to and from filters.
As illustrated in Figure 2-25(b), in the steady state, each filter is busy, and their executions are overlapped. However, as shown in Figure 2-25(c), unbalanced filters decrease speedup. The performance of a pipeline of serial filters is limited by the slowest serial stage.
A complete description of the parallel_pipeline interfaces is provided in Appendix B. The interface of parallel_pipeline we use in this section is shown as follows:
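Roughly (in the namespace tbb, using the classic TBB spelling):

```cpp
void parallel_pipeline(size_t max_number_of_live_tokens,
                       const filter_t<void, void>& filter_chain);
```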
The first argument, max_number_of_live_tokens, is the maximum number of items that will be allowed to flow through the pipeline at any given time. This value is necessary to constrain resource consumption. For example, consider a simple three-filter pipeline. What if the middle filter is a serial filter and it takes 1000 times longer than the first filter, which allocates new buffers? The first filter might allocate 1000 buffers only to have them queue up before the second filter, wasting a lot of memory.
The second argument to parallel_pipeline is filter_chain, a series of filters created by concatenating filters that are created using the make_filter function:
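Roughly (again in the classic TBB spelling; newer oneTBB releases use a filter_mode enumeration instead):

```cpp
template <typename T, typename U, typename Func>
filter_t<T, U> make_filter(filter::mode mode, const Func& f);
```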
The template arguments T and U specify the input and output types of the filter. The mode argument can be serial_in_order, serial_out_of_order, or parallel. And the f argument is the body of the filter. Figure 2-27 shows the implementation of the case change example using a TBB parallel_pipeline. A more complete description of the parallel_pipeline interfaces is provided in Appendix B.
In this implementation, the first and last filters are created using the serial_in_order mode. This specifies that both filters should run on only one item at a time and that the last filter should execute the items in the same order that the first filter generated them in. A serial_out_of_order filter is allowed to execute the items in any order. The middle filter is passed parallel as its mode, allowing it to execute on different items in parallel. The modes supported by parallel_pipeline are described in more detail in Appendix B.
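The following sketch shows one way such a pipeline could be written (not the exact listing of Figure 2-27). The I/O details are assumptions: this version reads lines from standard input, swaps the case of each character in parallel, and writes the lines back out in their original order. The header and mode names follow classic TBB; newer oneTBB releases spell the modes tbb::filter_mode:: instead:

```cpp
#include <tbb/pipeline.h>
#include <cctype>
#include <iostream>
#include <string>

void caseChangePipeline() {
  tbb::parallel_pipeline(
    /* max_number_of_live_tokens = */ 8,                 // illustrative token limit
    // first filter: serial, in order; reads one line per invocation
    tbb::make_filter<void, std::string>(tbb::filter::serial_in_order,
      [](tbb::flow_control& fc) -> std::string {
        std::string line;
        if (!std::getline(std::cin, line)) {
          fc.stop();                                      // no more input: end the pipeline
          return std::string{};
        }
        return line;
      }) &
    // middle filter: parallel; swaps the case of each character
    tbb::make_filter<std::string, std::string>(tbb::filter::parallel,
      [](std::string line) -> std::string {
        for (char& c : line)
          c = std::isupper(static_cast<unsigned char>(c))
                ? std::tolower(static_cast<unsigned char>(c))
                : std::toupper(static_cast<unsigned char>(c));
        return line;
      }) &
    // last filter: serial, in order; writes lines in their original order
    tbb::make_filter<std::string, void>(tbb::filter::serial_in_order,
      [](std::string line) {
        std::cout << line << "\n";
      }));
}
```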
A Slightly More Complicated Example: Creating 3D Stereoscopic Images
The TBB parallel_pipeline function imposes a linearization of the pipeline filters. The filters are applied one after the other as the input from the first stage flows through the pipeline. This is in fact a limitation for this sample. The processing of the left and right images is independent until the mergeImageBuffers filter, but because of the interface of parallel_pipeline, the filters must be linearized. Even so, only the filters that read in the images are serial filters, and therefore this implementation can still be scalable if the execution time is dominated by the later, parallel stages.
In Chapter 3, we introduce the TBB Flow Graph, which will allow us to more directly express applications that benefit from nonlinearized execution of filters.
Summary
This chapter offered a basic overview of the generic parallel algorithms provided by the TBB library, including patterns that capture functional parallelism, simple and complex loops, and pipeline parallelism. These prepackaged algorithms (patterns) provide well-tested and tuned implementations that can be applied incrementally to an application to improve performance.
The code shown in this chapter provides small examples that show how these algorithms can be used. In Part 2 of this book (starting with Chapter 9), we discuss how to get the most out of TBB by combining these algorithms in composable ways and tuning applications using the library features available for optimizing locality, minimizing overheads, and adding priorities. Part 2 of the book also discusses how to deal with exception handling and cancellation when using the TBB generic parallel algorithms.
We continue in the next chapter by taking a look at another one of TBB’s high-level features, the Flow Graph.
For More Information
We discussed design patterns for parallel programming and how these relate to the TBB generic parallel algorithms. A collection of design patterns can be found in
Timothy Mattson, Beverly Sanders, and Berna Massingill, Patterns for Parallel Programming (First ed.), 2004, Addison-Wesley Professional.
When discussing the parallel implementation of quicksort, we noted that the partitioning was still a serial bottleneck. Papers that discuss parallel partitioning implementations include
P. Heidelberger, A. Norton and J. T. Robinson, “Parallel Quicksort using fetch-and-add,” in IEEE Transactions on Computers, vol. 39, no. 1, pp. 133-138, Jan 1990.
P. Tsigas and Y. Zhang. A simple, fast parallel implementation of quicksort and its performance evaluation on SUN enterprise 10000. In 11th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP 2003), pages 372–381, 2003.
You can learn more about data dependence analysis in a number of compiler or parallel programming books, including
Michael Joseph Wolfe, High-Performance Compilers for Parallel Computing, 1995, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Ken Kennedy and John R. Allen, Optimizing Compilers for Modern Architectures, 2001, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
When we discussed matrix multiplication, we noted that unless we are optimization gurus, we should typically prefer to use prepackaged implementations of linear algebra kernels when available.
Such packages include
The Basic Linear Algebra Subprograms (BLAS) at www.netlib.org/blas/
The Intel Math Kernel Library (Intel MKL) at https://software.intel.com/mkl
Automatically Tuned Linear Algebra Software (ATLAS) found at http://math-atlas.sourceforge.net/
The FLAME project researches and develops dense linear algebra libraries. Their BLIS software framework can be used to create high-performance BLAS libraries. The FLAME project can be found at www.cs.utexas.edu/~flame .
The line of sight example in this chapter was implemented using parallel scan based on the description provided in
Vector Models for Data-Parallel Computing, Guy E. Blelloch (The MIT Press).
The photograph used in Figures 2-28a, 2-29, and 3-7, was taken by Elena Adams, and is used with permission from the Halide project’s tutorials at http://halide-lang.org .
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if you modified the licensed material. You do not have permission under this license to share adapted material derived from this chapter or parts of it.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.