One of the things that we like the most from TBB is its “multiresolution” nature. In the context of parallel programming models, multiresolution means that we can choose among different levels of abstraction to code our algorithm. In TBB, we have high-level templates, such as parallel_for or pipeline (see Chapter 2), that are ready to use when our algorithms fit into these particular patterns. But what if our algorithm is not that simple? Or what if the available high-level abstraction is not squeezing out the last drop of performance of our parallel hardware? Should we just give up and remain prisoners of the high-level features of the programming model? Of course not! There should be a capability to get closer to the hardware, a way to build our own templates from the ground up, and a way to thoroughly optimize our implementation using low-level and more tunable characteristics of the programming model. And in TBB, this capability exists. In this chapter, we will focus on one of the most powerful low-level features of TBB, the task programming interface. As we have said throughout the book, tasks are at the heart of TBB, and tasks are the building blocks used to construct the high-level templates such as parallel_for and pipeline. But there is nothing that prevents us from venturing into these deeper waters and starting to code our algorithms directly with tasks, from building our own high-level templates for future use on top of tasks, or as we describe in the next chapters, from fully optimizing our implementation by fine-tuning the way in which tasks are executed. In essence, this is what you will learn by reading this chapter and the ones that follow. Enjoy the deep dive!
A Running Example: The Fibonacci Sequence
Task-based TBB implementations are particularly appropriate for algorithms in which a problem can be recursively divided into smaller subproblems following a tree-like decomposition. There are plenty of problems like these. The divide-and-conquer and branch-and-bound parallel patterns (Chapter 8) are examples of classes of such algorithms. If the problem is big enough, it usually scales well on a parallel architecture because it is easy to break it into enough tasks to fully utilize the hardware and avoid load imbalance.
Mathematically, the nth number in the sequence, Fn, can be computed recursively as
Fn = Fn-1 + Fn-2

with base cases F0 = 0 and F1 = 1. This naive recursion is not the fastest way to compute Fn: a much faster method is to raise the 2×2 matrix [[1, 1], [1, 0]] to the (n−1)th power and take the upper-left element. The matrix exponentiation can be done quickly via repeated squaring. But we’ll go ahead in this section and use the classic recursion example for teaching purposes.
The if (n<2) line at the beginning of the serial code of Figure 10-1 handles what is called the base case, which is always needed in recursive code to avoid infinite recursion. That is nice, because we don’t want to blow up the stack, do we?
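Figure 10-1 is not reproduced here, but based on the description, the serial code is essentially the classic recursion (a minimal sketch):

```cpp
long fib(long n) {
  if (n < 2) return n;           // base case: F(0)=0 and F(1)=1
  return fib(n-1) + fib(n-2);    // recursive case
}
```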
We will parallelize this first sequential implementation using different task-based approaches, from simpler to more elaborate and optimized versions. The lessons we learn with these examples can be mimicked in other tree-like or recursive algorithms, and the optimizations we show can also be put to work to make the most out of our parallel architecture in similar situations.
The High-Level Approach: parallel_invoke
If after looking at Figure 10-5 you think that it is silly to compute fib(29) in three different tasks and fib(28) in two additional ones, you are right: it is silly! As a disclaimer, we already said that this is not the optimal implementation but a commonly used recursive example that serves our educational interests. A clear optimization would be to implement the recursion in a manner such that already computed Fibonacci numbers are not recomputed, thus achieving O(n) complexity, but this is not our goal today.
You may also be thinking, after looking at Figure 10-4, why in the world we are once again revisiting the parallel_invoke that was already covered way back in Chapter 2. Have we really reached the second, more advanced, section of this book? Yes! Well… where are the advanced features, the low-level knobs that we may need and the optimization opportunities that we love??? Okay, let’s start diving into deeper waters!!!
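Figure 10-4 is not reproduced here; a minimal sketch of that parallel_invoke version, assuming the serial fib above and a cutoff value below which we fall back to it, might look like this:

```cpp
#include <tbb/parallel_invoke.h>

long parallel_fib(long n) {
  if (n < cutoff) return fib(n);       // below the cutoff, run serially
  long x, y;
  tbb::parallel_invoke(
    [&]{ x = parallel_fib(n-1); },     // compute F(n-1) in one task
    [&]{ y = parallel_fib(n-2); });    // compute F(n-2) in another
  return x + y;                        // implicit barrier: both tasks are done here
}
```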
The Highest Among the Lower: task_group
Apparently, this is just a more verbose way of implementing the code of Figure 10-4, where we used parallel_invoke. However, we would like to underscore that, unlike the parallel_invoke alternative, we now have a handle to a group of tasks, g, and as we will discuss later, this enables some additional possibilities, such as task cancellation. Also, by explicitly calling the member functions g.run() and g.wait(), we spawn the new tasks and wait for them to finish their computation at two different program points, whereas the parallel_invoke function has an implicit task barrier after the task spawning. To begin with, this separation between run() and wait() allows the caller thread to do some computation between spawning some tasks and waiting for them in the blocking call wait(). In addition, this class also offers other interesting member functions that can come in handy in some situations:
void run_and_wait( const Func& f ), which is equivalent to {run(f); wait();} but guarantees that f runs on the current thread. We will see later (in the section “The Low-Level Task Interface: Part Two – Task Continuation”) that there is a convenient trick to bypass the TBB scheduler. If we first call run(f), we basically spawn a task that gets enqueued in the worker thread's local queue. When we later call wait(), we invoke the scheduler, which dequeues the just-enqueued task if nobody has stolen it in the meantime. The purpose of run_and_wait is twofold: first, we avoid the overhead of the enqueue-schedule-dequeue steps, and second, we avoid the potential stealing that can happen while the task sits in the queue.
void cancel(), which cancels all tasks in this task_group. Maybe the computation was triggered from a user interface, UI, that also includes a “cancel” button. If the user now presses this button, there is a way to stop the computation. In Chapter 15, we further elaborate on cancellation and exception handling.
task_group_status wait(), which returns the final status of the task group. Return values can be: complete (all tasks in the group have finished), canceled (the task_group received a cancellation request), or not_complete (not all tasks in the group have completed).
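The task_group implementation itself (Figure 10-6) is not reproduced here; a minimal sketch consistent with its description, again assuming fib and cutoff, could be:

```cpp
#include <tbb/task_group.h>

long parallel_fib(long n) {
  if (n < cutoff) return fib(n);
  long x, y;
  tbb::task_group g;                       // a handle to this group of tasks
  g.run([&]{ x = parallel_fib(n-1); });    // spawn the first child task
  g.run([&]{ y = parallel_fib(n-2); });    // spawn the second child task
  g.wait();                                // block until both have finished
  return x + y;
}
```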
Note that, in our parallel implementation of Figure 10-6, each call to parallel_fib creates a new task_group, so it is possible to cancel one branch without affecting the others, as we will see in Chapter 15. Relying on a single task_group would also pose an additional downside: creating too many tasks in that single group serializes task creation, with the ensuing loss of scalability. Consider, for example, that we are tempted to write code like this:
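The code in question is not reproduced above; it boils down to a loop along these lines, where foo stands for the per-iteration work mentioned below:

```cpp
tbb::task_group g;
for (int i = 0; i < n; ++i)
  g.run([=]{ foo(i); });   // all n tasks are spawned, one by one, by this single thread
g.wait();
```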
As we can see, n tasks will be spawned one after another, all by the same thread. The other worker threads will be forced to steal every task created by the one executing g.run(). This will certainly kill performance, especially if foo() is a fine-grained task and the number of worker threads, nth, is high. The recommended alternative is the one used in Figure 10-6, where tasks are deployed recursively. In that approach, the worker threads steal at the beginning of the computation, and ideally, in log2(nth) steps all nth worker threads are working on their own tasks, which in turn enqueue more tasks in their local queues. For example, for nth=4, the first thread, A, spawns two tasks and starts working on one while thread B steals the other. Now, threads A and B spawn two tasks each (four in total) and start working on two of them, but the other two are stolen by threads C and D. From then on, all four threads are working and enqueueing more tasks in their local queues, stealing again only when they run out of local tasks.
Beware! Enter at your own risk: the low-level tasking interface
The task class has lots of features, which means there are lots of ways to make mistakes too. If the required parallel pattern is a common one, there is certainly an already available high-level template, implemented and optimized by clever developers on top of the tasking interface. That high-level algorithm is the recommended way to go in most cases. The rest of the chapter therefore serves two goals. First, it provides you with the means to develop your own task-based parallel algorithm or high-level template if the ones already provided by TBB do not suit your needs. Second, it uncovers the low-level details of the TBB machinery so that you can understand some optimizations and tricks that will be mentioned in future chapters. For example, later chapters will refer back to this one to explain how parallel_pipeline and the Flow Graph can better exploit locality thanks to a scheduler-bypass technique. Here, we explain how this technique works and why it is beneficial.
The Low-Level Task Interface: Part One – Task Blocking
1. Allocate space for the task. Tasks must be allocated by special member functions so that the space can be efficiently recycled when the task completes. Allocation is done by a special overloaded new and the task::allocate_root member function. The _root suffix in the name denotes the fact that the created task has no parent: it is the root of a task tree.
2. Construct the task with the constructor FibTask{n,&sum} (the task definition is presented in the next figure), invoked by new. When the task is run in step 3, it computes the nth Fibonacci number and stores it into sum.
3. Run the task to completion with task::spawn_root_and_wait.
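The code figure for parallel_fib is not reproduced here; a sketch consistent with the three steps above, using the classic tbb::task API (tbb/task.h), might be:

```cpp
#include <tbb/task.h>

long parallel_fib(long n) {
  long sum;
  // 1. Allocate space for the root task and 2. construct it with FibTask{n,&sum}
  FibTask& a = *new(tbb::task::allocate_root()) FibTask{n, &sum};
  // 3. Run it to completion (and wait for all of its descendants)
  tbb::task::spawn_root_and_wait(a);
  return sum;
}
```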
The real work is done inside the class FibTask defined in Figure 10-8. This is a relatively large piece of code compared to fib and the two previous parallel implementations of parallel_fib. As we warned, this is a lower-level implementation, and as such it is not as productive or friendly as a high-level abstraction. To make up for the extra burden, we will see later how this class allows us to get our hands dirty, tweaking under the hood and tuning behavior and performance at will.
Like all tasks scheduled by TBB, FibTask is derived from the class tbb::task. The fields n and sum hold the input value and the pointer to the output, respectively. These are initialized with the arguments passed to the constructor FibTask(long n_, long *sum_).
The execute member function does the actual computation. Every task must provide a definition of execute that overrides the pure virtual member function tbb::task::execute. The definition should do the work of the task and return either nullptr or a pointer to the next task to run, as we saw in Figure 9-14. In this simple example, it returns nullptr.
1. Checks whether n<cutoff and resorts to the sequential version in this case.
2. Otherwise, the else branch is taken. The code creates and runs two child tasks that compute Fn-1 and Fn-2, respectively. Here, the inherited member function allocate_child() is used to allocate space for the task. Remember that the top-level routine parallel_fib used allocate_root() to allocate space for a task. The difference is that here the task is creating child tasks. This relationship is indicated by the choice of allocation method. The different allocation methods are listed in Appendix B, Figure B-76.
3. Calls set_ref_count(3). The number 3 represents the two children and an additional implicit reference that is required by the member function spawn_and_wait_for_all. The set_ref_count member function initializes the ref_count attribute that every TBB task has. Each time a child finishes its computation, it decrements the ref_count attribute of its parent. Make sure to call set_ref_count(k+1) before spawning the k children if the task uses wait_for_all to be resumed after the children complete. Failure to do so results in undefined behavior. The debug version of the library usually detects and reports this type of error.
4. Spawns two child tasks. Spawning a task indicates to the scheduler that it can run the task whenever it chooses, possibly in parallel with executing other tasks. The first spawning, by the tbb::task::spawn(b) member function, returns immediately without waiting for the child task to start executing. The second spawning, by the member function tbb::task::spawn_and_wait_for_all(a), is equivalent to tbb::task::spawn(a); tbb::task::wait_for_all();. The latter causes the parent to wait until all currently allocated child tasks are finished. For this reason, we say this implementation follows what we call a task blocking style.
5. After the two child tasks complete, the ref_count attribute of the parent task has been decremented twice and its value is now one. This causes the parent task to resume just after the spawn_and_wait_for_all(a) call, so it computes x+y and stores it in *sum.
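Figure 10-8 itself is not reproduced here; a minimal sketch of FibTask consistent with the five steps just described (with fib and cutoff as before) might look like this:

```cpp
#include <tbb/task.h>

class FibTask : public tbb::task {
public:
  long n;        // input value
  long* sum;     // where to write the result
  FibTask(long n_, long* sum_) : n{n_}, sum{sum_} {}
  tbb::task* execute() override {          // overrides the pure virtual task::execute
    if (n < cutoff) {
      *sum = fib(n);                       // base case: resort to the serial version
    } else {
      long x, y;                           // stack storage for the children's results
      FibTask& a = *new(allocate_child()) FibTask{n-1, &x};
      FibTask& b = *new(allocate_child()) FibTask{n-2, &y};
      set_ref_count(3);                    // two children + one extra for the wait
      spawn(b);                            // b may run in parallel...
      spawn_and_wait_for_all(a);           // ...while we run a and block here
      *sum = x + y;                        // children are done: combine their results
    }
    return nullptr;
  }
};
```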
The member function execute() continues by setting the ref_count of the task to 3, spawning first b and then a, and waiting for both. At this point, the root task is suspended until it has no pending children. Remember that this is the task blocking style. The worker thread returns to the scheduler, where it will first dequeue task a (because it was enqueued last). This task a (FibTask(7,&x)) will recursively repeat the same process, suspending itself after allocating a new x and y on its stack and spawning FibTask(6,&x) and FibTask(5,&y). Since cutoff=7, these two new tasks resort to the base case and call fib(6) and fib(5), respectively.

FibTask(6,&x) is dequeued first, writes 8 to *sum (where sum points to x in FibTask(7)'s stack), and returns nullptr. Then FibTask(6,&x) is destroyed, but in the process the ref_count variable of its parent task (FibTask(7,&x)) is first decremented. The worker thread then dequeues FibTask(5,&y), which writes 5 to *sum (now an alias of y on the stack) and also returns nullptr. This brings ref_count down to 1, which allows the parent task FibTask(7,&x) to resume; it just has to add 5+8, write the result to *sum (an alias of x in FibTask(8)'s stack), and return nullptr. This decrements the ref_count of the root task to 2. Next, the worker thread dequeues FibTask(6,&y), which calls fib(6), writes y=8 on the stack, returns, and dies. This finally leaves the root task without children (ref_count=1), so it can continue execution just after the spawn_and_wait_for_all() member function, add 8+13, write the result to *sum (an alias of sum in the stack of parallel_fib), and get destroyed.

If you are exhausted after reading the description of this whole process, so are we, but there is a bit more, so hold on for one more second. Now, imagine that there is more than one worker thread. Each one will have its own stack and fight to steal tasks. The result, 21, will be the same and, in essence, the same tasks will be executed, though now we don’t know which thread will take care of each task. What we do know is that if the problem size and the number of tasks are large enough and the cutoff is wisely set, this parallel code will run faster than the sequential one.
Note
As we have seen, the TBB work-stealing scheduler evaluates a task graph. The graph is a directed graph where each node is a task. Each task points to its parent, also called successor, which is another task that is waiting on it to complete. If a task has no parent/successor, its parent reference points to nullptr. Method tbb::task::parent() gives you read-only access to the successor pointer. Each task has a ref_count that accounts for the number of tasks that have it as a successor (i.e., the number of children that the parent has to wait for before it can be triggered for execution).
And where are the much-vaunted knobs and tuning possibilities? Well, it is true that the code based on low-level tasks that we just discussed does pretty much the same as what we already implemented with the parallel_invoke and task_group classes, but at a higher programming cost. So where is the bang for the buck? The task class has more member functions that will be covered soon, and the implementation discussed in this section is just the foundation on which more optimized versions will be built. Stick with us and keep reading.
The Low-Level Task Interface: Part Two – Task Continuation
The task blocking style that we just presented can pose a problem if the body of the task requires many local variables. These variables sit on the stack and stay there until the task is destroyed. But the task is not destroyed until all its children are done. This is a potential showstopper if the problem is very large and it is difficult to find a cutoff value without limiting the amount of available parallelism. This can happen when facing branch-and-bound problems, which are used to find an optimal value by wisely traversing a search space following a tree-based strategy. There are cases in which the tree can be very large, unbalanced (some tree branches are deeper than others), and of unknown depth. Using the blocking style for these problems can easily result in an explosion of the number of tasks and excessive use of stack space.
Another subtle inconvenience of the blocking style is due to the management of the worker thread that encounters the wait_for_all() call in a parent task. There is no point in wasting this worker thread waiting for the child tasks to finish, so we entrust it with the execution of other tasks. This means that when the parent task is ready to run again, the original worker thread that was taking care of it may be busy with other duties and cannot respond immediately.
Note
Continuation, continuation, continuation!!! The authors of TBB, and other parallelism experts, love to encourage continuation-style programming. Why??? It turns out that using it can be the difference between a working program that is relatively easy to write and one that crashes from stack overflow. Worse yet, without continuations, the code needed to avoid such crashes can be hard to understand and gives parallel programming a bad name. Fortunately, TBB is designed to use continuations and encourages us to use continuations by default. Flow Graph (Chapters 3 and 17) encourages use of continue_node (and other nodes with potential for scheduler bypass). The power of continuations (and task recycling, which we cover next) is worth knowing as a parallel programmer – you’ll never want to let a task sit around waiting again (and wasting precious resources)!
In Figure 10-11, besides the base case, which does not change, we see in the else branch that the local variables x and y are no longer declared (they have been commented out). However, there is now a new task c of type FibCont&. This task is allocated using the allocate_continuation() function, which is similar to allocate_child() except that it transfers the parent reference of the calling task (this) to c and sets the parent attribute of this to nullptr. The reference count, ref_count, of this’s parent does not change, since the parent still has the same number of children, although one of the children has suddenly mutated from the FibTask type to the FibCont one. If you are a happy parent, don’t try this at home!
The new tasks a and b are now children of c (not of this) because we allocate them using c.allocate_child() instead of just allocate_child(). In other words, c is now the successor of both a and b.
The results of the children are not written into stack-stored variables anymore. When initializing a, the constructor invoked is now FibTask(n-1,&c.x), so the pointer sum in the child task a (a.sum) actually points to c.x. Likewise, b.sum points to c.y.
The reference count of c (sort of an internal and private c.ref_count) is only set to two (c.set_ref_count(2)) since FibCont c has actually only two children (a and b).
Now the child tasks a and b are ready to be spawned, and with that, FibTask’s duties are done. It can die in peace, and the memory it was occupying can be safely reclaimed. R.I.P.
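The figures with the continuation-passing code are not reproduced here; a minimal sketch consistent with the description above (with fib and cutoff as before) might look like this:

```cpp
// Continuation task: it only adds the two partial results once they are ready
struct FibCont : public tbb::task {
  long x, y;            // filled in by the children
  long* const sum;
  FibCont(long* sum_) : sum{sum_} {}
  tbb::task* execute() override {
    *sum = x + y;       // runs only when its ref_count drops to zero
    return nullptr;
  }
};

class FibTask : public tbb::task {
public:
  long n;
  long* sum;
  FibTask(long n_, long* sum_) : n{n_}, sum{sum_} {}
  tbb::task* execute() override {
    if (n < cutoff) { *sum = fib(n); }
    else {
      // long x, y;  // no longer needed: the results now live inside the continuation
      FibCont& c = *new(allocate_continuation()) FibCont{sum};
      FibTask& a = *new(c.allocate_child()) FibTask{n-1, &c.x};
      FibTask& b = *new(c.allocate_child()) FibTask{n-2, &c.y};
      c.set_ref_count(2);   // exactly two children, no extra "+1": nobody waits here
      spawn(b);
      spawn(a);
    }
    return nullptr;         // this FibTask dies right away; c carries on
  }
};
```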
Note
As we mentioned in the previous section, when following the blocking style, if a task A spawns k children and waits for them using the wait_for_all member function, A.ref_count has to be set to k+1. The extra “1” accounts for the additional work that task A has to finish before ending and dispatching A’s parent. This extra “1” is not needed when following the continuation-passing style because A transfers the additional work to the continuation task C. In this case, C.ref_count has to be set exactly to k if it has k children.
You may be wondering whether the parallel_fib function is still waiting in spawn_root_and_wait(a) on the first root task that was created, since this original FibTask was replaced by the first FibCont object and then died (see Figure 10-12). Well, indeed, parallel_fib is still waiting, because spawn_root_and_wait is designed to work correctly with the continuation-passing style. An invocation of spawn_root_and_wait(x) does not actually wait for x to complete. Instead, it constructs a dummy successor of x and waits for the successor’s ref_count to become 0. Because allocate_continuation forwards the parent reference to the continuation, the dummy successor’s ref_count is not decremented until the continuation FibCont completes. This is illustrated in Figure 10-14.
Bypassing the Scheduler
When execute() returns a pointer to a task, the scheduler runs that task next on the same thread, skipping the deque. To see what this saves, consider what spawning a task a and then returning nullptr from execute() implies:

Push task a onto the thread’s deque.

Return from the member function execute().

Pop task a from the thread’s deque (unless it is stolen by another thread in the meantime).
Bypassing the scheduler amounts to commenting out the spawn and returning a pointer to the task instead:

spawn(a); return nullptr;   ➜   /*spawn(a);*/ return &a;
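Applied to the continuation-passing FibTask sketched above, the whole execute() then becomes (again a sketch):

```cpp
  // FibTask::execute(), continuation-passing version with scheduler bypass
  tbb::task* execute() override {
    if (n < cutoff) { *sum = fib(n); return nullptr; }
    FibCont& c = *new(allocate_continuation()) FibCont{sum};
    FibTask& a = *new(c.allocate_child()) FibTask{n-1, &c.x};
    FibTask& b = *new(c.allocate_child()) FibTask{n-2, &c.y};
    c.set_ref_count(2);
    spawn(b);        // one child is spawned as usual
    // spawn(a);     // the other one is not spawned...
    return &a;       // ...it is returned, so this thread executes it next
  }
```

Task a is now executed right away by the same worker thread, before it can be stolen, which also helps locality.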
The Low-Level Task Interface: Part Three – Task Recycling
The call to recycle_as_child_of(c) has two effects:

It marks this not to be automatically destroyed when execute() returns.
It sets the successor of this to be c. To prevent reference-counting problems, recycle_as_child_of has a prerequisite that this must have a nullptr successor (this’s parent reference should point to nullptr). This is the case after allocate_continuation occurs.
Member variables have to be updated to mimic what was previously implemented using the constructor FibTask(n-1,&c.x). In this case, this->n is decremented (n -= 1;), and this->sum is initialized to point to c.x.
When recycling, ensure that this’s member variables are not used in the current execution of the task after the recycled task is spawned. This is the case in our example since the recycled task is actually not spawned and will only run after returning the pointer this. You can spawn the recycled task instead (i.e., spawn(*this); return nullptr;), as long as none of its member variables is used after the spawning. This restriction applies even to const member variables, because after the task is spawned, it might run and be destroyed before the parent progresses any further. A similar member function, task::recycle_as_continuation(), recycles a task as a continuation instead of as a child.
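Putting these steps together, a sketch of the recycling version of execute() (with FibCont, fib, and cutoff as before) could be:

```cpp
  // FibTask::execute(), recycling version
  tbb::task* execute() override {
    if (n < cutoff) { *sum = fib(n); return nullptr; }
    FibCont& c = *new(allocate_continuation()) FibCont{sum};
    FibTask& b = *new(c.allocate_child()) FibTask{n-2, &c.y};
    recycle_as_child_of(c);   // reuse this task instead of allocating a second child
    n -= 1;                   // this task will now compute F(n-1)...
    sum = &c.x;               // ...and write its result into c.x
    c.set_ref_count(2);       // c waits for two children: b and the recycled this
    spawn(b);
    return this;              // scheduler bypass: the recycled task runs next on this thread
  }
```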
Note
Greener (and easier) parallel programming ☺
The embracing of composability, continuations, and task recycling has a powerful impact on making parallel programming much easier simply by using TBB. Consider that recycling has gained favor around the world, and recycling of tasks really does help conserve energy too! Join the movement for greener parallel programming – it doesn’t hurt that it makes effective programming easier too!
Scheduler bypassing and task recycling are powerful tools that can result in significant improvements and code optimizations. They are actually used to implement the high-level templates that were presented in Chapters 2 and 3, and we can also exploit them to design other tailored high-level templates that cater to our needs. Flow Graph (Chapter 3, with more coming in Chapter 17) encourages use of continue_node (and other nodes with potential for scheduler bypass). In the next section, we present an example in which we leverage the low-level task API and evaluate its impact, but before that, check out our “checklist.”
Task Interface Checklist
Resorting to the task interface is advisable for fork-join parallelism with lots of forks, so that the task stealing can cause sufficient breadth-first behavior to occupy threads, which then conduct themselves in a depth-first manner until they need to steal more work. In other words, the task scheduler’s fundamental strategy is “breadth-first theft and depth-first work.” The breadth-first theft rule raises parallelism sufficiently to keep threads busy. The depth-first work rule keeps each thread operating efficiently once it has sufficient work to do.
Always use new(allocation_method) T to allocate a task, where allocation_method is one of the allocation methods of class task (see Appendix B, Figure B-76). Do not create local or file-scope instances of a task.
All siblings should be allocated before any start running, unless you are using allocate_additional_child_of. We will elaborate on this in the last section of the chapter.
Exploit continuation passing, scheduler bypass, and task recycling to squeeze out maximum performance.
If a task completes, and was not marked for re-execution (recycling), it is automatically destroyed. Also, its successor’s reference count is decremented, and if it hits zero, the successor is automatically spawned.
One More Thing: FIFO (aka Fire-and-Forget) Tasks
So far, we have seen how tasks are spawned and the result of spawning a task: the thread that spawns the task is likely the one that dequeues it, in LIFO (last-in first-out) order (if no other thread steals the spawned task). As we said, this behavior has some beneficial implications in terms of locality and in restraining the memory footprint thanks to the “depth-first work.” However, a spawned task can become buried in the local queue of the thread if a bunch of tasks are spawned after it.
If we prefer FIFO-like execution order, a task should be enqueued using the enqueue function instead of the spawn one, as follows:
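The snippet is not reproduced above; with the classic task API it boils down to something like this, where MyTask is a placeholder for a class derived from tbb::task:

```cpp
tbb::task& t = *new(tbb::task::allocate_root()) MyTask{};
tbb::task::enqueue(t);   // fire-and-forget: it will eventually run, even if nobody waits for it
```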
A spawned task can be postponed by the scheduler until it is waited upon, but an enqueued task will be eventually executed even if there is no thread explicitly waiting on the task. Even if the total number of worker threads is zero, a special additional worker thread is created to execute enqueued tasks.
Spawned tasks are scheduled in a LIFO-like order (the most recently spawned is started next), but enqueued tasks are processed in roughly (not precisely) FIFO order (started in approximately the order they entered the queue – the “approximation” gives TBB some flexibility to be more efficient than a strict policy would allow).
Spawned tasks are ideal for recursive parallelism in order to save memory space thanks to a depth-first traversal, but enqueued tasks can prohibitively consume memory for recursive parallelism since the recursion will expand in a breadth-first traversal.
Spawned parent tasks should wait for their spawned children to complete, but enqueued tasks should not be waited upon because other enqueued tasks from unrelated parts of the program might have to be processed first. The recommended pattern for using an enqueued task is to have it asynchronously signal its completion. Essentially, enqueued tasks should be allocated as root, instead of as children that are then waited upon.
In Chapter 14, enqueued tasks are also illustrated in the context of prioritizing some tasks over others. Two additional use cases are described in the Threading Building Blocks Design Patterns manual (see “For More Information” at the end of the chapter); they are two design patterns in which enqueued tasks come in handy. In the first one, a GUI thread must remain responsive even when a long-running task is launched by the user. In the proposed solution, the GUI thread enqueues a task but does not wait for it to finish. The task does the heavy lifting and then notifies the GUI thread with a message before dying. The second design pattern is also related to assigning nonpreemptive priorities to different tasks.
Putting These Low-Level Features to Work
In our task parallelization strategy, the basic unit of work is the computation performed by function foo at each (i,j) cell of the matrix. Without loss of generality, we assume that the computational load for each cell will be controlled by the gs (grainsize) parameter of the foo function. That way, we can define the granularity of the tasks, and therefore, we can study the performance of different implementations depending on the task granularity, as well as situations with homogeneous or heterogeneous task workloads.
In Figure 10-17(b), the arrows show the data dependence flow for our wavefront problem. For example, after the execution of the upper-left task (1, 1), which does not depend on any other task, two new tasks can be dispatched (the one below, (2, 1), and the one to the right, (1, 2)). This dependence information can be captured by a 2D matrix of counters, like the one we show in Figure 10-17(b). The value of each counter indicates how many tasks we still have to wait for. Only tasks whose counter has been nullified can be dispatched.
Note that in Figure 10-19, we use allocate_additional_child_of(*parent()) as the allocation method for the new tasks. By using this allocation method, we can add children while others are running. On the positive side, this allows us to save some coding that would otherwise have been necessary to ensure that all child tasks are allocated before any of them is spawned (since whether the east task, the south task, or both are ready to be dispatched is only known at run time). On the negative side, this allocation method requires that the parent’s ref_count be atomically updated (incremented when an “additional child” is allocated and decremented when any child dies). Since we are using allocate_additional_child_of(*parent()), all created tasks will be children of the same parent. The first task of the task space is task (1, 1), and it is spawned with
and the parent of this root task is the dummy task that we already introduced in Figure 10-14. Then, all the tasks created in this code atomically update the ref_count of the dummy task.
Another caveat of using the allocate_additional_child_of allocation method is that the user (that’s us) has to ensure that the parent’s ref_count does not prematurely reach 0 before the additional child is allocated. Our code already accounts for this eventuality: a task t that allocates an additional child c guarantees that its parent’s ref_count is at least one, because t only decrements its parent’s ref_count when dying (i.e., after allocating c).
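Figure 10-19 is not reproduced here; a minimal sketch consistent with its description, assuming globals n, gs, the data matrix A, the per-cell kernel foo, and a matrix of atomic dependency counters (CellTask is a hypothetical name for the task class), might look like this:

```cpp
#include <tbb/task.h>
#include <atomic>

// Assumed to exist elsewhere (names follow the text): the problem size n, the
// grain size gs, the data matrix A, the kernel foo, and one dependency counter
// per cell, initialized to the number of predecessors (0, 1, or 2).
extern int n, gs;
extern double** A;
extern std::atomic<int>** counters;
void foo(double** A, int i, int j, int gs);

class CellTask : public tbb::task {
  int i, j;                                   // coordinates of the cell this task computes
public:
  CellTask(int i_, int j_) : i{i_}, j{j_} {}
  tbb::task* execute() override {
    foo(A, i, j, gs);                         // do the work of cell (i,j)
    if (j+1 < n && --counters[i][j+1] == 0)   // east neighbor has no pending dependences
      spawn(*new(allocate_additional_child_of(*parent())) CellTask{i, j+1});
    if (i+1 < n && --counters[i+1][j] == 0)   // south neighbor has no pending dependences
      spawn(*new(allocate_additional_child_of(*parent())) CellTask{i+1, j});
    return nullptr;
  }
};
```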
In Chapter 2, the parallel_do_feeder template was already presented to illustrate a different wavefront application: forward substitution. This template essentially implements a work-list algorithm, in such a way that new tasks can be added dynamically to the work-list by invoking the parallel_do_feeder::add() member function. We call wavefront_v2_feeder the version of the wavefront code that relies on parallel_do_feeder and, as in Figure 2-19 of Chapter 2, uses feeder.add() instead of the spawn calls of Figure 10-19.
Now, let’s also exploit the continuation-passing and task recycling styles. In our wavefront pattern, each task has the opportunity to spawn two new tasks (its east and south neighbors). We can avoid the spawn of one of them by returning a pointer to the next task, so instead of spawning a new task, the current task recycles into the new one. As we have explained, with this we achieve two goals: reducing the number of task allocations and calls to spawn(), and saving the time needed to get new tasks from the local queue. The resulting version is called wavefront_v4_recycle, and its main advantage is that it reduces the number of spawns from n×n − 2n (the number of spawns in the previous versions) to n − 2 (approximately the size of a column). See the companion source code for the complete implementation.
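A sketch of the scheduler-bypass part of this change is shown below; wavefront_v4_recycle additionally reuses this for the next cell instead of allocating a fresh task, following the recycle_as_* mechanics shown earlier (see the companion code for the full version):

```cpp
  // Same CellTask as before; only execute() changes: the east neighbor, when
  // ready, is returned instead of spawned, so this thread runs it next.
  tbb::task* execute() override {
    foo(A, i, j, gs);
    tbb::task* next = nullptr;
    if (i+1 < n && --counters[i+1][j] == 0)   // south neighbor ready: spawn it
      spawn(*new(allocate_additional_child_of(*parent())) CellTask{i+1, j});
    if (j+1 < n && --counters[i][j+1] == 0)   // east neighbor ready: bypass the scheduler
      next = new(allocate_additional_child_of(*parent())) CellTask{i, j+1};
    return next;                              // nullptr if the east cell is not ready yet
  }
```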
For huge wavefront problems, it may be relevant to reduce the footprint of each allocated task. Depending on whether or not you feel comfortable using global variables, you can consider storing the shared global state of all the tasks (n, gs, A, and counters) in global variables. This alternative is implemented in wavefront_v6_global, and it is provided in the directory with the source code of this chapter’s examples.
It is clear that the v5 version is the best solution in this experiment. In fact, we measured speedups for other, finer-grained sizes and found that the finer the granularity, the larger the improvement of v4 and v5 over v1 and v2. It is also interesting to see that a great deal of the improvement is due to the recycling optimization, as shown by the enhancement of v4 over the v1 version. A more elaborate study was conducted by A. Dios in the papers listed at the end of the chapter.
Since the performance of the wavefront algorithms decreases as the task workload becomes finer grained, a well-known technique to counteract this trend is tiling (see the Glossary for a brief definition). By tiling, we achieve several goals: we better exploit locality, since each task works within a space-confined region of data for some time; we reduce the number of tasks (and therefore the number of allocations and spawns); and we save some wavefront bookkeeping overhead (memory space and initialization time of the counter/dependence matrix, which is now smaller because it requires one counter per tile instead of one per matrix element). After coarsening the grain of the tasks via tiling, we are again free to go for the v1 or v2 implementations, right? However, the downside of tiling is that it reduces the number of independent tasks (they are coarser, but there are fewer of them). Then, if we need to scale our application to a large number of cores and the problem does not grow in size at the same pace, we probably have to squeeze out the last drop of available performance using the TBB low-level features. In challenging situations like these, we have to demonstrate our outstanding command of TBB and show that we have successfully honed our parallel programming skills.
Summary
In this chapter, we have delved into the task-based alternatives that can be particularly useful for implementing recursive, divide-and-conquer, and wavefront applications, among others. We have used the Fibonacci sequence as a running example, which we first implemented in parallel with the already discussed high-level parallel_invoke. We then started diving into deeper waters by using the medium-level API provided by the task_group class. It is, though, the task interface that offers the greatest degree of flexibility to cater to our specific optimization needs. TBB tasks underpin the high-level templates that were presented in the first part of the book, but we can also get our hands on them to build our own patterns and algorithms, leveraging advanced techniques such as continuation passing, scheduler bypassing, and task recycling. For even more demanding developers, more possibilities are available thanks to the task priority, task affinity, and task enqueue features that we will cover in the next chapter. We can’t wait to see what you can create and develop out of these powerful tools that are now in your hands.
For More Information
A. Dios, R. Asenjo, A. Navarro, F. Corbera, E.L. Zapata, A case study of the task-based parallel wavefront pattern, Advances in Parallel Computing: Applications, Tools and Techniques on the Road to Exascale Computing, ISBN: 978-1-61499-040-6, Vol. 22, pp. 65–72, IOS Press BV, Amsterdam, 2012 (extended version available here: www.ac.uma.es/~compilacion/publicaciones/UMA-DAC-11-02.pdf ).
A. Dios, R. Asenjo, A. Navarro, F. Corbera, E.L. Zapata, High-level template for the task-based parallel wavefront pattern, IEEE Intl. Conf. on High Performance Computing (HiPC 2011), Bengaluru (Bangalore), India, December 18–21, 2011. Implements a high-level template on top of TBB tasks to ease the implementation of wavefront algorithms.
González Vázquez, Carlos Hugo, Library-based solutions for algorithms with complex patterns of parallelism, PhD report, 2015. http://hdl.handle.net/2183/14385 . Describes three complex parallel patterns and addresses them by implementing high-level templates on top of TBB tasks.
- Intel TBB Design Patterns:
GUI thread: http://software.intel.com/en-us/node/506119
Priorities: http://software.intel.com/en-us/node/506120
Wavefront: http://software.intel.com/en-us/node/506110