This chapter contains some key tips on getting top performance from flow graphs in TBB. The less structured nature of the TBB flow graph APIs offers an expressiveness that requires some thinking to get the best scalable performance – we dive into details in this chapter that let us tune flow graphs to their full potential.
In Chapter 3, we introduced the classes and functions in the tbb::flow namespace and how they can be used to express simple data flow and dependency graphs. In this chapter, we discuss some of the more advanced questions and issues that arise when using TBB flow graphs. As in Chapter 16, much of our discussion will revolve around granularity, effective memory use, and creating sufficient parallelism. But because the flow graph APIs let us express parallelism that is less structured than the parallel algorithms described in Chapter 16, we will also discuss some dos and don’ts to be aware of when architecting a flow graph.
The section “Key FG Advice: Dos and Don’ts,” starting on page 480, gives very specific rules of thumb that are invaluable when using flow graphs with TBB.
We conclude this chapter with a brief overview of the Flow Graph Analyzer (FGA), a tool available within Intel Parallel Studio XE. It has strong support for the graphical design and analysis of TBB flow graphs. While using FGA is not required when working with flow graphs, visualizing graphs during design and analysis can be very helpful. The tool is freely available to everyone, and we highly recommend it for anyone doing serious TBB flow graph work.
Optimizing for Granularity, Locality, and Parallelism
In this section, we focus on the same three concerns that drove our discussions in Chapter 16. We first look at the impact of node granularity on performance. Because flow graphs are used for less structured algorithms, we need to consider how parallelism is introduced as we discuss granularity – does the structure require a significant amount of stealing or is the generation of tasks spread well across the threads? Also, we may want to use some very small nodes in a flow graph simply because they make the design clearer – in such cases, we describe how a node with a lightweight execution policy can be used to limit overheads. The second issue we will address is data locality. Unlike the TBB parallel algorithms, the flow graph API does not provide abstractions like Ranges and Partitioners; instead, it is designed to enhance locality naturally. We will discuss how threads follow data to exploit locality. Our third issue is creating sufficient parallelism. Just as in Chapter 16, optimizing for granularity and locality sometimes comes at the cost of restricted parallelism – we need to be sure we walk this tightrope carefully.
Node Granularity: How Big Is Big Enough?
In Chapter 16, we discussed Ranges and Partitioners and how these can be used to ensure that the tasks created by the TBB generic algorithms are large enough to amortize scheduling overheads while still being small enough to provide enough independent work items for scalability. The TBB flow graph does not have support for Ranges and Partitioners, but we still need to be concerned about task granularity.
To see if our rule of thumb for 1 microsecond tasks that we introduced in Chapter 16 applies as well to flow graph nodes as it does to parallel algorithm bodies, we will explore a few simple microbenchmarks that capture the extremes that can exist in flow graphs. We will compare the execution times of four functions and use different amounts of work per node execution. We will refer to these functions as Serial, FG loop, Master loop, and FG loop per worker.
It is our belief that studying these examples (Figures 17-1 to 17-4) is critical to gaining an intuitive grasp of the key issues that separate highly scalable uses of the flow graph from disappointing ones. The APIs themselves, fully documented in Appendix B, do not provide this education – we hope you will study these examples enough to grasp the concepts, as we believe this will make you much better at getting the most out of TBB flow graphs (peek at Figure 17-5 for a quantification of the performance benefits of understanding these issues!).
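To make the flavor of these microbenchmarks concrete, here is a minimal sketch of what the FG loop variant might look like (the helper spinWaitForAtLeast, the loop structure, and all names are our own assumptions; the book's actual listings appear in Figures 17-1 to 17-4): a single serial multifunction_node whose body spins for the desired time and then sends the next iteration index back to its own input port, so the graph behaves like a serial loop.

```cpp
#include <tbb/flow_graph.h>
#include <tbb/tick_count.h>
#include <tuple>

// Busy-wait for approximately sec seconds
static void spinWaitForAtLeast(double sec) {
  tbb::tick_count t0 = tbb::tick_count::now();
  while ((tbb::tick_count::now() - t0).seconds() < sec) {}
}

void fgLoop(int N, double spin_time) {
  tbb::flow::graph g;
  typedef tbb::flow::multifunction_node<int, std::tuple<int>> mfn_t;
  mfn_t n(g, tbb::flow::serial,
    [N, spin_time](const int& i, mfn_t::output_ports_type& p) {
      spinWaitForAtLeast(spin_time);        // the node "body" work
      if (i + 1 < N)
        std::get<0>(p).try_put(i + 1);      // try_put happens inside the body
    });
  tbb::flow::make_edge(tbb::flow::output_port<0>(n), n);  // loop back to itself
  n.try_put(0);
  g.wait_for_all();
}
```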
Unless otherwise noted, all performance results presented in this chapter were collected on a single socket server with an Intel Xeon Processor E3-1230 with four cores supporting two hardware threads per core; the processor has a base frequency of 3.4 GHz, a shared 8 MB L3 cache, and per-core 256 KB L2 caches. The system was running SUSE Linux Enterprise Server 12. All samples were compiled using the Intel C++ Compiler 19.0 with Threading Building Blocks 2019, using the compiler flags “-std=c++11 -O2 -tbb”.
We can further understand the performance of our microbenchmarks by collecting a trace and viewing the results in Flow Graph Analyzer (FGA) – FGA is described in more detail at the end of this chapter. Figure 17-6 shows per-thread timelines for the different functions when using a spin-wait time of 1 microsecond. These timelines, which are all of the same length, show the work done by each thread over time. The gaps (in gray) in the timelines indicate when a thread is not actively executing a node’s body. In Figure 17-6(a), we see the behavior of FG loop , which acts like a serial loop. But we can see that the small gap between the try_put in the body and the exit from the task allows the tasks to ping-pong between the threads since they are able to steal each task as it is spawned. This partially explains the fairly large overheads for this microbenchmark shown in Figure 17-5. As we explain later in this chapter, most functional nodes use scheduler bypass to follow their data to the next node when possible (see the discussion on Pipelines and data locality and thread affinity in Chapter 16 for a more detailed discussion of why scheduler bypass improves cache performance). Since a multifunction_node puts output messages to its output ports directly inside of the body implementation, it cannot immediately follow the data to the next node using scheduler bypass – it has to finish its own body first! A multifunction_node therefore does not use scheduler bypass to optimize for locality. In any case, this makes the performance in Figure 17-6(a) a worst-case overhead, since scheduler bypass is not used.
In Figure 17-6(b), we see the case where the master thread is generating all of the tasks and the workers must steal each task, but tasks can be executed in parallel once they are stolen. Because the worker threads must steal each task, they are much slower at finding tasks than the master thread. The master thread is continually busy in Figure 17-6(b) – it can quickly pop the next task from its local deque – while the worker threads’ timelines show gaps during which they are fighting with each other to steal their next task from the master’s local deque.
Looking at these extremes of behavior and noting the performance in Figure 17-5, we feel comfortable recommending a similar rule of thumb for flow graph nodes. While a pathological case, like Master loop, shows a limited speedup of 2.8 with a 1 microsecond body, it still shows a speedup. If the work is more balanced, such as with FG loop per worker, a 1 microsecond body provides a good speedup. With these caveats in mind, we again recommend a 1 microsecond execution time as a crude guideline:
Rule of Thumb
Flow graph nodes should be at least 1 microsecond in execution time in order to profit from parallel execution. This translates to several thousand CPU cycles – if you prefer using cycles, we suggest a 10,000 cycle rule of thumb.
Just like with the TBB algorithms, this rule does not mean that we must avoid nodes smaller than 1 microsecond at all costs. Only if our flow graph’s execution time is dominated by small nodes do we really have a problem. If we have a mix of nodes with different execution times, the overhead introduced by the small nodes may be negligible compared to the execution time of the larger nodes.
What to Do If Nodes Are Too Small
If some of the nodes in a flow graph are smaller than the recommended 1 microsecond threshold, there are three options: (1) do nothing at all if the node does not have significant impact on the total execution time of the application, (2) merge the node with other surrounding nodes to increase granularity, or (3) use the lightweight execution policy.
If the node’s granularity is small, but its contribution to total execution time is also small, then the node can be safely ignored; just leave it as it is. In these cases, clarity of design may trump any inconsequential efficiency gained.
If the node’s granularity has to be addressed, one option is to merge it with surrounding nodes. Does the node really need to be encapsulated separately from its predecessors and successors? If the node has a single predecessor or a single successor and the same concurrency level, it might be easily combined with those nodes. If it has multiple predecessors or successors, then perhaps the operations that are performed by the node can be copied into each of the nodes. In any case, merging the nodes together can be an option if the merging does not change the semantics of the graph.
Finally, the node can be changed to use a lightweight execution policy via a template argument when the node is constructed. For example:
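```cpp
// A minimal sketch (the node and its body are our own placeholder, not the
// book's listing): the third template argument selects the lightweight policy.
tbb::flow::graph g;
tbb::flow::function_node<int, int, tbb::flow::lightweight>
  tiny_node(g, tbb::flow::serial,
            [](int v) -> int {
              return v + 1;   // a body far below the 1 microsecond guideline
            });
```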
This policy indicates that the body of the node contains a small amount of work and should, if possible, be executed without the overhead of scheduling a task.
There are three lightweight policies to choose from: queueing_lightweight, rejecting_lightweight, and lightweight. These policies are described in detail in Appendix B. All of the functional nodes, except source_node, support lightweight policies. A lightweight node may not spawn a task to execute the body; instead, it may execute the body immediately inside of the try_put, within the context of the calling thread. This means that the overheads of spawning are removed – but there is no opportunity for other threads to steal the task, so parallelism is restricted!
The lightweight policy cannot limit parallelism for the one chain case, since there is no parallelism in this graph to begin with. We therefore see in Figure 17-8 that it improves performance for all cases, although its impact becomes less significant as the node granularity increases. For the one chain case, the ratio approaches 1.0 as the overhead of spawning tasks becomes negligible compared to the body’s spin time. The two-chain case does have potential parallelism. However, if all of the nodes use a lightweight policy, both chains will be executed by the thread that executes the first multifunction_node and the potential parallelism will be eliminated. As we might expect then, as we approach our rule of thumb execution time of 1 microsecond, the benefits of the lightweight policy are overshadowed by the restricted parallelism. Even if the nodes spin for 0.1 microsecond, the ratio drops below 1. The ratio approaches 0.5 as the serialization of the graph results in the complete loss of our expected speedup of 2 when using two chains.
Addressing granularity issues through merging of nodes, or by using the lightweight policy, can decrease overheads, but as we see, they can also limit scalability. These “optimizations” can result in significant improvements, but must be applied judiciously or else they may do more harm than good.
Memory Usage and Data Locality
Unlike the TBB parallel algorithms that iterate over data structures, a flow graph passes data structures from node to node. The messages can be primitive types, objects, pointers or, in the case of a dependence graph, tbb::flow::continue_msg objects. For best performance, we need to consider both data locality and memory consumption. We will discuss both of these issues in this section.
Data Locality in Flow Graphs
Data passes between nodes, and when a node receives a message, it executes its body on the message as a TBB task. The task is scheduled using the same work-stealing dispatchers used by all TBB tasks. In Figure 17-6(a) when a serial loop was executed as a flow graph, we saw that a task spawned by one thread may be executed by another. We noted however that this was due in part to the microbenchmark using multifunction_node objects, which do not use scheduler bypass to optimize for performance.
In general, the other functional nodes, including source_node, function_node, and continue_node, use scheduler bypass if one of the successors can be immediately run. If the data accessed by one of these nodes fits into a data cache, then it can be reused by the same thread when it executes the successor.
Since we can benefit from locality in a flow graph, it is worth considering data size and even breaking the data into smaller pieces that can benefit from locality through scheduler bypass. For example, we can revisit the matrix transposition kernel that we used in Chapter 16 as an example to demonstrate this effect. We will now pass three pairs of a, b matrices using the FGMsg structure shown in Figure 17-9. You can see the serial, cache oblivious and parallel_for implementations of the matrix transposition kernel in Chapter 16 in Figure 16-6 through Figure 16-13.
Our simple implementation sends the full matrices, and these are processed, in a non-cache-oblivious fashion, by transpose. As we might expect, this does not perform well. On our test machine, it was only 8% faster than executing the non-cache-oblivious serial implementation of our matrix transposition from Chapter 16 three times in a row, once on each pair of matrices. This isn’t very surprising since the benchmark is memory bound – trying to execute multiple transpositions in parallel doesn’t help much when we can’t feed one transposition with the data it needs from memory. If we compare our simple flow graph to the serial cache-oblivious transposition from Chapter 16, it looks even worse, taking 2.5 times longer to process the three pairs of matrices when executed on our test machine. Luckily, there are many options for improving the performance of this flow graph. For example, we can use a serial cache-oblivious implementation in the transpose node. Or, we can use the parallel_for implementation from Chapter 16 that uses a blocked_range2d and simple_partitioner in the transpose node. We will see shortly that each of these will greatly improve our base case speedup of 1.08.
However, we might also send blocks of the matrices as messages instead of sending each pair of a and b matrices as a single big message. To do so, we extend our message structure to include a blocked_range2d:
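```cpp
// A sketch reconstructed from the description above (field names are our
// assumptions; the book's version extends the FGMsg structure of Figure 17-9):
#include <tbb/blocked_range2d.h>

struct FGMsg {
  double *a;                          // source matrix
  double *b;                          // destination (transposed) matrix
  int order;                          // matrix dimension
  tbb::blocked_range2d<int, int> r;   // the tile of the matrices this message covers
  FGMsg(double *a_, double *b_, int order_,
        const tbb::blocked_range2d<int, int>& r_)
    : a(a_), b(b_), order(order_), r(r_) {}
};
```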
It might be surprising that the tiled flow graph version with nested parallel_fors did not perform as well as the tiled flow graph without nested parallelism. In Chapter 9, we claimed that we can use nested parallelism with impunity in TBB – so what went wrong? The harsh reality is that once we start tuning the performance of our TBB applications – we often need to trade away full composability for performance (see the Aspects of Composability Sidebar). In this case, the nested parallelism interfered with the cache optimizations we were carefully trying to implement. Each node was sent a tile to process that was a good fit for its data cache – with nested parallelism, we then undid this perfect fit by sharing the tile with other threads.
Aspects of Composability
1. Correctness (as an absolute)
2. Ability to use (as a practical matter)
3. Performance (as an aspiration)
In the first, we hope we can mix and match code without concerns that it will suddenly malfunction (get the wrong answer). TBB gives us this ability, and it is largely a solved problem – the one wrinkle being that nondeterministic order of execution will make answers vary when using finite precision math such as native floating-point arithmetic. We discuss that in Chapter 16, offering approaches to maintain the “correctness” aspects of composability in this light.
In the second, we hope that the program will not crash. This is a practical matter in many cases, because the most common problem (unbounded memory usage) could theoretically be solved with infinite-sized memories. ☺ TBB largely solves this aspect of composability, giving it an advantage over programming models that do not (such as OpenMP). TBB does need more help here for the less structured flow graphs, so we discuss using limiter_nodes with flow graphs to keep memory usage in check – especially important in large flow graphs.
Finally, for optimal performance, we know of no general solution to full performance composability. The reality is that highly optimized code competing with other code running on the same hardware will interfere with the optimal performance of either code. This means we can benefit from manually tuning the code. Fortunately, TBB gives us control to tune, and tools like Flow Graph Analyzer help give us insights to guide our tuning. Once tuned, it is our experience that code can work well and feel composable – but the technology to blindly use code and get top performance does not exist. “Good enough” performance may happen often, but “great” requires work.
We shouldn’t get too focused on the specifics of the results in Figure 17-11 – this is, after all, a single memory-bound microbenchmark. But it does make clear that we can benefit by considering the size of our nodes, not only from a granularity perspective, but also from a data locality perspective. When we moved from a naïve implementation that sent whole arrays and did not implement tuned kernels in the nodes to our more cache-aware tiled flow graph version, we saw a significant performance improvement.
Picking the Best Message Type and Limiting the Number of Messages in Flight
As we allow messages into a graph, or make copies as we split them along multiple paths through a flow graph, we consume more memory. In addition to worrying about locality, we may also need to limit memory growth.
When a message is passed to a node in a data flow graph, it may be copied into the internal buffers in that node. For example, if a serial node needs to defer the spawning of a task, it holds incoming messages in a queue until it is legal to spawn a task to process them. If we pass very large objects around in our flow graph, this copying can be expensive! Therefore, when possible, it is better to pass around pointers to large objects instead of the objects themselves.
Of course, we need to be careful when we use pointers to objects. By passing pointers and not objects, multiple nodes may have access to the same object at the same time through the shared_ptr. This is especially true if your graph relies on functional parallelism, where the same message is broadcast to multiple nodes. The shared_ptr will correctly handle the increments and decrements of the reference counts, but we need to be sure that we are properly using edges to prevent any potential race conditions when accessing the object that is pointed to.
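As a minimal sketch of the idea (BigObject's contents and the node body are placeholders of our own; BigObjectPtr matches the alias used later in the chapter):

```cpp
#include <tbb/flow_graph.h>
#include <array>
#include <memory>

struct BigObject { std::array<double, 1024*1024> data; };
typedef std::shared_ptr<BigObject> BigObjectPtr;

void pointerPassingExample() {
  tbb::flow::graph g;
  tbb::flow::function_node<BigObjectPtr, BigObjectPtr> stage(
    g, tbb::flow::unlimited,
    [](BigObjectPtr p) -> BigObjectPtr {
      p->data[0] += 1.0;   // only the shared_ptr is copied between nodes,
      return p;            // not the multi-megabyte payload
    });
  stage.try_put(std::make_shared<BigObject>());
  g.wait_for_all();
}
```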
As we saw in our discussion of how nodes map to tasks, when messages arrive at functional nodes, tasks may be spawned or messages may be buffered. When designing a data flow graph, we should not forget about these buffers and tasks, and their memory footprint.
For example, let’s consider Figure 17-13. There are two nodes, serial_node and unlimited_node; both contain a long spin loop. The for loop quickly allocates a large number of inputs for both nodes. Node serial_node is serial and so its internal buffer will grow quickly as it receives messages faster than its tasks complete. In contrast, node unlimited_node will immediately spawn tasks as each message arrives – quickly flooding the system with a very large number of tasks, many more than the number of worker threads. These spawned tasks will be buffered in the internal worker thread queues. In either case, our graph may quickly consume a large amount of memory because BigObject messages are allowed to enter the graph more quickly than they can be processed.
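A sketch of the pattern just described (reusing the BigObjectPtr alias from the previous sketch; the bodies stand in for long-running work):

```cpp
void unboundedGrowthExample() {
  tbb::flow::graph g;
  tbb::flow::function_node<BigObjectPtr> serial_node(g, tbb::flow::serial,
    [](BigObjectPtr) { /* long spin loop */ return tbb::flow::continue_msg(); });
  tbb::flow::function_node<BigObjectPtr> unlimited_node(g, tbb::flow::unlimited,
    [](BigObjectPtr) { /* long spin loop */ return tbb::flow::continue_msg(); });

  for (int i = 0; i < 1000000; ++i) {
    BigObjectPtr p = std::make_shared<BigObject>();
    serial_node.try_put(p);      // queued in the node's internal buffer
    unlimited_node.try_put(p);   // spawns a task immediately
  }
  g.wait_for_all();
}
```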
There are three common approaches to managing resource consumption in a flow graph: (1) use a limiter_node, (2) use concurrency limits, and/or (3) use a token-passing pattern.
A limiter_node maintains an internal count of the messages that pass through it. A message sent to the decrement port on a limiter_node decrements the count, allowing additional messages to pass through. If the count is equal to the node’s threshold, any new messages that arrive at its input port are rejected.
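A sketch of this first approach (the threshold, node names, and bodies are our own choices; BigObjectPtr is reused from the earlier sketch): the serial consumer's output is wired back to the limiter's decrement port, so each completed message admits a new one.

```cpp
void limiterExample() {
  tbb::flow::graph g;
  int count = 0;
  tbb::flow::source_node<BigObjectPtr> source(g,
    [&count](BigObjectPtr& out) -> bool {
      if (count >= 1000) return false;       // stop after 1000 messages
      out = std::make_shared<BigObject>();
      ++count;
      return true;
    }, false);

  tbb::flow::limiter_node<BigObjectPtr> limiter(g, 3);    // at most 3 in flight
  tbb::flow::function_node<BigObjectPtr, tbb::flow::continue_msg> serial_node(
    g, tbb::flow::serial,
    [](BigObjectPtr) { /* process */ return tbb::flow::continue_msg(); });

  tbb::flow::make_edge(source, limiter);
  tbb::flow::make_edge(limiter, serial_node);
  tbb::flow::make_edge(serial_node, limiter.decrement);   // free a slot
  source.activate();
  g.wait_for_all();
}
```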
We can turn off the internal buffering for a function_node by constructing it with an execution policy, flow::rejecting or flow::rejecting_lightweight. The source_node in Figure 17-16 continues to generate new outputs only if they are being consumed.
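For example (a fragment of our own; g is a tbb::flow::graph as in the previous sketches):

```cpp
// With the rejecting policy, a busy serial node refuses new messages instead
// of buffering them, so an upstream source_node, which holds its last output
// until a successor accepts it, is throttled automatically.
tbb::flow::function_node<BigObjectPtr, tbb::flow::continue_msg,
                         tbb::flow::rejecting>
  serial_node(g, tbb::flow::serial,
              [](BigObjectPtr) { /* process */ return tbb::flow::continue_msg(); });
```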
The final common approach for limiting resource consumption in a data flow graph is to use a token-based system. As described in Chapter 2, the tbb::parallel_pipeline algorithm uses tokens to limit the maximum number of items that will be in flight in a pipeline. We can create a similar system using tokens and a reserving join_node as shown in Figure 17-17. In this example, we create a source_node source and a buffer_node token_buffer. These two nodes are connected to the inputs of a reserving join_node join. A reserving join_node, join_node< tuple< BigObjectPtr, token_t >, flow::reserving >, only consumes items when it can first reserve inputs at each of its ports. Since a source_node stops generating new messages when its previous message has not been consumed, the availability of tokens in the token_buffer limits the number of items that can be generated by the source_node. As tokens are returned to the token_buffer by node unlimited_node, they can be paired with additional messages generated by the source, allowing new source tasks to be spawned.
In Figure 17-18, we use int as the token type. In general, we can use any type as a token, even large objects or pointers. For example, we could use BigObjectPtr objects as the tokens if we want to recycle BigObject objects instead of allocating them for each new input.
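A sketch of the whole pattern (node names follow the description of Figure 17-17; the bodies, the token count, and the cutoff are our own assumptions, and BigObjectPtr is reused from the earlier sketch):

```cpp
typedef int token_t;

void tokenPassingExample() {
  tbb::flow::graph g;

  int count = 0;
  tbb::flow::source_node<BigObjectPtr> source(g,
    [&count](BigObjectPtr& out) -> bool {
      if (count >= 1000) return false;
      out = std::make_shared<BigObject>();
      ++count;
      return true;
    }, false);

  tbb::flow::buffer_node<token_t> token_buffer(g);

  typedef tbb::flow::join_node<std::tuple<BigObjectPtr, token_t>,
                               tbb::flow::reserving> join_t;
  join_t join(g);

  typedef tbb::flow::multifunction_node<std::tuple<BigObjectPtr, token_t>,
                                        std::tuple<token_t>> mfn_t;
  mfn_t unlimited_node(g, tbb::flow::unlimited,
    [](const mfn_t::input_type& in, mfn_t::output_ports_type& ports) {
      /* process std::get<0>(in) */
      std::get<0>(ports).try_put(std::get<1>(in));   // return the token
    });

  tbb::flow::make_edge(source, tbb::flow::input_port<0>(join));
  tbb::flow::make_edge(token_buffer, tbb::flow::input_port<1>(join));
  tbb::flow::make_edge(join, unlimited_node);
  tbb::flow::make_edge(tbb::flow::output_port<0>(unlimited_node), token_buffer);

  const int num_tokens = 3;
  for (token_t t = 0; t < num_tokens; ++t)
    token_buffer.try_put(t);        // limits items in flight to num_tokens
  source.activate();
  g.wait_for_all();
}
```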
Task Arenas and Flow Graph
Both implicit and explicit task arenas impact the behavior of TBB tasks and the TBB generic parallel algorithms. The arena in which tasks are spawned controls which threads can participate in executing the tasks. In Chapter 11, we saw how we can use implicit and explicit arenas to control the number of threads that participate in executing parallel work. In Chapters 12–14, we saw that explicit task arenas can be used with task_scheduler_observer objects to set the properties of threads as they join arenas. Because of the impact of task arenas on available parallelism and data locality, in this section, we take a closer look at how task arenas mix with flow graphs.
The Default Arena Used by a Flow Graph
When we construct a tbb::flow::graph object, the graph object captures a reference to the arena of the thread that constructed the object. Whenever a task is spawned to execute work in the graph, the tasks are spawned in this arena, not in the arena of the thread that caused the task to be spawned.
Why?
Well, TBB flow graphs are less structured than TBB parallel algorithms. TBB algorithms use fork-join parallelism and the behavior of TBB task arenas matches this pattern well – each master thread has its own default arena and so if different master threads execute algorithms concurrently, their tasks are isolated from each other in different task arenas. But with a TBB flow graph, there may be one or more master threads explicitly putting messages into the same graph. If the tasks related to these interactions are spawned in each master thread’s arena, some tasks from a graph would be isolated from other tasks from the same graph. This is very likely not the behavior we would like.
So instead, all tasks are spawned into a single arena, the arena of the thread that constructed the graph object.
Changing the Task Arena Used by a Flow Graph
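The key fact is the one noted above: a graph captures the arena of the thread that constructs it, and (as the example discussed later around Figure 17-30 relies on) calling reset() on the graph re-captures the arena of the calling thread. A minimal sketch of attaching a graph to an explicit arena:

```cpp
#include <tbb/task_arena.h>
#include <tbb/flow_graph.h>

void attachGraphToArena() {
  tbb::task_arena a2(2);                 // an explicit arena with two slots
  tbb::flow::graph g2;                   // currently tied to this thread's arena
  a2.execute([&g2]() { g2.reset(); });   // after the reset, g2's tasks spawn in a2
  // ... add nodes, run the graph, and wait_for_all as usual ...
}
```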
Setting the Number of Threads, Thread-to-Core Affinities, etc.
Now that we know how to associate task arenas with flow graphs, we can use all of the performance tuning optimizations described in Chapters 11–14 that rely on task arenas. For example, we can use task arenas to isolate one flow graph from another. Or, we can pin threads to cores for a particular task arena using a task_scheduler_observer object and then associate that arena with a flow graph.
Key FG Advice: Dos and Don’ts
The flow graph API is flexible – maybe too flexible. When first working with flow graph, the interface can be daunting since there are so many options. In this section, we provide several dos and don’ts that capture some of our experience when using this high-level interface. However, just like with our rule of thumb for node execution time, these are just suggestions. There are many valid patterns of usage that are not captured here, and we’re sure that some of the patterns we say to avoid may have valid use cases. We present these best-known methods, but your mileage may vary.
Do: Use Nested Parallelism
Just like with a pipeline, a flow graph can have great scalability if it uses parallel (flow::unlimited) nodes but can have limited scalability if it has serial nodes. One way to increase scaling is to use nested parallel algorithms inside of TBB flow graph nodes. TBB is all about composability, so we should use nested parallelism when possible.
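A minimal sketch (our own example, not a listing from the book): a serial node that regains scalability by running a tbb::parallel_for inside its body.

```cpp
#include <tbb/flow_graph.h>
#include <tbb/parallel_for.h>

void nestedParallelismExample() {
  tbb::flow::graph g;
  tbb::flow::function_node<int> node(g, tbb::flow::serial,
    [](int n) -> tbb::flow::continue_msg {
      tbb::parallel_for(0, n, [](int i) { /* per-element work */ });
      return tbb::flow::continue_msg();
    });
  node.try_put(100000);
  g.wait_for_all();
}
```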
Don’t: Use Multifunction Nodes in Place of Nested Parallelism
In Figure 17-20, for each message that the multifunction_node receives, it generates many output messages that flow into a function_node with unlimited concurrency. This graph will act a lot like a parallel loop, with the multifunction_node acting as the control loop and the function_node as the body. But it will require a lot of stealing to distribute the work, like the Master loop from Figures 17-3 and 17-5. While there may be valid uses of this pattern, it is likely more efficient to use a highly optimized parallel loop algorithm instead. This entire graph might be collapsed into a single node that contains a nested parallel_for, for example. Of course, whether or not this replacement is possible or desirable depends on the application.
Do: Use join_node, sequencer_node, or multifunction_node to Reestablish Order in a Flow Graph When Needed
Because a flow graph is less structured than a simple pipeline, we may sometimes need to establish an ordering of messages at points in the graph. There are three common approaches for establishing order in a data flow graph: use a key-matching join_node, use a sequencer_node, or use a multifunction_node.
For example, in Chapter 3, the parallelism in our stereoscopic 3D flow graph allowed the left and right images to arrive out of order at the mergeImageBuffersNode. In that example, we ensured that the correct two images were paired together as inputs to the mergeImageBuffersNode by using a tag-matching join_node. A tag-matching join_node is a type of key-matching join_node. By using this join_node type, inputs can arrive in different orders at the two input ports but will still be properly matched based on their tag or key. You can find more information on the different join policies in Appendix B.
Another way to establish order is to use a sequencer_node. A sequencer_node is a buffer that outputs messages in sequence order, using a user-provided body object to obtain the sequence number from the incoming message.
In Figure 17-21, we can see a three-node graph, with nodes first_node, sequencer, and last_node. We use a sequencer_node to reestablish the input order of the messages before the final serial output node last_node. Because function_node first_node is unlimited, its tasks can finish out of order and send their output as they complete. The sequencer_node reestablishes the input order by using the sequence number assigned when each message was originally constructed.
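A sketch of that graph (the message type and the bodies are our own; the node names follow the description above):

```cpp
struct SeqMsg { size_t seq; double value; };

void sequencerExample() {
  tbb::flow::graph g;
  tbb::flow::function_node<SeqMsg, SeqMsg> first_node(g, tbb::flow::unlimited,
    [](SeqMsg m) { m.value *= 2.0; return m; });        // may finish out of order
  tbb::flow::sequencer_node<SeqMsg> sequencer(g,
    [](const SeqMsg& m) -> size_t { return m.seq; });   // extract the sequence number
  tbb::flow::function_node<SeqMsg> last_node(g, tbb::flow::serial,
    [](const SeqMsg&) { /* consume in order */ return tbb::flow::continue_msg(); });

  tbb::flow::make_edge(first_node, sequencer);
  tbb::flow::make_edge(sequencer, last_node);
  for (size_t i = 0; i < 100; ++i)
    first_node.try_put(SeqMsg{i, double(i)});
  g.wait_for_all();
}
```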
As we can see, a sequencer_node can reestablish the order of the messages, but it does require us to assign the sequence number and also to provide a body to the sequencer_node that can obtain that number from an incoming message.
A final approach to establishing order is to use a serial multifunction_node . A multifunction_node can output zero or more messages on any of its output ports for a given input message. Since it is not forced to output a message for each incoming message, it can buffer incoming messages and hold them until some user-defined ordering constraint is met.
While Figure 17-22 shows how a multifunction_node can be used to reorder messages by sequence order, in general, any user-defined ordering or bundling of messages can be used.
Do: Use the Isolate Function for Nested Parallelism
As we discussed in Chapter 12, moonlighting is typically benign, which is the case here since we’re not computing anything real. But as we highlighted in our previous discussions about isolation, this behavior is not always benign and can lead to correctness issues, or decreased performance, in some cases.
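A minimal sketch of applying isolation inside a node body (our own example): wrapping the nested loop in this_task_arena::isolate prevents the thread from picking up unrelated flow graph tasks while it waits for the loop to finish.

```cpp
#include <tbb/flow_graph.h>
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

void isolatedNestedLoop() {
  tbb::flow::graph g;
  tbb::flow::function_node<int> n0(g, tbb::flow::serial,
    [](int n) -> tbb::flow::continue_msg {
      tbb::this_task_arena::isolate([n]() {
        tbb::parallel_for(0, n, [](int i) { /* per-element work */ });
      });
      return tbb::flow::continue_msg();
    });
  n0.try_put(100000);
  g.wait_for_all();
}
```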
Do: Use Cancellation and Exception Handling in Flow Graphs
In Chapter 15, we discussed task cancellation and exception handling when using TBB tasks in general. Since we are already familiar with this topic, we will only highlight the flow graph related aspects in this section.
Each Flow Graph Uses a Single task_group_context
If we don’t pass one to the constructor, a default object will be created for us.
Canceling a Flow Graph
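A minimal sketch (our own example) of canceling a graph through an explicitly supplied task_group_context:

```cpp
void cancelGraphExample() {
  tbb::task_group_context ctx;
  tbb::flow::graph g(ctx);            // the graph uses this context for its tasks
  tbb::flow::function_node<int> n(g, tbb::flow::unlimited,
    [](int) { /* long-running work */ return tbb::flow::continue_msg(); });

  for (int i = 0; i < 1000; ++i) n.try_put(i);
  ctx.cancel_group_execution();       // tasks that have not started will not run
  g.wait_for_all();
  // g must be reset() before it can be used again (see the next section)
}
```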
Resetting a Flow Graph After Cancellation
If a graph is canceled, whether directly or due to an exception, we need to reset the graph, g.reset(), before we can use it again. This resets the state of the graph – clearing internal buffers, putting the edges back into their initial states, and so on. See Appendix B for more details.
Exception Handling Examples
So far, none of this is very exceptional (pun intended); it’s just how exceptions should work.
The exception thrown in node node2 is not caught in the node’s body, so it will propagate to the thread that waits at the call to wait_for_all. If a node’s body throws an exception, the graph it belongs to is canceled. In this case, we see that there is no second “Caught” message, since node2 will only execute once.
And of course, if we want to re-execute the graph after we deal with the exception that we catch at the wait_for_all, we need to call g.reset() since the graph has been canceled.
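Putting those pieces together, a sketch along the lines of the chapter's example (node names follow the text; the bodies, message values, and output are our own):

```cpp
#include <tbb/flow_graph.h>
#include <iostream>
#include <stdexcept>

void exceptionExample() {
  tbb::flow::graph g;
  tbb::flow::function_node<int, int> node1(g, tbb::flow::serial,
    [](int v) { return v + 1; });
  tbb::flow::function_node<int> node2(g, tbb::flow::serial,
    [](int) -> tbb::flow::continue_msg {
      throw std::runtime_error("oops");        // not caught in the body
    });
  tbb::flow::make_edge(node1, node2);

  node1.try_put(1);
  node1.try_put(2);        // node2 never sees this message: the graph is canceled
  try {
    g.wait_for_all();      // the exception surfaces here
  } catch (const std::runtime_error& e) {
    std::cout << "Caught " << e.what() << std::endl;
  }
  g.reset();               // required before reusing the canceled graph
}
```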
Do: Set a Priority for a Graph Using task_group_context
We can set a priority on the task_group_context used by a graph, or we can pass in a task_group_context object with a preset priority to the graph’s constructor. In either case, this sets the priority for all of the tasks related to the graph. We can create one graph with a high priority and another graph with a low priority.
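A minimal sketch of the idea (our own example, assuming the TBB 2019 interfaces):

```cpp
void makePrioritizedGraphs() {
  tbb::task_group_context ctx_high, ctx_low;
  ctx_high.set_priority(tbb::priority_high);
  ctx_low.set_priority(tbb::priority_low);

  tbb::flow::graph g_high(ctx_high);   // all of g_high's tasks run at high priority
  tbb::flow::graph g_low(ctx_low);     // all of g_low's tasks run at low priority
  // ... add nodes to each graph, start them, and wait_for_all ...
}
```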
Shortly before the publication of this book, support for relative priorities for functional nodes was added to TBB as a preview feature. Using this feature, we can pass a parameter to a node’s constructor to give it a priority relative to other functional nodes. This interface was first provided in TBB 2019 Update 3. Interested readers can learn more details about this new functionality in the online TBB release notes and documentation.
Don’t: Make an Edge Between Nodes in Different Graphs
All graph nodes require a reference to a graph object as one of the arguments to their constructor. In general, it is only safe to construct edges between nodes that are part of the same graph. Connecting two nodes in different graphs can make it difficult to reason about graph behaviors, such as what task arenas will be used, if our calls to wait_for_all will properly detect graph termination, and so on. To optimize performance, the TBB library takes advantage of its knowledge about edges. If we connect two graphs by an edge, the TBB library will freely reach across this edge for optimization purposes. We may believe that we have created two distinct graphs, but if there are shared edges, TBB can start mixing their executions together in unexpected ways.
Figure 17-30 provides an example that uses the WhereAmIRunningBody to demonstrate an unexpected behavior. In this example, we create two nodes: g2_node and g4_node. The node g2_node is constructed with a reference to g2. The graph g2 is passed a reference to a task_group_context that has priority_normal and g2 is reset() in a task_arena with a concurrency of 2. We should therefore expect g2_node to execute with normal priority in an arena with 2 threads, right? The node g4_node is constructed such that we should expect it to execute with high priority in an arena with four threads.
From this simple example, we can see that this edge breaks the separation between the graphs. If we were using arenas a2 and a4 to control the number of threads, for work isolation or for thread affinity purposes, this edge will undo our efforts. We should not make edges between graphs.
Do: Use try_put to Communicate Across Graphs
In the previous “Don’t,” we decided that we should not make edges between graphs. But what if we really need to communicate across graphs? The least dangerous option is to explicitly call try_put to send a message from a node in one graph to a node in another graph. We don’t introduce an edge, so the TBB library won’t do anything sneaky to optimize the communication between the two nodes. Even in this case though, we still need to be careful as our example in Figure 17-31 demonstrates.
Here, we create a graph g2 that sends a message to graph g1 and then waits for both graphs g1 and g2. But the waiting is done in the wrong order!
But still, we can see that using explicit try_puts is not without dangers. We need to be very careful when graphs communicate with each other!
Do: Use composite_node to Encapsulate Groups of Nodes
In the previous two sections, we warned that communication between graphs can lead to errors. Often developers use more than one graph because they want to logically separate some nodes from others. Encapsulating a group of nodes is convenient if there is a common pattern that needs to be created many times or if there is too much detail in one large flat graph.
In both of these cases, we can use a tbb::flow::composite_node. A composite_node is used to encapsulate a collection of other nodes so they can be used like a first-class graph node. A simplified sketch of its interface (see Appendix B for the full definition) follows:
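```cpp
// A simplified sketch of the interface (our paraphrase of the TBB 2019
// headers; consult Appendix B for the authoritative declaration):
template <typename InputTuple, typename OutputTuple> class composite_node;

template <typename... InputTypes, typename... OutputTypes>
class composite_node<std::tuple<InputTypes...>, std::tuple<OutputTypes...>>
    : public tbb::flow::graph_node {
public:
  typedef std::tuple<tbb::flow::receiver<InputTypes>&...> input_ports_type;
  typedef std::tuple<tbb::flow::sender<OutputTypes>&...>  output_ports_type;

  composite_node(tbb::flow::graph& g);
  // map ports of the encapsulated nodes onto the composite_node's own ports
  void set_external_ports(input_ports_type&& input_ports,
                          output_ports_type&& output_ports);
  input_ports_type&  input_ports();
  output_ports_type& output_ports();
};
```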
If this token passing pattern is commonly used in our application, or by members of our development team, it might make sense to encapsulate it into its own node type, as shown in Figure 17-32(b). It also cleans up the high-level view of our application by hiding the details.
In Figure 17-34, MergeNode inherits from CompositeType, which is an alias for the following composite_node type (reconstructed here from the description below):
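```cpp
// reconstructed from the description that follows: two BigObjectPtr inputs
// and one BigObjectPtr output
typedef tbb::flow::composite_node<std::tuple<BigObjectPtr, BigObjectPtr>,
                                  std::tuple<BigObjectPtr>> CompositeType;
```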
The two template arguments indicate that a MergeNode will have two input ports, both of which receive BigObjectPtr messages, and a single output port that sends BigObjectPtr messages. The class MergeNode has a member variable for each node it encapsulates: a tokenBuffer, a join, and a combine node. These member variables are initialized in the member initializer list of the MergeNode constructor. In the constructor body, calls to tbb::flow::make_edge set up all of the internal edges. A call to set_external_ports is used to assign the ports from the member nodes to the external ports of the MergeNode. In this case, the first two input ports of join become the inputs of the MergeNode and the output of combine becomes the output of the MergeNode. Finally, because the node is implementing a token passing scheme, the tokenBuffer is filled with tokens.
While creating a new type that inherits from tbb::flow::composite_node may appear daunting at first, using this interface can lead to more readable and reusable code, especially as your flow graphs become larger and more complicated.
Introducing Intel Advisor: Flow Graph Analyzer
The Flow Graph Analyzer (FGA) tool is available in Intel Parallel Studio XE 2019 and later. It is provided as a feature of the Intel Advisor tool. Instructions for getting the tool can be found at https://software.intel.com/en-us/articles/intel-advisor-xe-release-notes .
FGA was developed to support the design, debugging, visualization, and analysis of graphs built using the TBB flow graph API. That said, many of the capabilities of FGA are generically useful for analyzing computational graphs, regardless of their origin. Currently, the tool has limited support for other parallel programming models including the OpenMP API.
For our purposes in this book, we will focus only on how the design and analysis workflows in the tool apply to TBB. We also use FGA to analyze some of the samples in this chapter. However, all of the optimizations presented in this chapter can be done with or without FGA. So, if you have no interest in using FGA, you can skip this section. But again, we believe there is significant value in this tool, so skipping it would be a mistake.
The FGA Design Workflow
The design workflow in FGA lets us graphically design TBB flow graphs, validate that they are correct, estimate their scalability, and, after we are satisfied with our design, generate a C++ implementation that uses the TBB flow graph classes and functions. FGA is not a full Integrated Development Environment (IDE) like Microsoft Visual Studio, Eclipse or Xcode. Instead, it gets us started with our flow graph design, but then we need to step outside of the tool to complete the development. However, if we use the design workflow in a constrained way, as we will describe later, iterative development in the designer is possible.
The typical design workflow starts with a blank canvas and project. As highlighted by the black circle numbered 1 in Figure 17-35, we select nodes in the node palette and place them on the canvas, connecting them together by drawing edges between their ports. The node palette contains all of the node types available in the TBB flow graph interface and provides tooltips that remind us about the functionality of each type. For each node on the canvas, we can modify its type-specific properties; for a function_node for example, we can provide the C++ code for the body, set a concurrency limit, and so on. We can also provide an estimated “weight” that represents the computational complexity of the node so that later we can run a Scalability Analysis to see if our graph will perform well.
Once we have drawn our graph on the canvas, we run a Rule Check that analyzes the graph looking for common mistakes and anti-patterns. The Rule Check results, highlighted by the black circle numbered 2 in Figure 17-35, show issues such as unnecessary buffering, type mismatches, suspicious cycles in the graph, and so on. In Figure 17-35, the Rule Check has discovered that there is a type mismatch between the input of our limiter_node and the output of our multifunction_node. In response, we can then, for example, modify the port output type of our multifunction_node to fix this issue.
When we have fixed all correctness issues uncovered by the Rule Check, we can then run a Scalability Analysis. The Scalability Analysis constructs a TBB flow graph in memory, replacing the computational node bodies with dummy bodies that actively spin for a time proportional to their “weight” property. FGA runs this model of our graph on various numbers of threads and provides a table of the speedups, for example:
Using these features, we can iteratively refine our graph design. Along the way, we can save our graph design in GraphML format (a common standard for representing graphs). When we are satisfied with our design, we can generate C++ code that uses the TBB flow graph interface to express our design. This code generator is more accurately viewed as a code wizard than an IDE since it does not directly support an iterative code development model. If we change the generated code, there is no way to reimport our changes into the tool.
Tips for Iterative Development with FGA
If we want to create a design that we can continue to tune from within FGA, we can use a constrained approach, where we specify node bodies that redirect to implementations that are maintained outside of FGA. This is necessary because there is no way to reimport modified C++ code back into FGA.
For example, if we want to make iterative development easier, we should not specify a function_node that exposes its implementation directly in the body code:
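```cpp
// A hypothetical fragment of what to avoid (ImagePtr, g, and the body are
// illustrative only): the whole implementation is typed into the node's body
// in FGA, so later edits to the generated C++ cannot be reimported.
tbb::flow::function_node<ImagePtr, ImagePtr> n(g, tbb::flow::unlimited,
  [](ImagePtr img) -> ImagePtr {
    // ... many lines of real image-processing code live here ...
    return img;
  });
```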
Instead, we should specify only the interface and redirect to an implementation that can be maintained separately:
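```cpp
// The same node, but the body only forwards to a function maintained by hand
// in a separate header (my_node_impl.h and processImage are hypothetical names):
#include "my_node_impl.h"

tbb::flow::function_node<ImagePtr, ImagePtr> n(g, tbb::flow::unlimited,
  [](ImagePtr img) -> ImagePtr { return processImage(img); });
```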
If we take this constrained approach, we can often maintain the graph design in FGA and its GraphML representation, iteratively tuning the topology and node properties without losing any node body implementation changes we make outside of the tool. Whenever we generate new C++ code from FGA, we simply include the most up-to-date implementation header and the node bodies use these implementations that are maintained outside of the tool.
Flow Graph Analyzer does not require us to use this approach of course, but it is good practice if we want to use the code generation features of FGA as more than a simple code wizard.
The FGA Analysis Workflow
The analysis workflow in FGA is independent of the design workflow. While we can surely analyze a flow graph that was designed in FGA, we can just as easily analyze a TBB flow graph that is designed and implemented outside of the tool. This is possible because the TBB library is instrumented to provide runtime events to the FGA trace collector. A trace collected from a TBB application lets FGA reconstruct the graph structure and the timeline of the node body executions – it does not depend on the GraphML files developed during the design workflow.
If we want to use FGA to analyze a TBB application that uses a flow graph, the first step is to collect an FGA trace. By default, TBB does not generate traces, so we need to activate trace collection. The FGA instrumentation in TBB was a preview feature prior to TBB 2019. We need to take extra steps if we are using an older version of TBB. We refer readers to the FGA documentation for instructions on how to collect traces for the version of TBB and FGA that they are using.
The tree-map view labeled as (1) in Figure 17-36 provides an overview of the overall health of a graph. In the tree map, the area of each rectangle represents the total aggregated CPU time of the node and the color of each square indicates the concurrency observed during the execution of the node. The concurrency information is categorized as poor (red), ok (orange), good (green), and oversubscribed (blue).
Nodes with a large area that are marked as “poor” are hotspots and have an average concurrency between 0% and 25% of the hardware concurrency. These are therefore good candidates for optimization. The tree-map view also serves as an index into a large graph; clicking on a square will highlight the node in the graph and selecting this highlighted node will in turn mark tasks from all instances of this node in the timeline trace view.
The graph topology canvas is synchronized with other views in the tool. Selecting a node in the tree-map view, the timeline, or in a data analytics report will highlight the node in the canvas. This lets users quickly relate performance data to the graph structure.
One of the most important analytic reports provided by FGA is the list of critical paths in a graph. This feature is particularly useful when one has to analyze a large and complex graph. Computing the critical paths results in a list of nodes that form the critical paths as shown in the region labeled (2) in Figure 17-36. As we discussed in Chapter 3, an upper bound on speedup of dependency graphs can be quickly computed by dividing the aggregate total time spent by all nodes in a graph by the time spent on the longest critical path, T1/T∞. This upper bound can be used to set expectations on the potential speedup for an application expressed as a graph.
The timeline and concurrency view labeled as (3) in Figure 17-36 displays the raw traces in swim lanes mapped to software threads. Using this trace information, FGA computes additional derived data such as the average concurrency of each node and the concurrency histogram over time for the graph execution. Above the per-thread swim lanes, a histogram shows how many nodes are active at that point in time. This view lets users quickly identify time regions with low concurrency. Clicking on nodes in the timelines during these regions of low concurrency lets developers find the structures in their graph that lead to these bottlenecks.
Diagnosing Performance Issues with FGA
In this chapter, we discussed a number of potential performance issues that can arise when using a flow graph. In this section, we briefly discuss how FGA can be used to explore these issues in a TBB-based application.
Diagnosing Granularity Issues with FGA
Just like with our TBB generic loop algorithms, we need to be concerned about tasks that are too small to profit from parallelization. But we need to balance this concern with the need to create enough tasks to allow our workload to scale. In particular, as we discussed in Chapter 3, scalability can be limited by serial nodes if they become a bottleneck in the computation.
In contrast, there are regions in Figure 17-37 where smaller tasks, named n, are executed in parallel. By their coloring, it appears these are close to the 1 microsecond threshold, and consequently we can see gaps in the timelines during this region, indicating that there may be some non-negligible scheduling overheads involved. In this case, it may benefit us to merge nodes or to use a lightweight policy if possible to decrease overheads.
Recognizing Slow Copies in FGA
When using FGA to analyze our flow graph applications, gaps in the timeline indicate inefficiencies that need to be further investigated. In this section, they indicated costly copies between nodes, and in the previous section they indicated that the overhead of scheduling was large compared to the task sizes. In both cases, these gaps should prompt us to look for ways to improve performance.
Diagnosing Moonlighting Using FGA
In a Stacked View, we see all of the nested tasks that a thread is executing, including those that come from flow graph nodes and those that come from TBB Generic Parallel Algorithms. If we see that a thread executes two nodes concurrently, it is moonlighting. In Figure 17-39, for example, we see that Thread 0 starts executing another instance of node n0 inside of an existing instance of n0. From our previous discussions about moonlighting, we know this can happen if a thread steals work while it is waiting for a nested parallel algorithm to complete. The Stacked View in Figure 17-39 lets us easily see that a nested parallel_for, labeled p8, is the culprit in this case.
Using the timeline views from FGA, we can identify when threads are moonlighting simply by noticing a thread’s overlapped participation in more than one region or node. As developers, and possibly through other interactions with FGA, we then need to determine if the moonlighting is benign or needs to be addressed by TBB’s isolation features.
Summary
The flow graph API is a flexible and powerful interface for creating dependency and data flow graphs. In this chapter, we discussed some of the more advanced considerations in using the TBB flow graph high-level execution interface. Because it is implemented on top of TBB tasks, it shares the composability and optimization features supported by TBB tasks. We discussed how these can be used to optimize for granularity, effective cache and memory use, and the creation of sufficient parallelism. We then listed some dos and don’ts that can be helpful when first exploring the flow graph interfaces. Finally, we provided a brief overview of the Flow Graph Analyzer (FGA), a tool available in Intel Parallel Studio XE that has support for the graphical design and analysis of TBB flow graphs.
For More Information
Michael Voss, “The Intel Threading Building Blocks Flow Graph,” Dr. Dobb’s, October 5, 2011. www.drdobbs.com/tools/the-intel-threading-building-blocks-flow/231900177 .
Vasanth Tovinkere, Pablo Reble, Farshad Akhbari and Palanivel Guruvareddiar, “Driving Code Performance with Intel Advisor’s Flow Graph Analyzer,” Parallel Universe Magazine, https://software.seek.intel.com/driving-code-performance .
Richard Friedman, “Intel Advisor’s TBB Flow Graph Analyzer: Making Complex Layers of Parallelism More Manageable,” Inside HPC, December 14, 2017, https://insidehpc.com/2017/12/intel-flow-graph-analyzer/ .
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if you modified the licensed material. You do not have permission under this license to share adapted material derived from this chapter or parts of it.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.