© Intel Corporation 2021
J. Reinders et al.Data Parallel C++https://doi.org/10.1007/978-1-4842-5574-2_2

2. Where Code Executes

James Reinders1 , Ben Ashbaugh2, James Brodman3, Michael Kinsner4, John Pennycook5 and Xinmin Tian6
(1)
Beaverton, OR, USA
(2)
Folsom, CA, USA
(3)
Marlborough, MA, USA
(4)
Halifax, NS, Canada
(5)
San Jose, CA, USA
(6)
Fremont, CA, USA
 

../images/489625_1_En_2_Chapter/489625_1_En_2_Figa_HTML.gif

Parallel programming is not really about driving in the fast lane. It is actually about driving fast in all the lanes. This chapter is all about enabling us to put our code everywhere that we can. We choose to enable all the compute resources in a heterogeneous system whenever it makes sense. Therefore, we need to know where those compute resources are hiding (find them) and put them to work (execute our code on them).

We can control where our code executes—in other words, we can control which devices are used for which kernels. SYCL provides a framework for heterogeneous programming in which code can execute on a mixture of a host CPU and devices. The mechanisms which determine where code executes are important for us to understand and use.

This chapter describes where code can execute, when it will execute, and the mechanisms used to control the locations of execution. Chapter 3 will describe how to manage data so it arrives where we are executing our code, and then Chapter 4 returns to the code itself and discusses the writing of kernels.

Single-Source

SYCL programs can be single-source, meaning that the same translation unit (typically a source file and its headers) contains both the code that defines the compute kernels to be executed on SYCL devices and also the host code that orchestrates execution of those kernels. Figure 2-1 shows these two code paths graphically, and Figure 2-2 provides an example application with the host and device code regions marked.

Combining both device and host code into a single-source file (or translation unit) can make it easier to understand and maintain a heterogeneous application. The combination also provides improved language type safety and can lead to more compiler optimizations of our code.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig1_HTML.png
Figure 2-1

Single-source code contains both host code (runs on CPU) and device code (runs on SYCL devices)

../images/489625_1_En_2_Chapter/489625_1_En_2_Fig2_HTML.png
Figure 2-2

Simple SYCL program

Host Code

Applications contain C++ host code, which is executed by the CPU(s) on which the operating system has launched the application. Host code is the backbone of an application that defines and controls assignment of work to available devices. It is also the interface through which we define the data and dependences that should be managed by the runtime.

Host code is standard C++ augmented with SYCL-specific constructs and classes that are designed to be implementable as a C++ library. This makes it easier to reason about what is allowed in host code (anything that is allowed in C++) and can simplify integration with build systems.

SYCL applications are standard C++ augmented with constructs that can be implemented as a C++ library.

A SYCL compiler may provide higher performance for a program by “understanding” these constructs.

The host code in an application orchestrates data movement and compute offload to devices, but can also perform compute-intensive work itself and can use libraries like any C++ application.

Device Code

Devices correspond to accelerators or processors that are conceptually independent from the CPU that is executing host code. An implementation must expose the host processor also as a device, as described later in this chapter, but the host processor and devices should be thought of as logically independent from each other. The host processor runs native C++ code, while devices run device code.

Queues are the mechanism through which work is submitted to a device for future execution. There are three important properties of device code to understand:
  1. 1.

    It executes asynchronously from the host code. The host program submits device code to a device, and the runtime tracks and starts that work only when all dependences for execution are satisfied (more on this in Chapter 3). The host program execution carries on before the submitted work is started on a device, providing the property that execution on devices is asynchronous to host program execution, unless we explicitly tie the two together.

     
  2. 2.

    There are restrictions on device code to make it possible to compile and achieve performance on accelerator devices. For example, dynamic memory allocation and runtime type information (RTTI) are not supported within device code, because they would lead to performance degradation on many accelerators. The small set of device code restrictions is covered in detail in Chapter 10.

     
  3. 3.

    Some functions and queries defined by SYCL are available only within device code, because they only make sense there, for example, work-item identifier queries that allow an executing instance of device code to query its position in a larger data-parallel range (described in Chapter 4).

     

In general, we will refer to work including device code that is submitted to the queue as actions. In Chapter 3, we will learn that actions include more than device code to execute; actions also include memory movement commands. In this chapter, since we are concerned with the device code aspect of actions, we will be specific in mentioning device code much of the time.

Choosing Devices

To explore the mechanisms that let us control where device code will execute, we’ll look at five use cases:
  • Method#1: Running device code somewhere, when we don’t care which device is used. This is often the first step in development because it is the simplest.

  • Method#2: Explicitly running device code on the host device, which is often used for debugging. The host device is guaranteed to be always available on any system.

  • Method#3: Dispatching device code to a GPU or another accelerator.

  • Method#4: Dispatching device code to a heterogeneous set of devices, such as a GPU and an FPGA.

  • Method#5: Selecting specific devices from a more general class of devices, such as a specific type of FPGA from a collection of available FPGA types.

Developers will typically debug their code as much as possible with Method#2 and only move to Methods #3–#5 when code has been tested as much as is practical with Method#2.

Method#1: Run on a Device of Any Type

When we don’t care where our device code will run, it is easy to let the runtime pick for us. This automatic selection is designed to make it easy to start writing and running code, when we don’t yet care about what device is chosen. This device selection does not take into account the code to be run, so should be considered an arbitrary choice which likely won’t be optimal.

Before talking about choice of a device, even one that the implementation has selected for us, we should first cover the mechanism through which a program interacts with a device: the queue.

Queues

A queue is an abstraction to which actions are submitted for execution on a single device. A simplified definition of the queue class is given in Figures 2-3 and 2-4. Actions are usually the launch of data-parallel compute, although other commands are also available such as manual control of data motion for when we want more control than the automatic movement provided by the runtime. Work submitted to a queue can execute after prerequisites tracked by the runtime are met, such as availability of input data . These prerequisites are covered in Chapters 3 and 8.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig3_HTML.png
Figure 2-3

Simplified definition of the constructors of the queue class

../images/489625_1_En_2_Chapter/489625_1_En_2_Fig4_HTML.png
Figure 2-4

Simplified definition of key member functions in the queue class

A queue is bound to a single device, and that binding occurs on construction of the queue. It is important to understand that work submitted to a queue is executed on the single device to which that queue is bound. Queues cannot be mapped to collections of devices because that would create ambiguity on which device should perform work. Similarly, a queue cannot spread the work submitted to it across multiple devices. Instead, there is an unambiguous mapping between a queue and the device on which work submitted to that queue will execute, as shown in Figure 2-5.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig5_HTML.png
Figure 2-5

A queue is bound to a single device. Work submitted to the queue executes on that device

Multiple queues may be created in a program, in any way that we desire for application architecture or programming style. For example, multiple queues may be created to each bind with a different device or to be used by different threads in a host program. Multiple different queues can be bound to a single device, such as a GPU, and submissions to those different queues will result in the combined work being performed on the device. An example of this is shown in Figure 2-6. Conversely, as we mentioned previously, a queue cannot be bound to more than one device because there must not be any ambiguity on where an action is being requested to execute. If we want a queue that will load balance work across multiple devices, for example, then we can create that abstraction in our code.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig6_HTML.png
Figure 2-6

Multiple queues can be bound to a single device

Because a queue is bound to a specific device, queue construction is the most common way in code to choose the device on which actions submitted to the queue will execute. Selection of the device when constructing a queue is achieved through a device selector abstraction and associated device_selector class .

Binding a Queue to a Device, When Any Device Will Do

Figure 2-7 is an example where the device that a queue should bind to is not specified. The trivial queue constructor that does not take any arguments (as in Figure 2-7) simply chooses some available device behind the scenes. SYCL guarantees that at least one device will always be available—namely, the host device. The host device can run kernel code and is an abstraction of the processor on which the host program is executing so is always present.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig7_HTML.png
Figure 2-7

Implicit default device selector through trivial construction of a queue

Using the trivial queue constructor is a simple way to begin application development and to get device code up and running. More control over selection of the device bound to a queue can be added as it becomes relevant for our application.

Method#2: Using the Host Device for Development and Debugging

The host device can be thought of as enabling the host CPU to act as if it was an independent device, allowing our device code to execute regardless of the accelerators available in a system. We always have some processor running the host program, so the host device is therefore always available to our application. The host device provides a guarantee that device code can always be run (no dependence on accelerator hardware) and has a few primary uses:
  1. 1.

    Development of device code on less capable systems that don’t have any accelerators: One common use is development and testing of device code on a local system, before deploying to an HPC cluster for performance testing and optimization.

     
  2. 2.

    Debugging of device code with non-accelerator tooling: Accelerators are often exposed through lower-level APIs that may not have debug tooling as advanced as is available for host CPUs. With this in mind, the host device is expected to support debugging using standard tools familiar to CPU developers.

     
  3. 3.

    Backup if no other devices are available, to guarantee that device code can be executed functionally: The host device implementation may not have performance as a primary goal, so should be considered as a functional backup to ensure that device code can always execute in any application, but not necessarily a path to performance.

     
The host device is functionally like a hardware accelerator device in that a queue can bind to it and it can execute device code. Figure 2-8 shows how the host device is a peer to other accelerators that might be available in a system. It can execute device code, just like a CPU, GPU, or FPGA, and can have one or more queues constructed that bind to it.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig8_HTML.png
Figure 2-8

The host device, which is always available, can execute device code like any accelerator

An application can choose to create a queue that is bound to the host device by explicitly passing host_selector to a queue constructor, as shown in Figure 2-9.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig9_HTML.png
Figure 2-9

Selecting the host device using the host_selector class

Even when not specifically requested (e.g., using host_selector), the host device might happen to be chosen by the default selector as occurred in the output in Figure 2-7.

A few variants of device selector classes are defined to make it easy for us to target a type of device. The host_selector is one example of these selector classes, and we’ll get into others in the coming sections.

Method#3: Using a GPU (or Other Accelerators)

GPUs are showcased in the next example, but any type of accelerator applies equally. To make it easy to target common classes of accelerators, devices are grouped into several broad categories, and SYCL provides built-in selector classes for them. To choose from a broad category of device type such as “any GPU available in the system,” the corresponding code is very brief, as described in this section.

Device Types

There are two main categories of devices to which a queue can be bound:
  1. 1.

    The host device, which has already been described.

     
  2. 2.

    Accelerator devices such as a GPU, an FPGA , or a CPU device, which are used to accelerate workloads in our applications.

     

Accelerator Devices

There are a few broad groups of accelerator types :
  1. 1.

    CPU devices

     
  2. 2.

    GPU devices

     
  3. 3.

    Accelerators, which capture devices that don’t identify as either a CPU device or a GPU device. This includes FPGA and DSP devices.

     

A device from any of these categories is easy to bind to a queue using built-in selector classes, which can be passed to queue (and some other class) constructors.

Device Selectors

Classes that must be bound to a specific device, such as the queue class, have constructors that can accept a class derived from device_selector . For example, the queue constructor is
queue( const device_selector &deviceSelector,
  const property_list &propList = {});
There are five built-in selectors for the broad classes of common devices:

default_selector

Any device of the implementation’s choosing.

host_selector

Select the host device (always available).

cpu_selector

Select a device that identifies itself as a CPU in device queries.

gpu_selector

Select a device that identifies itself as a GPU in device queries.

accelerator_selector

Select a device that identifies itself as an “accelerator,” which includes FPGAs.

One additional selector included in DPC++ (not available in SYCL) is available by including the header "CL/sycl/intel/fpga_extensions.hpp":

INTEL::fpga_selector

Select a device that identifies itself as an FPGA.

A queue can be constructed using one of the built-in selectors, such as

../images/489625_1_En_2_Chapter/489625_1_En_2_Figc_HTML.gif

Figure 2-10 shows a complete example using the cpu_selector, and Figure 2-11 shows the corresponding binding of a queue with an available CPU device.

Figure 2-12 shows an example using a variety of built-in selector classes and also demonstrates use of device selectors with another class (device) that accepts a device_selector on construction.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig10_HTML.png
Figure 2-10

CPU device selector example

../images/489625_1_En_2_Chapter/489625_1_En_2_Fig11_HTML.png
Figure 2-11

Queue bound to a CPU device available to the application

../images/489625_1_En_2_Chapter/489625_1_En_2_Fig12_HTML.png
Figure 2-12

Example device identification output from various classes of device selectors and demonstration that device selectors can be used for construction of more than just a queue (in this case, construction of a device class instance)

When Device Selection Fails

If a gpu_selector is used when creating an object such as a queue and if there are no GPU devices available to the runtime, then the selector throws a runtime_error exception. This is true for all device selector classes in that if no device of the required class is available, then a runtime_error exception is thrown. It is reasonable for complex applications to catch that error and instead acquire a less desirable (for the application/algorithm) device class as an alternative. Exceptions and error handling are discussed in more detail in Chapter 5.

Method#4: Using Multiple Devices

As shown in Figures 2-5 and 2-6, we can construct multiple queues in an application. We can bind these queues to a single device (the sum of work to the queues is funneled into the single device), to multiple devices, or to some combination of these. Figure 2-13 provides an example that creates one queue bound to a GPU and another queue bound to an FPGA. The corresponding mapping is shown graphically in Figure 2-14.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig13_HTML.png
Figure 2-13

Creating queues to both GPU and FPGA devices

../images/489625_1_En_2_Chapter/489625_1_En_2_Fig14_HTML.png
Figure 2-14

GPU + FPGA device selector example: One queue is bound to a GPU and another to an FPGA

Method#5: Custom (Very Specific) Device Selection

We will now look at how to write a custom selector. In addition to examples in this chapter, there are a few more examples shown in Chapter 12. The built-in device selectors are intended to let us get code up and running quickly. Real applications usually require specialized selection of a device, such as picking a desired GPU from a set of GPU types available in a system. The device selection mechanism is easily extended to arbitrarily complex logic, so we can write whatever code is required to choose the device that we prefer.

device_selector Base Class

All device selectors derive from the abstract device_selector base class and define the function call operator in the derived class:

../images/489625_1_En_2_Chapter/489625_1_En_2_Figb_HTML.png
Defining this operator in a class that derives from device_selector is all that is required to define any complexity of selection logic, once we know three things:
  1. 1.

    The function call operator is automatically called once for each device that the runtime finds as accessible to the application, including the host device.

     
  2. 2.

    The operator returns an integer score each time that it is invoked. The highest score across all available devices is the device that the selector chooses.

     
  3. 3.

    A negative integer returned by the function call operator means that the device being considered must not be chosen.

     

Mechanisms to Score a Device

We have many options to create an integer score corresponding to a specific device, such as the following:
  1. 1.

    Return a positive value for a specific device class.

     
  2. 2.

    String match on a device name and/or device vendor strings.

     
  3. 3.

    Anything we can imagine in code leading to an integer value, based on device or platform queries.

     
For example, one possible approach to select an Intel Arria family FPGA device is shown in Figure 2-15.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig15_HTML.png
Figure 2-15

Custom selector for Intel Arria FPGA device

Chapter 12 has more discussion and examples for device selection (Figures 12-2 and 12-3) and discusses the get_info method in more depth.

Three Paths to Device Code Execution on CPU

A potential source of confusion comes from the multiple mechanisms through which a CPU can have code executed on it, as summarized in Figure 2-16.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig16_HTML.png
Figure 2-16

SYCL mechanisms to execute on a CPU

The first and most obvious path to CPU execution is host code, which is either part of the single-source application (host code regions) or linked to and called from the host code such as a library function.

The other two available paths execute device code. The first CPU path for device code is through the host device, which was described earlier in this chapter. It is always available and is expected to execute the device code on the same CPU(s) that the host code is executing on.

A second path to execution of device code on a CPU is optional in SYCL and is a CPU accelerator device that is optimized for performance. This device is often implemented by a lower-level runtime such as OpenCL, so its availability can depend on drivers and other runtimes installed on the system. This philosophy is described by SYCL where the host device is intended to be debuggable with native CPU tools, while CPU devices may be built on implementations optimized for performance where native CPU debuggers are not available.

Although we don’t cover it in this book, there is a mechanism to enqueue regular CPU code (top part of Figure 2-16) when prerequisites in the task graph are satisfied. This advanced feature can be used to execute regular CPU code alongside device code in the task graph and is known as a host task.

Creating Work on a Device

Applications usually contain a combination of both host code and device code. There are a few class members that allow us to submit device code for execution, and because these work dispatch constructs are the only way to submit device code, they allow us to easily distinguish device code from host code.

The remainder of this chapter introduces some of the work dispatch constructs, with the goal to help us understand and identify the division between device code and host code that executes natively on the host processor.

Introducing the Task Graph

A fundamental concept in the SYCL execution model is a graph of nodes. Each node (unit of work) in this graph contains an action to be performed on a device, with the most common action being a data-parallel device kernel invocation. Figure 2-17 shows an example graph with four nodes, where each node can be thought of as a device kernel invocation.

The nodes in Figure 2-17 have dependence edges defining when it is legal for a node’s work to begin execution. The dependence edges are most commonly generated automatically from data dependences, although there are ways for us to manually add additional custom dependences when we want to. Node B in the graph, for example, has a dependence edge from node A. This edge means that node A must complete execution, and most likely (depending on specifics of the dependence) make generated data available on the device where node B will execute, before node B’s action is started. The runtime controls resolution of dependences and triggering of node executions completely asynchronously from the host program’s execution. The graph of nodes defining an application will be referred to in this book as the task graph and is covered in more detail in Chapter 3.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig17_HTML.png
Figure 2-17

The task graph defines actions to perform (asynchronously from the host program) on one or more devices and also dependences that determine when an action is safe to execute

../images/489625_1_En_2_Chapter/489625_1_En_2_Fig18_HTML.png
Figure 2-18

Submission of device code

Where Is the Device Code?

There are multiple mechanisms that can be used to define code that will be executed on a device, but a simple example shows how to identify such code. Even if the pattern in the example appears complex at first glance, the pattern remains the same across all device code definitions so quickly becomes second nature.

The code passed as the final argument to the parallel_for, defined as a lambda in Figure 2-18, is the device code to be executed on a device. The parallel_for in this case is the construct that lets us distinguish device code from host code. parallel_for is one of a small set of device dispatch mechanisms, all members of the handler class, that define the code to be executed on a device. A simplified definition of the handler class is given in Figure 2-19.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig19_HTML.png
Figure 2-19

Simplified definition of member functions in the handler class

In addition to calling members of the handler class to submit device code, there are also members of the queue class that allow work to be submitted. The queue class members shown in Figure 2-20 are shortcuts that simplify certain patterns, and we will see these shortcuts used in future chapters.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig20_HTML.png
Figure 2-20

Simplified definition of member functions in the queue class that act as shorthand notation for equivalent functions in the handler class

Actions

The code in Figure 2-18 contains a parallel_for , which defines work to be performed on a device. The parallel_for is within a command group (CG) submitted to a queue, and the queue defines the device on which the work is to be performed. Within the command group, there are two categories of code:
  1. 1.

    Exactly one call to an action that either queues device code for execution or performs a manual memory operation such as copy.

     
  2. 2.

    Host code that sets up dependences defining when it is safe for the runtime to start execution of the work defined in (1), such as creation of accessors to buffers (described in Chapter 3).

     
The handler class contains a small set of member functions that define the action to be performed when a task graph node is executed. Figure 2-21 summarizes these actions.
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig21_HTML.png
Figure 2-21

Actions that invoke device code or perform explicit memory operations

Only a single action from Figure 2-21 may be called within a command group (it is an error to call more than one), and only a single command group can be submitted to a queue per submit call. The result of this is that a single operation from Figure 2-21 exists per task graph node, to be executed when the node dependences are met and the runtime determines that it is safe to execute.

A command group must have exactly one action within it, such as a kernel launch or explicit memory operation.

The idea that code is executed asynchronously in the future is the critical difference between code that runs on the CPU as part of the host program and device code that will run in the future when dependences are satisfied. A command group usually contains code from each category, with the code that defines dependences running as part of the host program (so that the runtime knows what the dependences are) and device code running in the future once the dependences are satisfied.

There are three classes of code in Figure 2-22:
  1. 1.

    Host code: Drives the application, including creating and managing data buffers and submitting work to queues to form new nodes in the task graph for asynchronous execution.

     
  2. 2.

    Host code within a command group: This code is run on the processor that the host code is executing on and executes immediately, before the submit call returns. This code sets up the node dependences by creating accessors, for example. Any arbitrary CPU code can execute here, but best practice is to restrict it to code that configures the node dependences.

     
  3. 3.

    An action: Any action listed in Figure 2-21 can be included in a command group, and it defines the work to be performed asynchronously in the future when node requirements are met (set up by (2)).

     
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig22_HTML.png
Figure 2-22

Submission of device code

To understand when code in an application will run, note that anything passed to an action listed in Figure 2-21 that initiates device code execution, or an explicit memory operation listed in Figure 2-21, will execute asynchronously in the future when the DAG node dependences have been met. All other code runs as part of the host program immediately, as expected in typical C++ code.

Fallback

Usually a command group is executed on the command queue to which we have submitted it. However, there may be cases where the command group fails to be submitted to a queue (e.g., when the requested size of work is too large for the device’s limits) or when a successfully submitted operation is unable to begin execution (e.g., when a hardware device has failed). To handle such cases, it is possible to specify a fallback queue for the command group to be executed on. The authors don’t recommend this error management technique because it offers little control, and instead we recommend catching and managing the initial error as is described in Chapter 5. We briefly cover the fallback queue here because some people prefer the style and it is a well-known part of SYCL.

This style of fallback is for failed queue submissions for devices that are present on the machine. This is not a fallback mechanism to solve the problem of an accelerator not being present. On a system with no GPU device, the program in Figure 2-23 will throw an error at the Q declaration (attempted construction) indicating that “No device of requested type available.”
../images/489625_1_En_2_Chapter/489625_1_En_2_Fig23_HTML.png
Figure 2-23

Fallback queue example

The topic of fallback based on devices that are present will be discussed in Chapter 12.

Figure 2-23 shows code that will fail to begin execution on some GPUs, due to the requested size of the work-group. We can specify a secondary queue as a parameter to the submit function, and this secondary queue (the host device in this case) is used if the command group fails to be enqueued to the primary queue.

The fallback queue is enabled by passing a secondary queue to a submit call. The authors recommend catching the initial error and handling it, as described in Chapter 5, instead of using the fallback queue mechanism which offers less control.

Summary

In this chapter we provided an overview of queues, selection of the device with which a queue will be associated, and how to create custom device selectors. We also overviewed the code that executes on a device asynchronously when dependences are met vs. the code that executes as part of the C++ application host code. Chapter 3 describes how to control data movement.

Creative Commons

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.