We have spent the entire book promoting the art of writing our own code. Now we finally acknowledge that some great programmers have already written code that we can just use. Libraries are the best way to get our work done. This is not a case of being lazy—it is a case of having better things to do than reinvent the work of others. This is a puzzle piece worth having.
The open source DPC++ project includes some libraries. These libraries can help us continue to use libstdc++, libc++, and MSVC library functions even within our kernel code. The libraries are included as part of DPC++ and the oneAPI products from Intel. These libraries are not tied to the DPC++ compiler so they can be used with any SYCL compiler.
The DPC++ library provides an alternative for programmers who create heterogeneous applications and solutions. Its APIs are based on familiar standards—C++ STL, Parallel STL (PSTL), and SYCL—to provide high-productivity APIs to programmers. This can minimize programming effort across CPUs, GPUs, and FPGAs while leading to high-performance parallel applications that are portable.
The SYCL standard defines a rich set of built-in functions, usable in both host and device code, that are worth considering as well. DPC++ and many SYCL implementations implement key math built-ins with math libraries.
The libraries and built-ins discussed within this chapter are compiler agnostic. In other words, they are equally applicable to DPC++ compilers or SYCL compilers. The one exception is the fpga_device_policy class, which is a DPC++ feature for FPGA support.
Since there is overlap in naming and functionality, this chapter will start with a brief introduction to the SYCL built-in functions.
Built-In Functions
Floating-point math functions: asin, acos, log, sqrt, floor, etc., listed in Figure 18-2.
Integer functions: abs, max, min, etc., listed in Figure 18-3.
Common functions: clamp, smoothstep, etc., listed in Figure 18-4.
Geometric functions: cross, dot, distance, etc., listed in Figure 18-5.
Relational functions: isequal, isless, isfinite, etc., listed in Figure 18-6.
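To make a few of the listed built-ins concrete, here is a minimal host-only sketch in standard C++ of what clamp, smoothstep, dot, and cross compute for scalar floats and three-element vectors. The ref_ function names and the vec3 alias are hypothetical helpers introduced only for illustration. They are not the SYCL functions themselves, which also accept vector types and follow the precision rules of the SYCL specification.

```cpp
#include <algorithm>
#include <array>

// Hypothetical stand-in for a three-element float vector.
using vec3 = std::array<float, 3>;

// clamp: limit x to the range [lo, hi].
inline float ref_clamp(float x, float lo, float hi) {
  return std::min(std::max(x, lo), hi);
}

// smoothstep: 0 at or below edge0, 1 at or above edge1,
// and a smooth Hermite interpolation in between.
inline float ref_smoothstep(float edge0, float edge1, float x) {
  const float t = ref_clamp((x - edge0) / (edge1 - edge0), 0.0f, 1.0f);
  return t * t * (3.0f - 2.0f * t);
}

// dot: sum of element-wise products.
inline float ref_dot(const vec3& a, const vec3& b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// cross: vector perpendicular to both a and b.
inline vec3 ref_cross(const vec3& a, const vec3& b) {
  return {a[1] * b[2] - a[2] * b[1],
          a[2] * b[0] - a[0] * b[2],
          a[0] * b[1] - a[1] * b[0]};
}
```

In kernel code we would call the corresponding sycl:: built-ins rather than writing helpers like these.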
In addition to the data types supported in SYCL, the DPC++ device library provides support for std::complex as a data type and the corresponding math functions defined in the C++ std library.
Use the sycl:: Prefix with Built-In Functions
The SYCL built-in functions should be invoked with an explicit sycl:: prepended to the name. With the current SYCL specification, calling just sqrt() is not guaranteed to invoke the SYCL built-in on all implementations even if “using namespace sycl;” has been used.
SYCL built-in functions should always be invoked with an explicit sycl:: in front of the built-in name. Failure to follow this advice may result in strange and non-portable results.
DPC++ Library
A set of tested C++ standard APIs—we simply need to include the corresponding C++ standard header files and use the std namespace.
Parallel STL, which comes with its own header files. We simply use #include <dpstd/...> to include them. The DPC++ library uses namespace dpstd for the extended API classes and functions.
Standard C++ APIs in DPC++
Figure 18-8 lists C++ standard APIs with “Y” to indicate those that have been tested for use in DPC++ kernels for CPU, GPU, and FPGA devices, at the time of this writing. A blank indicates incomplete coverage (not all three device types) at the time of publication for this book. A table is also included as part of the online DPC++ language reference guide and will be updated over time, as library support in DPC++ continues to expand.
The tested standard C++ APIs are also supported for the host CPU in libstdc++ (GNU) with gcc 7.4.0, libc++ (LLVM) with clang 10.0, and the MSVC Standard C++ Library with Microsoft Visual Studio 2017.
On Linux, GNU libstdc++ is the default C++ standard library for the DPC++ compiler, so no compilation or linking option is required. If we want to use libc++, use the compile options -stdlib=libc++ -nostdinc++ to leverage libc++ and to not include C++ std headers from the system. The DPC++ compiler has been verified using libc++ in DPC++ kernels on Linux, but the DPC++ runtime needs to be rebuilt with libc++ instead of libstdc++. Details are in https://intel.github.io/llvm-docs/GetStartedGuide.html#build-dpc-toolchain-with-libc-library. Because of these extra steps, libc++ is not the recommended C++ standard library for us to use in general.
On FreeBSD, libc++ is the default standard library, and the -stdlib=libc++ option is not required. More details are in https://libcxx.llvm.org/docs/UsingLibcxx.html. On Windows, only the MSVC C++ library can be used.
If a std function is not marked with “Y” in Figure 18-8, it has not been tested on all three device types, so we need to keep cross-architecture portability in mind when we use it in device functions!
DPC++ Parallel STL
Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in the ISO/IEC 14882:2017 standard, commonly called C++17. The existing implementation also supports the unsequenced execution policy specified in Parallelism TS version 2 and proposed for the next version of the C++ standard in the C++ working group paper P1001R1.
When using algorithms and execution policies, specify the namespace std::execution if there is no vendor-specific implementation of the C++17 standard library or pstl::execution otherwise.
Execution Policy | Meaning |
---|---|
seq | Sequential execution. |
unseq | Unsequenced SIMD execution. This policy requires that all functions provided are safe to execute in SIMD. |
par | Parallel execution by multiple threads. |
par_unseq | Combined effect of unseq and par. |
Parallel STL for DPC++ is extended with support for DPC++ devices using special execution policies. The DPC++ execution policy specifies where and how a Parallel STL algorithm runs. It inherits a standard C++ execution policy, encapsulates a SYCL device or queue, and allows us to set an optional kernel name. DPC++ execution policies can be used with all standard C++ algorithms that support execution policies according to the C++17 standard.
DPC++ Execution Policy
1. Add #include <dpstd/execution> into our code.
2. Create a policy object by providing a standard policy type, a class type for a unique kernel name as a template argument (optional), and one of the following constructor arguments:
   - A SYCL queue
   - A SYCL device
   - A SYCL device selector
   - An existing policy object with a different kernel name
3. Pass the created policy object to a Parallel STL algorithm.
A dpstd::execution::default_policy object is a predefined device_policy created with a default kernel name and default queue. This can be used to create custom policy objects or passed directly when invoking an algorithm if the default choices are sufficient.
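The steps above can be sketched as follows. This is a hedged illustration, not the book's figure: it assumes a DPC++ toolchain from the era of this text that ships the dpstd headers and accepts kernel names declared inline. The function name policy_sketch, the enum sketch_result, the macro SKETCH_HAVE_DPSTD, and the kernel names PolicyA, PolicyC, and PolicyD are hypothetical. The header guard only lets the file compile, as a skipped no-op, where those headers are not installed.

```cpp
#if __has_include(<CL/sycl.hpp>) && __has_include(<dpstd/execution>)
#include <dpstd/execution>
#include <dpstd/algorithm>
#include <CL/sycl.hpp>
#define SKETCH_HAVE_DPSTD 1  // hypothetical macro: DPC++ library headers found
#endif
#include <algorithm>
#include <vector>

enum class sketch_result { passed, failed, skipped };

sketch_result policy_sketch() {
#ifdef SKETCH_HAVE_DPSTD
  using namespace dpstd::execution;
  std::vector<int> data(1024, 0);

  // Policy created from a SYCL queue, with a unique kernel name.
  auto policy_a =
      device_policy<parallel_unsequenced_policy, class PolicyA>{sycl::queue{}};
  // Policy created from a SYCL device selector.
  auto policy_c = device_policy<parallel_unsequenced_policy, class PolicyC>{
      sycl::default_selector{}};
  // Policy created from an existing policy, changing only the kernel name.
  auto policy_d = make_device_policy<class PolicyD>(default_policy);

  // Pass each policy to a Parallel STL algorithm. Host iterators are used,
  // so the data is copied back to the host after each call.
  std::fill(policy_a, data.begin(), data.end(), 2);
  std::for_each(policy_c, data.begin(), data.end(), [](int& x) { x += 1; });
  std::for_each(policy_d, data.begin(), data.end(), [](int& x) { x *= 2; });

  const bool ok = std::all_of(data.begin(), data.end(),
                              [](int x) { return x == 6; });
  return ok ? sketch_result::passed : sketch_result::failed;
#else
  return sketch_result::skipped;  // DPC++ library headers are not available
#endif
}
```

Giving each policy its own kernel name keeps the kernels launched by different algorithm calls distinct.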
FPGA Execution Policy
1. Define the _PSTL_FPGA_DEVICE macro to run on FPGA devices and additionally _PSTL_FPGA_EMU to run on an FPGA emulation device.
2. Add #include <dpstd/execution> to our code.
3. Create a policy object by providing a class type for a unique kernel name and an unroll factor (see Chapter 17) as template arguments (both optional) and one of the following constructor arguments:
   - A SYCL queue constructed for the FPGA selector (the behavior is undefined with any other device type)
   - An existing FPGA policy object with a different kernel name and/or unroll factor
4. Pass the created policy object to a Parallel STL algorithm.
The default constructor of fpga_device_policy creates an object with a SYCL queue constructed for fpga_selector, or for fpga_emulator_selector if _PSTL_FPGA_EMU is defined.
dpstd::execution::fpga_policy is a predefined object of the fpga_device_policy class created with a default kernel name and default unroll factor. Use it to create customized policy objects or pass it directly when invoking an algorithm.
Code in Figure 18-10 assumes using namespace dpstd::execution; for policies and using namespace sycl; for queues and device selectors.
Using DPC++ Parallel STL
#include <dpstd/algorithm>
#include <dpstd/numeric>
#include <dpstd/memory>
dpstd::begin and dpstd::end are special helper functions that allow us to pass SYCL buffers to Parallel STL algorithms. These functions accept a SYCL buffer and return an object of an unspecified type that satisfies the following requirements:
Is CopyConstructible, CopyAssignable, and comparable with operators == and !=.
The following expressions are valid: a + n, a - n, and a - b, where a and b are objects of the type and n is an integer value.
Has a get_buffer method with no arguments. The method returns the SYCL buffer passed to dpstd::begin and dpstd::end functions.
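To see what these requirements amount to, here is a small standard C++ sketch of a type that satisfies them. It is not the type returned by dpstd::begin and dpstd::end, which is unspecified. The names buffer_position, position_begin, and position_end are hypothetical, and any container with a size() method (such as std::vector) stands in for a SYCL buffer.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in: a position inside a buffer-like object.
template <typename Buffer>
class buffer_position {
 public:
  buffer_position(Buffer& buf, std::ptrdiff_t index)
      : buf_(&buf), index_(index) {}

  // Comparable with operators == and !=.
  friend bool operator==(const buffer_position& a, const buffer_position& b) {
    return a.buf_ == b.buf_ && a.index_ == b.index_;
  }
  friend bool operator!=(const buffer_position& a, const buffer_position& b) {
    return !(a == b);
  }

  // Supports a + n, a - n, and a - b.
  friend buffer_position operator+(const buffer_position& a, std::ptrdiff_t n) {
    return buffer_position(*a.buf_, a.index_ + n);
  }
  friend buffer_position operator-(const buffer_position& a, std::ptrdiff_t n) {
    return buffer_position(*a.buf_, a.index_ - n);
  }
  friend std::ptrdiff_t operator-(const buffer_position& a,
                                  const buffer_position& b) {
    return a.index_ - b.index_;
  }

  // Returns the buffer that this position was created from.
  Buffer get_buffer() const { return *buf_; }

 private:
  Buffer* buf_;
  std::ptrdiff_t index_;
};

// Hypothetical analogs of the begin and end helper functions.
template <typename Buffer>
buffer_position<Buffer> position_begin(Buffer& buf) {
  return buffer_position<Buffer>(buf, 0);
}
template <typename Buffer>
buffer_position<Buffer> position_end(Buffer& buf) {
  return buffer_position<Buffer>(buf, static_cast<std::ptrdiff_t>(buf.size()));
}

// The first requirement, checked at compile time.
using example_position = buffer_position<std::vector<int>>;
static_assert(std::is_copy_constructible<example_position>::value,
              "must be CopyConstructible");
static_assert(std::is_copy_assignable<example_position>::value,
              "must be CopyAssignable");
```

Because only these operations are guaranteed, code that uses the real helpers should not assume the returned objects can be dereferenced on the host.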
Parallel STL algorithms can be called with ordinary (host-side) iterators, as seen in the code example in Figure 18-11.
In this case, a temporary SYCL buffer is created, and the data is copied to this buffer. After processing of the temporary buffer on a device is complete, the data is copied back to the host. Working directly with existing SYCL buffers, where possible, is recommended to reduce data movement between the host and device and any unnecessary overhead of buffer creations and destructions.
Figure 18-13 shows an example which performs a binary search of the input sequence for each of the values in the search sequence provided. As the result of a search for the ith element of the search sequence, a Boolean value indicating whether the search value was found in the input sequence is assigned to the ith element of the result sequence. The algorithm returns an iterator that points to one past the last element of the result sequence that was assigned a result. The algorithm assumes that the input sequence has been sorted by the comparator provided. If no comparator is provided, then a function object that uses operator< to compare the elements will be used.
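The semantics just described can be pinned down with a host-only reference written against the C++ standard library. This sketch is not the dpstd::binary_search implementation and runs sequentially on the host. The name reference_binary_search is hypothetical, and the parameter order is an assumption chosen to mirror the description.

```cpp
#include <algorithm>
#include <functional>

// For each value in [s_first, s_last), write true to the result sequence if
// the value is found in the sorted range [first, last), false otherwise.
// Returns an iterator one past the last result that was written.
template <typename ForwardIt, typename SearchIt, typename OutputIt,
          typename Compare = std::less<>>
OutputIt reference_binary_search(ForwardIt first, ForwardIt last,
                                 SearchIt s_first, SearchIt s_last,
                                 OutputIt result, Compare comp = Compare{}) {
  for (; s_first != s_last; ++s_first, ++result) {
    // The input must already be sorted by comp; operator< is the default.
    *result = std::binary_search(first, last, *s_first, comp);
  }
  return result;
}
```

For the sorted input 0, 2, 2, 2, 3, 3, 3, 3, 6, 6 and the search values 0, 2, 4, 7, 6, this reference produces true, true, false, false, true.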
The complexity of the preceding description highlights that we should leverage library functions where possible, instead of writing our own implementations of similar algorithms which may take significant debugging and tuning time. Authors of the libraries that we can take advantage of are often experts in the internals of the device architectures to which they are coding, and may have access to information that we do not, so we should always leverage optimized libraries when they are available.
Create DPC++ iterators.
Create a named policy from an existing policy.
Invoke the parallel algorithm.
The example in Figure 18-13 uses the dpstd::binary_search algorithm to perform binary search on a CPU, GPU, or FPGA, based on our device selection.
Using Parallel STL with USM
Through USM pointers
Through USM allocators
If we have a USM allocation, we can pass the pointers to the start and (one past the) end of the allocation to a parallel algorithm. It is important to be sure that the execution policy and the allocation itself were created for the same queue or context, to avoid undefined behavior at runtime.
If the same allocation is to be processed by several algorithms, either use an in-order queue or explicitly wait for completion of each algorithm before using the same allocation in the next one (this is typical operation ordering when using USM). Also wait for completion before accessing the data on the host, as shown in Figure 18-14.
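A minimal sketch of this ordering is shown below. It assumes a DPC++ toolchain from the era of this text that provides the dpstd headers, and it uses a shared USM allocation so that the host can check the result. The function name usm_sketch, the enum sketch_result, the macro SKETCH_HAVE_DPSTD, and the kernel names are hypothetical. The header guard only lets the file compile, as a skipped no-op, where those headers are not installed.

```cpp
#if __has_include(<CL/sycl.hpp>) && __has_include(<dpstd/execution>)
#include <dpstd/execution>
#include <dpstd/algorithm>
#include <CL/sycl.hpp>
#define SKETCH_HAVE_DPSTD 1  // hypothetical macro: DPC++ library headers found
#endif
#include <algorithm>

enum class sketch_result { passed, failed, skipped };

sketch_result usm_sketch() {
#ifdef SKETCH_HAVE_DPSTD
  sycl::queue q;  // default queue, not in-order
  const int n = 1024;

  // The allocation and both policies are created for the same queue.
  int* data = sycl::malloc_shared<int>(n, q);
  auto fill_policy = dpstd::execution::make_device_policy<class UsmFill>(q);
  auto incr_policy = dpstd::execution::make_device_policy<class UsmIncr>(q);

  std::fill(fill_policy, data, data + n, 78);
  q.wait();  // finish the first algorithm before reusing the allocation

  std::for_each(incr_policy, data, data + n, [](int& x) { x += 1; });
  q.wait();  // finish before the host reads the data

  const bool ok = std::all_of(data, data + n, [](int x) { return x == 79; });
  sycl::free(data, q);
  return ok ? sketch_result::passed : sketch_result::failed;
#else
  return sketch_result::skipped;  // DPC++ library headers are not available
#endif
}
```

With an in-order queue, the first wait between the two algorithms would not be needed, but the wait before host access would remain.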
Error Handling with DPC++ Execution Policies
As detailed in Chapter 5, the DPC++ error handling model supports two types of errors. With synchronous errors, the runtime throws exceptions, while asynchronous errors are only processed in a user-supplied error handler at specified times during program execution.
No exceptions are thrown explicitly by algorithms.
Exceptions thrown by the runtime on the host CPU, including DPC++ synchronous exceptions, are passed through to the caller.
DPC++ asynchronous errors are not handled by the Parallel STL, so must be handled (if any handling is desired) by the calling application.
To process DPC++ asynchronous errors, the queue associated with a DPC++ policy must be created with an error handler object. The predefined policy objects (default_policy and others) have no error handlers, so we should create our own policies if we need to process asynchronous errors.
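One way this could look is sketched below, assuming a DPC++ toolchain from the era of this text that provides the dpstd headers. The function name async_handler_sketch, the enum sketch_result, the macro SKETCH_HAVE_DPSTD, and the kernel name HandledFill are hypothetical, and what the handler does with each error (printing it) is only a placeholder. The header guard only lets the file compile, as a skipped no-op, where those headers are not installed.

```cpp
#if __has_include(<CL/sycl.hpp>) && __has_include(<dpstd/execution>)
#include <dpstd/execution>
#include <dpstd/algorithm>
#include <CL/sycl.hpp>
#define SKETCH_HAVE_DPSTD 1  // hypothetical macro: DPC++ library headers found
#endif
#include <algorithm>
#include <exception>
#include <iostream>
#include <vector>

enum class sketch_result { passed, failed, skipped };

sketch_result async_handler_sketch() {
#ifdef SKETCH_HAVE_DPSTD
  // Handler that the runtime invokes with any pending asynchronous errors.
  auto handler = [](sycl::exception_list errors) {
    for (const std::exception_ptr& e : errors) {
      try {
        std::rethrow_exception(e);
      } catch (const sycl::exception& ex) {
        std::cerr << "Asynchronous error: " << ex.what() << "\n";
      }
    }
  };

  // The queue carries the handler, and our own policy wraps that queue.
  sycl::queue q{sycl::default_selector{}, handler};
  auto policy = dpstd::execution::make_device_policy<class HandledFill>(q);

  std::vector<int> v(256, 0);
  std::fill(policy, v.begin(), v.end(), 7);
  q.wait_and_throw();  // a point at which asynchronous errors are processed

  const bool ok = std::all_of(v.begin(), v.end(),
                              [](int x) { return x == 7; });
  return ok ? sketch_result::passed : sketch_result::failed;
#else
  return sketch_result::skipped;  // DPC++ library headers are not available
#endif
}
```

Synchronous exceptions from the algorithm call itself still propagate to the caller, so a try/catch around the call remains our responsibility.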
Summary
The DPC++ library is a companion to the DPC++ compiler. It helps us with solutions for portions of our heterogeneous applications, using pre-built and tuned libraries for common functions and parallel patterns. The DPC++ library allows explicit use of the C++ STL API within kernels, it streamlines cross-architecture programming with Parallel STL algorithm extensions, and it increases the successful application of parallel algorithms with custom iterators. In addition to support for familiar libraries (libstdc++, libc++, MSVC), DPC++ also provides full support for SYCL built-in functions. This chapter overviewed options for leveraging the work of others instead of having to write everything ourselves, and we should use that approach wherever practical to simplify application development and often to realize superior performance.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.