J. Reinders et al., Data Parallel C++, https://doi.org/10.1007/978-1-4842-5574-2_18

18. Libraries

James Reinders¹, Ben Ashbaugh², James Brodman³, Michael Kinsner⁴, John Pennycook⁵ and Xinmin Tian⁶

(1) Beaverton, OR, USA
(2) Folsom, CA, USA
(3) Marlborough, MA, USA
(4) Halifax, NS, Canada
(5) San Jose, CA, USA
(6) Fremont, CA, USA


We have spent the entire book promoting the art of writing our own code. Now we finally acknowledge that some great programmers have already written code that we can just use. Libraries are the best way to get our work done. This is not a case of being lazy—it is a case of having better things to do than reinvent the work of others. This is a puzzle piece worth having.

The open source DPC++ project includes some libraries. These libraries can help us continue to use libstdc++, libc++, and MSVC library functions even within our kernel code. The libraries are included as part of DPC++ and the oneAPI products from Intel. These libraries are not tied to the DPC++ compiler so they can be used with any SYCL compiler.

The DPC++ library provides an alternative for programmers who create heterogeneous applications and solutions. Its APIs are based on familiar standards—C++ STL, Parallel STL (PSTL), and SYCL—to provide high-productivity APIs to programmers. This can minimize programming effort across CPUs, GPUs, and FPGAs while leading to high-performance parallel applications that are portable.

The SYCL standard also defines a rich set of built-in functions, available in both host and device code, that are worth considering. DPC++ and many SYCL implementations implement key math built-ins with math libraries.

The libraries and built-ins discussed within this chapter are compiler agnostic. In other words, they are equally applicable to DPC++ compilers or SYCL compilers. The exception is the fpga_device_policy class, which is a DPC++ feature for FPGA support.

Since there is overlap in naming and functionality, this chapter will start with a brief introduction to the SYCL built-in functions.

Built-In Functions

DPC++ provides a rich set of SYCL built-in functions for various data types. These built-in functions are available in the sycl namespace on host and device, with low-, medium-, and high-precision support for the target devices based on compiler options such as -mfma, -ffast-math, and -ffp-contract=fast provided by the DPC++ compiler. The built-in functions on host and device can be classified as follows:

Floating-point math functions: asin, acos, log, sqrt, floor, etc., listed in Figure 18-2.
Integer functions: abs, max, min, etc., listed in Figure 18-3.
Common functions: clamp, smoothstep, etc., listed in Figure 18-4.
Geometric functions: cross, dot, distance, etc., listed in Figure 18-5.
Relational functions: isequal, isless, isfinite, etc., listed in Figure 18-6.

If a function is provided by the C++ std library, as listed in Figure 18-8, as well as by SYCL as a built-in function, then DPC++ programmers are allowed to use either. Figure 18-1 demonstrates the C++ std::log function and the SYCL built-in sycl::log function for host and device; both functions produce the same numeric results. In the example, the built-in relational function sycl::isequal is used to compare the results of std::log and sycl::log.

Figure 18-1. Using std::log and sycl::log

In addition to the data types supported in SYCL, the DPC++ device library provides support for std::complex as a data type and the corresponding math functions defined in the C++ std library.

Use the sycl:: Prefix with Built-In Functions

The SYCL built-in functions should be invoked with an explicit sycl:: prepended to the name. With the current SYCL specification, calling just sqrt() is not guaranteed to invoke the SYCL built-in on all implementations even if “using namespace sycl;” has been used.

SYCL built-in functions should always be invoked with an explicit sycl:: in front of the built-in name. Failure to follow this advice may result in strange and non-portable results.

If a built-in function name conflicts with a non-templated function in our application, in many implementations (including DPC++), our function will prevail, thanks to C++ overload resolution rules that prefer a non-templated function over a templated one. However, if our code has a function name that is the same as a built-in name, the most portable thing to do is either avoid using namespace sycl; or make sure no actual conflict happens. Otherwise, some SYCL compilers will refuse to compile the code due to an unresolvable conflict within their implementation. Such a conflict will not be silent. Therefore, if our code compiles today, we can safely ignore the possibility of future problems.

Figure 18-2. Built-in math functions

Figure 18-3. Built-in integer functions

Figure 18-4. Built-in common functions

Figure 18-5. Built-in geometric functions

Figure 18-6. Built-in relational functions

DPC++ Library

The DPC++ library consists of the following components:

A set of tested C++ standard APIs—we simply need to include the corresponding C++ standard header files and use the std namespace.
Parallel STL that includes corresponding header files. We simply use #include <dpstd/...> to include them. The DPC++ library uses namespace dpstd for the extended API classes and functions.

Standard C++ APIs in DPC++

The DPC++ library contains a set of tested standard C++ APIs. The basic functionality for a number of C++ standard APIs has been developed so that these APIs can be employed in device kernels similar to how they are employed in code for a typical C++ host application. Figure 18-7 shows an example of how to use std::swap in device code.

Figure 18-7. Using std::swap in device code

We can use the following commands to build and run the program (assuming it resides in the stdswap.cpp file):

dpcpp -std=c++17 stdswap.cpp -o stdswap.exe

./stdswap.exe

The printed result is:

8, 9

9, 8

Figure 18-8 lists C++ standard APIs with “Y” to indicate those that have been tested for use in DPC++ kernels for CPU, GPU, and FPGA devices at the time of this writing. A blank indicates incomplete coverage (not all three device types) at the time of publication of this book. A table is also included in the online DPC++ language reference guide and will be updated over time as library support in DPC++ continues to expand.

In the DPC++ library, some C++ std functions are implemented based on their corresponding built-in functions on the device to achieve the same level of performance as the SYCL versions of these functions.

Figure 18-8. Library support with CPU/GPU/FPGA coverage (at time of book publication)

The tested standard C++ APIs are also supported on the host CPU in libstdc++ (GNU) with gcc 7.4.0, libc++ (LLVM) with clang 10.0, and the MSVC Standard C++ Library with Microsoft Visual Studio 2017.

On Linux, GNU libstdc++ is the default C++ standard library for the DPC++ compiler, so no compilation or linking option is required. If we want to use libc++, use the compile options -stdlib=libc++ -nostdinc++ to leverage libc++ and to not include C++ std headers from the system. The DPC++ compiler has been verified using libc++ in DPC++ kernels on Linux, but the DPC++ runtime needs to be rebuilt with libc++ instead of libstdc++. Details are in https://intel.github.io/llvm-docs/GetStartedGuide.html#build-dpc-toolchain-with-libc-library. Because of these extra steps, libc++ is not the recommended C++ standard library for us to use in general.

On FreeBSD, libc++ is the default standard library, and the -stdlib=libc++ option is not required. More details are in https://libcxx.llvm.org/docs/UsingLibcxx.html. On Windows, only the MSVC C++ library can be used.

If a std function is not marked with “Y” in Figure 18-8, we need to keep cross-architecture portability in mind when we use it in device functions!

DPC++ Parallel STL

Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in the ISO/IEC 14882:2017 standard, commonly called C++17. The existing implementation also supports the unsequenced execution policy specified in Parallelism TS version 2 and proposed for the next version of the C++ standard in the C++ working group paper P1001R1.

When using algorithms and execution policies, specify the namespace std::execution if there is no vendor-specific implementation of the C++17 standard library or pstl::execution otherwise.

For any of the implemented algorithms, we can pass one of the values seq, unseq, par, or par_unseq as the first parameter in a call to the algorithm to specify the desired execution policy. The policies have the following meanings:

Execution Policy	Meaning
seq	Sequential execution.
unseq	Unsequenced SIMD execution. This policy requires that all functions provided are safe to execute in SIMD.
par	Parallel execution by multiple threads.
par_unseq	Combined effect of unseq and par.

Parallel STL for DPC++ is extended with support for DPC++ devices using special execution policies. The DPC++ execution policy specifies where and how a Parallel STL algorithm runs. It inherits a standard C++ execution policy, encapsulates a SYCL device or queue, and allows us to set an optional kernel name. DPC++ execution policies can be used with all standard C++ algorithms that support execution policies according to the C++17 standard.

DPC++ Execution Policy

Currently, only the parallel unsequenced policy (par_unseq) is supported by the DPC++ library. Using a DPC++ execution policy takes three steps:

1. Add #include <dpstd/execution> into our code.
2. Create a policy object by providing a standard policy type, a class type for a unique kernel name as a template argument (optional), and one of the following constructor arguments:
   - A SYCL queue
   - A SYCL device
   - A SYCL device selector
   - An existing policy object with a different kernel name
3. Pass the created policy object to a Parallel STL algorithm.

A dpstd::execution::default_policy object is a predefined device_policy created with a default kernel name and default queue. This can be used to create custom policy objects or passed directly when invoking an algorithm if the default choices are sufficient.

Figure 18-9 shows examples that assume use of the using namespace dpstd::execution; directive when referring to policy classes and functions.

Figure 18-9. Creating execution policies

FPGA Execution Policy

The fpga_device_policy class is a DPC++ policy tailored to achieve better performance of parallel algorithms on FPGA hardware devices. Use the policy when running the application on FPGA hardware or an FPGA emulation device:

1. Define the _PSTL_FPGA_DEVICE macro to run on FPGA devices and additionally _PSTL_FPGA_EMU to run on an FPGA emulation device.
2. Add #include <dpstd/execution> to our code.
3. Create a policy object by providing a class type for a unique kernel name and an unroll factor (see Chapter 17) as template arguments (both optional) and one of the following constructor arguments:
   - A SYCL queue constructed for the FPGA selector (the behavior is undefined with any other device type)
   - An existing FPGA policy object with a different kernel name and/or unroll factor
4. Pass the created policy object to a Parallel STL algorithm.

The default constructor of fpga_device_policy creates an object with a SYCL queue constructed for fpga_selector, or for fpga_emulator_selector if _PSTL_FPGA_EMU is defined.

dpstd::execution::fpga_policy is a predefined object of the fpga_device_policy class created with a default kernel name and default unroll factor. Use it to create customized policy objects or pass it directly when invoking an algorithm.

Code in Figure 18-10 assumes using namespace dpstd::execution; for policies and using namespace sycl; for queues and device selectors.

Specifying an unroll factor for a policy enables loop unrolling in the implementation of algorithms. The default value is 1. To find out how to choose a better value, see Chapter 17.

Figure 18-10. Using FPGA policy

Using DPC++ Parallel STL

In order to use the DPC++ Parallel STL, we need to include Parallel STL header files by adding a subset of the following set of lines. These lines are dependent on the algorithms we intend to use:

#include <dpstd/algorithm>
#include <dpstd/numeric>
#include <dpstd/memory>

dpstd::begin and dpstd::end are special helper functions that allow us to pass SYCL buffers to Parallel STL algorithms. These functions accept a SYCL buffer and return an object of an unspecified type that satisfies the following requirements:

Is CopyConstructible, CopyAssignable, and comparable with operators == and !=.
The following expressions are valid: a + n, a - n, and a - b, where a and b are objects of the type and n is an integer value.
Has a get_buffer method with no arguments. The method returns the SYCL buffer passed to dpstd::begin and dpstd::end functions.

To use these helper functions, add #include <dpstd/iterators> to our code. See the code in Figures 18-11 and 18-12 using the std::fill function as examples that use the begin/end helpers.

Figure 18-11. Using std::fill

REDUCE DATA COPYING BETWEEN THE HOST AND DEVICE

Parallel STL algorithms can be called with ordinary (host-side) iterators, as seen in the code example in Figure 18-11.

In this case, a temporary SYCL buffer is created, and the data is copied to this buffer. After processing of the temporary buffer on a device is complete, the data is copied back to the host. Working directly with existing SYCL buffers, where possible, is recommended to reduce data movement between the host and device and any unnecessary overhead of buffer creations and destructions.

Figure 18-12. Using std::fill with default policy

Figure 18-13 shows an example which performs a binary search of the input sequence for each of the values in the search sequence provided. As the result of a search for the i-th element of the search sequence, a Boolean value indicating whether the search value was found in the input sequence is assigned to the i-th element of the result sequence. The algorithm returns an iterator that points to one past the last element of the result sequence that was assigned a result. The algorithm assumes that the input sequence has been sorted by the comparator provided. If no comparator is provided, then a function object that uses operator< to compare the elements will be used.

The complexity of the preceding description highlights that we should leverage library functions where possible, instead of writing our own implementations of similar algorithms which may take significant debugging and tuning time. Authors of the libraries that we can take advantage of are often experts in the internals of the device architectures to which they are coding, and may have access to information that we do not, so we should always leverage optimized libraries when they are available.

The code example shown in Figure 18-13 demonstrates the three typical steps when using a DPC++ Parallel STL algorithm:

1. Create DPC++ iterators.
2. Create a named policy from an existing policy.
3. Invoke the parallel algorithm.

Figure 18-13. Using binary_search

The example in Figure 18-13 uses the dpstd::binary_search algorithm to perform binary search on a CPU, GPU, or FPGA, based on our device selection.

Using Parallel STL with USM

The following examples describe two ways to use the Parallel STL algorithms in combination with USM:

Through USM pointers
Through USM allocators

If we have a USM allocation, we can pass the pointers to the start and (one past the) end of the allocation to a parallel algorithm. It is important to be sure that the execution policy and the allocation itself were created for the same queue or context, to avoid undefined behavior at runtime.

If the same allocation is to be processed by several algorithms, either use an in-order queue or explicitly wait for completion of each algorithm before using the same allocation in the next one (this is typical operation ordering when using USM). Also wait for completion before accessing the data on the host, as shown in Figure 18-14.

Alternatively, we can use std::vector with a USM allocator as shown in Figure 18-15.

Figure 18-14. Using Parallel STL with a USM pointer

Figure 18-15. Using Parallel STL with a USM allocator

Error Handling with DPC++ Execution Policies

As detailed in Chapter 5, the DPC++ error handling model supports two types of errors. With synchronous errors, the runtime throws exceptions, while asynchronous errors are only processed in a user-supplied error handler at specified times during program execution.

For Parallel STL algorithms executed with DPC++ policies, handling of all errors, synchronous or asynchronous, is the responsibility of the caller. Specifically:

No exceptions are thrown explicitly by algorithms.
Exceptions thrown by the runtime on the host CPU, including DPC++ synchronous exceptions, are passed through to the caller.
DPC++ asynchronous errors are not handled by the Parallel STL, so must be handled (if any handling is desired) by the calling application.

To process DPC++ asynchronous errors, the queue associated with a DPC++ policy must be created with an error handler object. The predefined policy objects (default_policy and others) have no error handlers, so we should create our own policies if we need to process asynchronous errors.

Summary

The DPC++ library is a companion to the DPC++ compiler. It helps us with solutions for portions of our heterogeneous applications, using pre-built and tuned libraries for common functions and parallel patterns. The DPC++ library allows explicit use of the C++ STL API within kernels, it streamlines cross-architecture programming with Parallel STL algorithm extensions, and it increases the successful application of parallel algorithms with custom iterators. In addition to support for familiar libraries (libstdc++, libc++, MSVC), DPC++ also provides full support for SYCL built-in functions. This chapter surveyed options for leveraging the work of others instead of having to write everything ourselves. We should use that approach wherever practical to simplify application development and, often, to realize superior performance.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.