© Intel Corporation 2021
J. Reinders et al.Data Parallel C++https://doi.org/10.1007/978-1-4842-5574-2_5

5. Error Handling

James Reinders1 , Ben Ashbaugh2, James Brodman3, Michael Kinsner4, John Pennycook5 and Xinmin Tian6
(1)
Beaverton, OR, USA
(2)
Folsom, CA, USA
(3)
Marlborough, MA, USA
(4)
Halifax, NS, Canada
(5)
San Jose, CA, USA
(6)
Fremont, CA, USA
 

../images/489625_1_En_5_Chapter/489625_1_En_5_Figa_HTML.gif

Agatha Christie wrote in 1969 that “human error is nothing to what a computer can do if it tries.” It is no mystery that we, as programmers, get to clean up the mess. The mechanisms for error handling could catch programmer errors that others may make. Since we do not plan on making mistakes ourselves, we can focus on using error handling to handle conditions that may occur in the real world from other causes.

Detecting and dealing with unexpected conditions and errors can be helpful during application development (think: the other programmer who works on the project who does make mistakes), but more importantly play a critical role in stable and safe production applications and libraries. We devote this chapter to describing the error handling mechanisms available in SYCL so that we can understand what our options are and how to architect applications if we care about detecting and managing errors.

This chapter overviews synchronous and asynchronous errors in SYCL, describes the behavior of an application if we do nothing in our code to handle errors, and dives into the SYCL-specific mechanism that allows us to handle asynchronous errors.

Safety First

A core aspect of C++ error handling is that if we do nothing to handle an error that has been detected (thrown), then the application will terminate and indicate that something went wrong. This behavior allows us to write applications without focusing on error management and still be confident that errors will somehow be signaled to a developer or user. We’re not suggesting that we should ignore error handling, of course! Production applications should be written with error management as a core part of the architecture, but applications often start development without such a focus. C++ aims to make code which doesn’t handle errors still able to observe errors, even when they are not dealt with explicitly.

Since SYCL is Data Parallel C++, the same philosophy holds: if we do nothing in our code to manage errors and an error is detected, an abnormal termination of the program will occur to let us know that something bad happened. Production applications should of course consider error management as a core part of the software architecture, not only reporting but often also recovering from error conditions.

If we don’t add any error management code and an error occurs, we will still see an abnormal program termination which is an indication to dig deeper.

Types of Errors

C++ provides a framework for notification and handling of errors through its exception mechanism. Heterogeneous programming requires an additional level of error management beyond this, because some errors occur on a device or when trying to launch work on a device. These errors are typically decoupled in time from the host program’s execution, and as such they don’t integrate cleanly with classic C++ exception handling mechanisms. To solve this, there are additional mechanisms to make asynchronous errors as manageable and controllable as regular C++ exceptions.

Figure 5-1 shows two components of a typical application: (1) the host code that runs sequentially and submits work to the task graph for future execution and (2) the task graph which runs asynchronously from the host program and executes kernels or other actions on devices when the necessary dependences are met. The example shows a parallel_for as the operation that executes asynchronously as part of the task graph, but other operations are possible as well as discussed in Chapters 3, 4, and 8.
../images/489625_1_En_5_Chapter/489625_1_En_5_Fig1_HTML.png
Figure 5-1

Separation of host program and task graph executions

The distinction between the left and right (host and task graph) sides of Figure 5-1 is the key to understanding the differences between synchronous and asynchronous errors.

Synchronous errors occur when an error condition can be detected as the host program executes an operation, such as an API call or object constructor. They can be detected before an instruction on the left side of the figure completes, and the error can be thrown by the operation that caused the error immediately. We can wrap specific instructions on the left side of the diagram with a try-catch construct, expecting that errors occurring as a result of operations within the try will be detected before the try block ends (and therefore caught). The C++ exception mechanism is designed to handle exactly these types of errors.

Alternatively, asynchronous errors occur as part of the right side of Figure 5-1, where an error is only detected when an operation in the task graph is executed. By the time that an asynchronous error is detected as part of task graph execution, the host program has typically already moved on with its execution, so there is no code to wrap with a try-catch construct to catch these errors. There is instead an asynchronous exception handling framework to handle these errors that occur at seemingly random times relative to host program execution.

Let’s Create Some Errors!

As examples for the remainder of this chapter and to allow us to experiment, we’ll create both synchronous and asynchronous errors in the following sections.
../images/489625_1_En_5_Chapter/489625_1_En_5_Fig2_HTML.png
Figure 5-2

Creating a synchronous error

Synchronous Error

In Figure 5-2, a sub-buffer is created from a buffer but with an illegal size (larger than the original buffer). The constructor of the sub-buffer detects this error and throws an exception before the constructor’s execution completes. This is a synchronous error because it occurs as part of (synchronously with) the host program’s execution. The error is detectable before the constructor returns, so the error may be handled immediately at its point of origin or detection in the host program.

Our code example doesn’t do anything to catch and handle C++ exceptions, so the default C++ uncaught exception handler calls std::terminate for us, signaling that something went wrong.

Asynchronous Error

Generating an asynchronous error is a bit trickier because implementations work hard to detect and report errors synchronously whenever possible. Synchronous errors are easier to debug because they occur at a specific point of origin in the host program, so are preferred whenever possible. One way to generate an asynchronous error for our demonstration purpose, though, is to add a fallback/secondary queue to our command group submission and to discard synchronous exceptions that also happen to be thrown. Figure 5-3 shows such code which invokes our handle_async_error function to allow us to experiment. Asynchronous errors can occur and be reported without a secondary/fallback queue, so note that the secondary queue is only part of the example and in no way a requirement for asynchronous errors.
../images/489625_1_En_5_Chapter/489625_1_En_5_Fig3_HTML.png
Figure 5-3

Creating an asynchronous error

Application Error Handling Strategy

The C++ exception features are designed to cleanly separate the point in a program where an error is detected from the point where it may be handled, and this concept fits very well with both synchronous and asynchronous errors in SYCL. Through the throw and catch mechanisms , a hierarchy of handlers can be defined which can be important in production applications.

Building an application that can handle errors in a consistent and reliable way requires a strategy up front and a resulting software architecture built for error management. C++ provides flexible tools to implement many alternative strategies, but such architecture is beyond the scope of this chapter. There are many books and other references devoted to this topic, so we encourage looking to them for full coverage of C++ error management strategies.

This said, error detection and reporting doesn’t always need to be production-scale. Errors in a program can be reliably detected and reported through minimal code if the goal is simply to detect errors during execution and to report them (but not necessarily to recover from them). The following sections cover first what happens if we ignore error handling and do nothing (the default behavior isn’t all that bad!), followed by recommended error reporting that is simple to implement in basic applications.

Ignoring Error Handling

C++ and SYCL are designed to tell us that something went wrong even when we don’t handle errors explicitly. The default result of unhandled synchronous or asynchronous errors is abnormal program termination which an operating system should tell us about. The following two examples mimic the behavior that will occur if we do not handle a synchronous and an asynchronous error, respectively.

Figure 5-4 shows the result of an unhandled C++ exception, which could be an unhandled SYCL synchronous error, for example. We can use this code to test what a particular operating system will report in such a case.
../images/489625_1_En_5_Chapter/489625_1_En_5_Fig4_HTML.png
Figure 5-4

Unhandled exception in C++

Figure 5-5 shows example output from std::terminate being called, which will be the result of an unhandled SYCL asynchronous error in our application. We can use this code to test what a particular operating system will report in such a case.
../images/489625_1_En_5_Chapter/489625_1_En_5_Fig5_HTML.png
Figure 5-5

std::terminate is called when a SYCL asynchronous exception isn’t handled

Although we probably should handle errors in our programs, since uncaught errors will be caught and the program terminated, we do not need to worry about a program silently failing!

Synchronous Error Handling

We keep this section very short because SYCL synchronous errors are just C++ exceptions. Most of the additional error mechanisms added in SYCL relate to asynchronous errors which we cover in the next section, but synchronous errors are important because implementations try to detect and report as many errors synchronously as possible, since they are easier to reason about and handle.

Synchronous errors defined by SYCL are a derived class from std::exception of type sycl::exception, which allows us to catch the SYCL errors specifically though a try-catch structure such as what we see in Figure 5-6.
../images/489625_1_En_5_Chapter/489625_1_En_5_Fig6_HTML.png
Figure 5-6

Pattern to catch sycl::exception specifically

On top of the C++ error handling mechanisms, SYCL adds a sycl::exception type for the exceptions thrown by the runtime. Everything else is standard C++ exception handling, so will be familiar to most developers.

A slightly more complete example is provided in Figure 5-7, where additional classes of exception are handled , as well as the program being ended by returning from main().
../images/489625_1_En_5_Chapter/489625_1_En_5_Fig7_HTML.png
Figure 5-7

Pattern to catch exceptions from a block of code

Asynchronous Error Handling

Asynchronous errors are detected by the SYCL runtime (or an underlying backend), and the errors occur independently of execution of commands in the host program. The errors are stored in lists internal to the SYCL runtime and only released for processing at specific points that the programmer can control. There are two topics that we need to discuss to cover handling of asynchronous errors:
  1. 1.

    The asynchronous handler that is invoked when there are outstanding asynchronous errors to process

     
  2. 2.

    When the asynchronous handler is invoked

     

The Asynchronous Handler

The asynchronous handler is a function that the application defines, which is registered with SYCL contexts and/or queues. At the times defined by the next section, if there are any unprocessed asynchronous exceptions that are available to be handled, then the asynchronous handler is invoked by the SYCL runtime and passed a list of these exceptions.

The asynchronous handler is passed to a context or queue constructor as a std::function and can be defined in ways such as a regular function, lambda, or functor, depending on our preference. The handler must accept a sycl::exception_list argument , such as in the example handler shown in Figure 5-8.
../images/489625_1_En_5_Chapter/489625_1_En_5_Fig8_HTML.png
Figure 5-8

Example asynchronous handler implementation defined as a lambda

In Figure 5-8, the std::rethrow_exception followed by catch of a specific exception type provides filtering of the type of exception, in this case to the only sycl::exception. We can also use alternative filtering approaches in C++ or just choose to handle all exceptions regardless of the type.

The handler is associated with a queue or context (low-level detail covered more in Chapter 6) at construction time. For example, to register the handler defined in Figure 5-8 with a queue that we are creating, we could write
queue my_queue{ gpu_selector{}, handle_async_error };
Likewise, to register the handler defined in Figure 5-8 with a context that we are creating, we could write
context my_context{ handle_async_error };

Most applications do not need contexts to be explicitly created or managed (they are created behind the scenes for us automatically), so if an asynchronous handler is going to be used, most developers should associate such handlers with queues that are being constructed for specific devices (and not explicit contexts).

In defining asynchronous handlers, most developers should define them on queues unless already explicitly managing contexts for other reasons.

If an asynchronous handler is not defined for a queue or the queue’s parent context and an asynchronous error occurs on that queue (or in the context) that must be processed, then the default asynchronous handler is invoked. The default handler operates as if it was coded as shown in Figure 5-9.
../images/489625_1_En_5_Chapter/489625_1_En_5_Fig9_HTML.png
Figure 5-9

Example of how the default asynchronous handler behaves

The default handler should display some information to the user on any errors in the exception list and then will terminate the application abnormally, which should also cause the operating system to report that termination was abnormal.

What we put within an asynchronous handler is up to us. It can range from logging of an error to application termination to recovery of the error condition so that an application can continue executing normally. The common case is to report any details of the error available by calling sycl::exception::what(), followed by termination of the application.

Although it’s up to us to decide what an asynchronous handler does internally, a common mistake is to print an error message (that may be missed in the noise of other messages from the program), followed by completion of the handler function. Unless we have error management principles in place that allow us to recover known program state and to be confident that it’s safe to continue execution, we should consider terminating the application within our asynchronous handler function(s). This reduces the chance that incorrect results will appear from a program where an error was detected, but where the application was inadvertently allowed to continue with execution regardless. In many programs, abnormal termination is the preferred result once we have experienced asynchronous exceptions.

Consider terminating applications within an asynchronous handler, after outputting information about the error, if comprehensive error recovery and management mechanisms are not in place.

Invocation of the Handler

The asynchronous handler is called by the runtime at specific times. Errors aren’t reported immediately as they occur because management of errors and safe application programming (particularly multithreaded) would become more difficult and expensive if that was the case. The asynchronous handler is instead called at the following very specific times:
  1. 1.

    When the host program calls queue::throw_asynchronous() on a specific queue

     
  2. 2.

    When the host program calls queue::wait_and_throw() on a specific queue

     
  3. 3.

    When the host program calls event::wait_and_throw() on a specific event

     
  4. 4.

    When a queue is destroyed

     
  5. 5.

    When a context is destroyed

     

Methods 1–3 provide a mechanism for a host program to control when asynchronous exceptions are handled, so that thread safety and other details specific to an application can be managed. They effectively provide controlled points at which asynchronous exceptions enter the host program control flow and can be processed almost as if they were synchronous errors.

If a user doesn’t explicitly call one of the methods 1–3, then asynchronous errors are commonly reported during program teardown when queues and contexts are destroyed. This is often enough to signal to a user that something went wrong and that program results shouldn’t be trusted.

Relying on error detection during program teardown doesn’t work in all cases, though. For example, if a program will only terminate when some algorithm convergence criteria are achieved and if those criteria are only achievable by successful execution of device kernels, then an asynchronous exception may signal that the algorithm will never converge and begin the teardown (where the error would be noticed). In these cases, and also in production applications where more complete error handling strategies are in place, it makes sense to invoke throw_asynchronous() or wait_and_throw() at regular and controlled points in the program (e.g., before checking whether algorithm convergence has occurred).

Errors on a Device

The error detection and handling mechanisms discussed in this chapter have been host-based. They are mechanisms through which the host program can detect and deal with something that may have gone wrong either in the host program or potentially during execution of kernels on devices. What we have not covered is how to signal, from within the device code that we write, that something has gone wrong. This omission is not a mistake, but quite intentional.

SYCL explicitly disallows C++ exception handling mechanisms (such as throw) within device code, because there are performance costs for some types of device that we usually don’t want to pay. If we detect that something has gone wrong within our device code, we should signal the error using existing non-exception-based techniques. For example, we could write to a buffer that logs errors or return some invalid result from our numeric calculation that we define to mean that an error occurred. The right strategy in these cases is very application specific.

Summary

In this chapter, we introduced synchronous and asynchronous errors, covered the default behavior to expect if we do nothing to manage errors that might occur, and covered the mechanisms used to handle asynchronous errors at controlled points in our application. Error management strategies are a major topic in software engineering and a significant percentage of the code written in many applications. SYCL integrates with the C++ knowledge that we already have when it comes to error handling and provides flexible mechanisms to integrate with whatever our preferred error management strategy is.

Creative Commons

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.