21.1 Goals Revisited
21.2 Memory Model Evolution
21.3 Kernel Execution Control Evolution
21.4 Core Performance
21.5 Programming Environment
21.6 Future Outlook
You made it! We have arrived at the finishing line. In this final chapter, we will briefly review the goals that we have achieved through this book. Instead of drawing a conclusion, we will offer our vision for the future evolution of massively parallel processor architectures and how the advancements will impact parallel application development.
As we stated in Chapter 1, our primary goal is to teach you, the readers, how to program massively parallel processors. We promised that it would become easy once you develop the right insight and go about it the right way. In particular, we promised to focus on computational thinking skills that would enable you to think about problems in ways that are amenable to parallel computing.
We delivered on these promises through an introduction to performance considerations for CUDA (Chapter 6), three parallel patterns (Chapters 8, 9, and 10), two detailed application case studies (Chapters 11 and 12), and a chapter dedicated to computational thinking skills (Chapter 13). Through this process, we introduced the pertinent computer architecture knowledge needed to understand the hardware limitations that must be addressed in high-performance parallel programming. In particular, we focused on the memory bandwidth limitations that will remain the primary performance-limiting factor in massively parallel computing systems (Chapters 4, 5, 6, 8, 9, 10, 11, 12, and 13). We also introduced the concepts of floating-point precision/accuracy and numerical stability, and how they relate to parallel algorithms (Chapter 7). With these insights, high-performance parallel programming becomes a manageable process rather than a black art.
We stated that our second goal was to teach high-performance parallel programming styles that naturally avoid subtle correctness issues. To deliver on this promise, we showed that the simple data-parallel CUDA programming model (Chapters 3 and 4) based on barrier synchronization can be used to develop very high-performance applications. This disciplined way of parallel programming naturally avoids the subtle race conditions that plague many other parallel programming systems.
We promised to teach parallel programming styles that transparently scale across future hardware generations, which will be more and more parallel. With the CUDA threading model (Chapter 4), a massive number of thread blocks can be executed in any order relative to each other. Your application will be able to benefit from more parallel hardware coming in the future. We also presented algorithm techniques, such as tiling and cutoff, that allow your application to scale naturally to very large data sets (Chapters 8, 9, 10, 11, 12, and 13).
We promised to teach the programming skills in such a way that you will be able to apply them to other programming models and languages. To help you branch out to other programming models, we introduced OpenCL (Chapter 14), OpenACC (Chapter 15), Thrust (Chapter 16), CUDA FORTRAN (Chapter 17), C++ AMP (Chapter 18), and MPI-CUDA (Chapter 19). In each chapter, we explained how the programming model/language relates to CUDA and how you can apply the skills you learned based on CUDA to these models/languages.
We hope that you have enjoyed the book.
Now that we have reviewed our promises, we would like to share our view of the coming evolution of massively parallel processor architectures and how the advancements will likely impact application development. We hope that this outlook will help you peek into the future of parallel programming. Our comments are based on the new features in GPUs based on NVIDIA’s Kepler compute architecture, which arrived on the market as this book went to press.
Large virtual and physical address spaces. GPUs have traditionally used only a physical address space with up to 32 address bits, which limited the GPU DRAM to 4 gigabytes or less. This is because graphics applications have not demanded more than a few hundred megabytes of frame buffer and texture memory. This is in contrast to the 64-bit virtual space and 40+ bits of physical space that CPU programmers have been taking for granted for many years. However, more recent graphics applications have demanded more.
More recent GPU families such as Fermi and Kepler have adopted CPU-style virtual memory architecture with a 64-bit virtual address space and a physical address space of at least 40 bits. The obvious benefit is that Fermi and Kepler GPUs can incorporate more than 4 gigabytes of DRAM and that CUDA kernels can now operate on very large data sets, whether hosted entirely in on-board GPU DRAM, or by accessing mapped host memory.
The Fermi virtual memory architecture also lays the foundation for a potentially profound enhancement to the programming model. The CPU system physical memory and the GPU physical memory can now be mapped within a single, shared virtual address space [GNS 2009]. A shared global address space allows all variables in an application to have unique addresses. Such memory architecture, when exposed by programming tools and a runtime system to applications, can result in several major benefits.
First, new runtime systems can be designed to allow CPUs and GPUs to access the entire volume of application data under traditional protection models. Such a capability would allow applications to use a single pointer system to access application variables, removing a confusing aspect of the current CUDA programming model where developers must not dereference a pointer to the device memory in host functions.
These variables can reside in the CPU physical memory, the GPU physical memory, or even both. The runtime and hardware can implement data migration and coherence support like the GMAC system [GNS 2009]. If a CPU function dereferences a pointer and accesses a variable mapped to the GPU physical memory, the data access would still be serviced, but perhaps at a longer latency. Such a capability would allow CUDA programs to more easily call legacy libraries that have not been ported to GPUs. In the current CUDA memory architecture, the developer must manually transfer data from the device memory to the host memory before legacy library functions can process them on the CPU. GMAC is built on the current CUDA runtime API and gives the developer the option to either rely on the runtime system to service such accesses or to manually transfer data as a performance optimization. However, the GMAC system currently does not have a clean mechanism for supporting multiple GPUs. The new virtual memory capability would enable a much more elegant implementation.
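To make the single-pointer idea concrete, below is a minimal sketch of the kind of code such a runtime could enable. It is written with the managed-allocation call (cudaMallocManaged) that later CUDA releases introduced for this purpose; the kernel name and sizes are only illustrative.

    // Sketch: one pointer dereferenced by both host and device code; the runtime
    // migrates the data as needed (assumes a managed allocator such as cudaMallocManaged).
    __global__ void scaleKernel(float *data, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float *data;                                  // a single pointer, valid on host and device
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; i++) data[i] = 1.0f;   // host writes through the pointer
        scaleKernel<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
        cudaDeviceSynchronize();                      // make device writes visible to the host
        float sum = 0.0f;
        for (int i = 0; i < n; i++) sum += data[i];   // host reads it again, with no cudaMemcpy
        cudaFree(data);
        return (sum > 0.0f) ? 0 : 1;
    }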
Ultimately, the virtual memory capability will also enable a mechanism similar to the zero-copy feature in CUDA 2.2 to allow the GPU to directly access very large physical CPU system memories. In some application areas such as CAD, the CPU physical memory system may have hundreds of gigabytes of capacity. These physical memory systems are needed because the applications require the entire data set to be “in core.” It is currently infeasible for such applications to take advantage of GPU computing. With the ability to directly access very large CPU physical memories, it becomes feasible for GPUs to accelerate these applications.
The second potential benefit is that the shared global address space enables peer-to-peer direct data transfer between devices in a multidevice system. This is supported in CUDA 4.0 and later, using the GPUDirect™ feature. In older CUDA systems, devices must first transfer data to the host memory before delivering them to a peer device. A shared global address space enables the implementation of a runtime system that provides an API to directly transfer data from one device memory to another device memory. Ultimately, a runtime system can be designed to automate such transfers when devices reference data in each other’s memory, but still allow the use of explicit data transfer APIs as a performance optimization. In CUDA 5.0, it is possible to reference not only data on other GPUs within a multi-GPU system, but also data on GPUs in other nodes.
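A minimal sketch of such a direct transfer, using the peer access calls that CUDA 4.0 introduced, is shown below; the device numbers and buffer size are only illustrative.

    // Sketch: copying between two device memories without staging through the host.
    void peerCopyExample() {
        size_t bytes = 1 << 20;
        float *d0, *d1;
        cudaSetDevice(0); cudaMalloc(&d0, bytes);
        cudaSetDevice(1); cudaMalloc(&d1, bytes);

        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can device 0 access device 1's memory?
        if (canAccess) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);       // map device 1's memory into device 0's space
        }
        cudaMemcpyPeer(d0, 0, d1, 1, bytes);        // direct device-to-device copy

        cudaSetDevice(0); cudaFree(d0);
        cudaSetDevice(1); cudaFree(d1);
    }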
The third benefit is that one can implement I/O-related memory transfers directly in and out of the device memory. In older CUDA systems, I/O input data must first be transferred into the host memory before it can be copied into the device memory. The ability to directly transfer data in and out of the device memory can significantly reduce the copying cost and enhance the performance of applications that process large data sets.
Unified device memory space. In early CUDA memory models, constant memory, shared memory, local memory, and global memory form their own separate address spaces. The developer can use pointers into the global memory but not others. Starting with the Fermi architecture, these memories are parts of a unified address space. This makes it easier to abstract which memory contains a particular operand, allowing the programmer to deal with this only during allocation, and making it simpler to pass CUDA data objects into other procedures and functions, irrespective of which memory area they come from. It makes CUDA code modules much more “composable.” That is, a CUDA device function can now accept a pointer that may point to any of these memories. The code would run faster if a function argument pointer points to a shared memory location and slower if it points to a global memory location. The programmer can still perform manual data placement and transfers as a performance optimization. This capability will significantly reduce the cost of building production-quality CUDA libraries.
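As an illustration of this composability, the sketch below defines a device helper that accepts a pointer without knowing which memory it refers to; the kernel, block size, and array names are assumptions made for the example.

    // Sketch: one __device__ function usable with pointers into different memory spaces.
    __device__ float sumRange(const float *p, int n) {  // p may point to shared or global memory
        float s = 0.0f;
        for (int i = 0; i < n; i++) s += p[i];
        return s;
    }

    __global__ void blockSums(const float *gdata, float *out, int n) {
        __shared__ float tile[256];                     // assumes 256 threads per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? gdata[i] : 0.0f;
        __syncthreads();
        if (threadIdx.x == 0)
            out[blockIdx.x] = sumRange(tile, 256);      // the same helper would also accept gdata
    }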
Configurable caching and scratchpad. The shared memory in early CUDA systems served as programmer-managed scratch memory and increased the speed of applications where key data structures have localized, predictable access patterns. Starting with the Fermi architecture, the shared memory has been enhanced to a larger on-chip memory that can be configured to be partially cache memory and partially shared memory, which allows coverage of both predictable and less predictable access patterns to benefit from on-chip memory. This configurability allows programmers to apportion the resources according to the best fit for their application.
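The configuration is exposed through a runtime call; a minimal sketch is shown below, where tiledKernel and portedKernel are hypothetical kernels standing in for code with predictable and irregular access patterns, respectively.

    // Sketch: apportioning the Fermi on-chip memory between L1 cache and shared memory.
    __global__ void tiledKernel(float *d);   // predictable, tiled access pattern (hypothetical)
    __global__ void portedKernel(float *d);  // irregular accesses from ported CPU code (hypothetical)

    void configureCaches() {
        cudaFuncSetCacheConfig(tiledKernel,  cudaFuncCachePreferShared); // 48 KB shared / 16 KB L1
        cudaFuncSetCacheConfig(portedKernel, cudaFuncCachePreferL1);     // 16 KB shared / 48 KB L1
    }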
Applications in an early design stage that are ported directly from CPU code will benefit greatly from caching as the dominant part of the on-chip memory. This would further smooth the performance tuning process by increasing the level of “easy performance” when a developer ports a CPU application to a GPU.
Existing CUDA applications and those that have predictable access patterns will have the ability to increase their use of fast shared memory by a factor of three while retaining the same device “occupancy” they had on previous-generation devices. For CUDA applications whose performance or capabilities are limited by the size of the shared memory, the threefold increase in size will be a welcome improvement. For example, in stencil computations such as finite volume methods for computational fluid dynamics, the state loaded into the shared memory also includes “halo” elements from neighboring areas.
The relative portion of the halo decreases as the size of the tile increases. For the tile sizes that current shared memory capacities allow in 3D simulation models, the halo cells can be comparable in data size to the main data. This can significantly reduce the effectiveness of the shared memory, because a significant portion of the memory bandwidth is spent on loading the halo elements. For example, if the shared memory allows a thread block to load an 8³ (= 512) cell tile into the shared memory, with one layer of halo elements on every surface, only 6³ (= 216) cells, or less than half of the loaded cells, are the main data. The bandwidth spent on loading the halo elements is actually larger than that spent on the main data. A threefold increase in shared memory size allows some of these applications to use a more favorable tile size in which the halo accounts for a much smaller portion of the data in shared memory. In our example, the increased size would allow an 11³ (= 1,331) tile to be loaded by each thread block. With one layer of halo elements on each surface, a total of 9³ (= 729) cells, or more than half of the loaded elements, are main data. This significantly improves the memory bandwidth efficiency and the performance of the application.
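The arithmetic behind these fractions is simple enough to capture in a small helper, sketched below for a t × t × t tile with one layer of halo on every surface; this is a hypothetical illustration rather than code from the applications discussed.

    // Sketch: fraction of loaded cells that are main data, for a t*t*t tile with
    // one layer of halo on every surface (the interior is (t-2)^3 cells).
    double mainFraction(int t) {
        double interior = (double)(t - 2) * (t - 2) * (t - 2);
        return interior / ((double)t * t * t);
    }
    // mainFraction(8)  = 216.0 / 512.0   ≈ 0.42  (halo loads outweigh main-data loads)
    // mainFraction(11) = 729.0 / 1331.0  ≈ 0.55  (main data now dominates the loads)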
Enhanced atomic operations. The atomic operations in Fermi are much faster than those in previous CUDA systems, and the atomic operations in Kepler are faster still. In addition, the Kepler atomic operations are more general. Atomic operations are frequently used in random scatter computation patterns such as histograms. Faster atomic operations reduce the need for algorithm transformations such as prefix sum (Chapter 9) [SHZ 2007] and sorting [SHG 2009] for implementing such random scattering computations. These transformations tend to increase the number of kernel invocations needed to perform the target computation. Faster atomic operations can also reduce the need for involvement of the host CPU in algorithms that do collective operations or in which multiple thread blocks update shared data structures, and thus reduce the data transfer pressure between the CPU and the GPU.
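A minimal sketch of the kind of random-scatter kernel that benefits directly from faster atomics is shown below; it assumes 256 bins that have been zero-initialized by the caller.

    // Sketch: a simple histogram built on atomicAdd.
    __global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        while (i < n) {
            atomicAdd(&bins[data[i]], 1u);  // contended updates are resolved by the hardware
            i += stride;
        }
    }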
Enhanced global memory access. The speed of random memory access is much faster in Fermi and Kepler than in earlier CUDA systems. Programmers can be less concerned about memory coalescing. This allows more CPU algorithms to be used directly in the GPU as an acceptable base, further smoothing the path for porting applications that access a diversity of data structures, such as ray tracing, and other applications that are heavily object-oriented and may be difficult to convert into perfectly tiled arrays.
Function calls within kernel functions. Previous CUDA versions did not allow function calls in kernel code. Although the source code of kernel functions can appear to have function calls, the compiler must be able to inline all function bodies into the kernel object so that there are no function calls in the kernel function at runtime. Although this model works reasonably well for the performance-critical portions of many applications, it does not support the software engineering practices of more sophisticated applications. In particular, it does not support system calls, dynamically linked library calls, recursive function calls, and virtual functions in object-oriented languages such as C++.
More recent device architectures such as Kepler support function calls in kernel functions at runtime. This feature is supported in CUDA 5.0 and later. The compiler is no longer required to inline the function bodies; it can still do so as a performance optimization. This capability is partly enabled by a cached, fast implementation of massively parallel call frame stacks for CUDA threads. It makes CUDA device code much more “composable” by allowing different authors to write different CUDA kernel components and assemble them together without heavy redesign costs. In particular, it allows modern object-oriented techniques such as virtual function calls, and software engineering practices such as dynamically linked libraries. It also allows software vendors to release device libraries without source code for intellectual property protection.
Support for function calls at runtime allows recursion and will significantly ease the burden on programmers as they transition from legacy CPU-oriented algorithms toward GPU-tuned approaches for divide-and-conquer types of computation. This also allows easier implementation of graph algorithms where data structure traversal often naturally involves recursion. In some cases, developers will be able to “cut and paste” CPU algorithms into a CUDA kernel and obtain a reasonably performing kernel, although continued performance tuning would still add benefit.
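As a small illustration of what runtime call support makes possible, the sketch below uses a recursive device function to compute the depth of binary trees stored as child-index arrays; the data layout and names are assumptions made for the example.

    // Sketch: a recursive __device__ function (requires a runtime call stack, sm_20 or later).
    __device__ int depthOf(const int *leftChild, const int *rightChild, int node) {
        if (node < 0) return 0;                                   // empty subtree
        int dl = depthOf(leftChild, rightChild, leftChild[node]); // recurse into the children
        int dr = depthOf(leftChild, rightChild, rightChild[node]);
        return 1 + (dl > dr ? dl : dr);
    }

    __global__ void treeDepths(const int *leftChild, const int *rightChild,
                               const int *roots, int *depths, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) depths[i] = depthOf(leftChild, rightChild, roots[i]);
    }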
Exception handling in kernel functions. Early CUDA systems did not support exception handling in kernel code. While this is not a significant limitation for the performance-critical portions of many high-performance applications, it often incurs software engineering costs in production-quality applications that rely on exceptions to detect and handle rare conditions without executing code that explicitly tests for such conditions. It also prevents kernel functions from utilizing operating system services, although such services are typically avoided in performance-critical portions of applications anyway, except in debugging situations.
With the availability of exception handling and function call support, kernels can now call standard library functions such as printf() and malloc(), which can lead to system call traps. In our experience, the ability to call printf() in the kernel provides a subtle but important aid in debugging and supporting kernels in production software. Many end users are nontechnical and cannot be easily trained to run debuggers to provide developers with more details on what happened before a crash. The ability to execute printf() in the kernel allows the developers to add a mode to the application to dump the internal state so that the end users can submit meaningful bug reports.
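A minimal sketch of this kind of in-kernel reporting is shown below; the kernel and the NaN check are hypothetical examples of a “dump internal state” mode rather than code from any particular application.

    // Sketch: device-side printf used as a lightweight diagnostic (requires sm_20 or later).
    #include <cstdio>

    __global__ void checkedScale(float *data, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (isnan(data[i]))   // report a rare bad value instead of failing silently
                printf("NaN at element %d (block %d, thread %d)\n", i, blockIdx.x, threadIdx.x);
            data[i] *= s;
        }
    }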
Simultaneous execution of multiple kernels. Previous CUDA systems allowed only one kernel to execute on each GPU device at any point in time. Multiple kernel functions could be submitted for execution, but they were buffered in a queue that released the next kernel only after the current one completed execution. Fermi and its successors allow multiple kernels from the same application to be executed simultaneously, which reduces the pressure on the application developer to “batch” multiple kernels into a larger kernel to more fully utilize a device. A typical example of the benefit is for parallel cluster applications that segment work into “local” and “remote” partitions, where remote work is involved in interactions with other nodes and resides on the critical path of global progress. In previous CUDA systems, kernels needed to be large to keep the device running efficiently, and one had to be careful not to launch local work in a way that blocked global work. This meant choosing between underutilizing the device while waiting for remote work to arrive, or eagerly starting on local work to keep the device productive at the cost of increased latency for completing remote work units. With multiple kernel execution, the application can use much smaller kernels for launching work, and as a result, when high-priority remote work arrives, it can start running with low latency instead of being stuck behind a large kernel of local computation.
In Kepler and CUDA 5.0, the multiple kernel launch facility is extended by the addition of multiple hardware queues, which allow much more efficient scheduling of blocks from multiple kernels, including kernels in multiple streams. In addition, the CUDA dynamic parallelism feature allows GPU work creation: GPU kernels can launch child kernels asynchronously, dynamically, and in a data-dependent or compute load-dependent fashion. This reduces CPU–GPU interaction and synchronization, since the GPU can now manage more complex workloads independently. The CPU is in turn free to perform other useful computation.
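A minimal sketch of how an application might exploit concurrent kernels is shown below; localKernel, remoteKernel, and their arguments are hypothetical placeholders for the local and remote work partitions described above.

    // Sketch: independent work submitted in separate streams so the hardware can
    // run the kernels concurrently when resources allow.
    __global__ void localKernel(float *d);    // hypothetical local work
    __global__ void remoteKernel(float *d);   // hypothetical remote work

    void submitWork(float *localData, float *remoteData, int localBlocks, int remoteBlocks) {
        cudaStream_t localStream, remoteStream;
        cudaStreamCreate(&localStream);
        cudaStreamCreate(&remoteStream);

        localKernel<<<localBlocks, 256, 0, localStream>>>(localData);      // keeps the device busy
        remoteKernel<<<remoteBlocks, 256, 0, remoteStream>>>(remoteData);  // starts without waiting

        cudaStreamSynchronize(remoteStream);  // remote results are on the critical path
    }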
Interruptible kernels. Fermi allows a running kernel to be “canceled,” which eases the creation of CUDA-accelerated applications that allow the user to abort a long-running calculation at any time, without requiring significant design effort on the part of the programmer. Once software support is available, this will enable the implementation of user-level task scheduling systems that can better balance load between the GPU nodes of a computing system and handle more gracefully the cases where one GPU is heavily loaded and may be running slower than its peers [SH 2009].
Double-precision speed. Early devices performed double-precision floating-point arithmetic with a significant speed reduction (around eight times slower) compared to single precision. The floating-point arithmetic units of Fermi and its successors have been significantly strengthened to perform double-precision arithmetic at about half the speed of single precision. Applications that are intensive in double-precision floating-point arithmetic benefit tremendously. Other applications that use double precision carefully and sparingly see less performance impact.
In practice, the most significant benefit will likely be obtained by developers who are porting CPU-based numerical applications to GPUs. With the improved double-precision speed, they will have little incentive to spend the effort to evaluate whether their applications, or portions of them, can fit into single precision. This can significantly reduce the development cost for porting CPU applications to GPUs, and addresses a major criticism of GPUs by the high-performance computing community. Some applications that operate on smaller input data (8 bits, 16 bits, or single-precision floating point) may continue to benefit from using single-precision arithmetic, due to the reduced bandwidth of using 32-bit versus 64-bit data. Applications that process natural data, such as medical imaging, remote sensing, radio astronomy, and seismic analysis, frequently fit into this category.
Better control flow efficiency. Fermi adopts a general compiler-driven predication technique [MHM 1995] that can more effectively handle control flow than previous CUDA systems. While this technique was moderately successful in VLIW systems, it can provide more dramatic speed improvements in GPU warp-style SIMD execution systems. This capability can potentially broaden the range of applications that can take advantage of GPUs. In particular, major performance benefits can potentially be realized for applications that are very data-driven, such as ray tracing, quantum chemistry visualization [SSH 2009], and cellular automata simulation.
Future CUDA compilers will include enhanced support for C++ templates and virtual function calls in kernel functions. Although the hardware enhancements, such as the ability to make function calls at runtime, are in place, enhanced C++ language support in the compiler has been taking more time. The C++ try/catch features will also likely be fully supported in kernel functions in the near future. With these enhancements, future CUDA compilers will support most mainstream C++ features. The remaining features in kernel functions such as new, delete, constructors, and destructors will likely be available in later compiler releases.
New and evolved programming interfaces will continue to improve the productivity of heterogeneous parallel programmers. As we showed in Chapter 15, OpenACC allows developers to annotate their sequential loops with compiler directives to enable a compiler to generate CUDA kernels. In Chapter 16, we showed that one can use the Thrust library of parallel type-generic functions, classes, and iterators to describe a computation and let the underlying mechanism generate and configure the kernels that implement it. In Chapter 17, we presented CUDA FORTRAN, which allows FORTRAN programmers to develop CUDA kernels in their familiar language; in particular, this interface offers strong support for indexing into multidimensional arrays. In Chapter 18, we gave an overview of the C++ AMP interface, which allows developers to describe their kernels as parallel loops that operate on logical data structures, such as multidimensional arrays, in a C++ application. We fully expect that new innovations will continue to arise to further boost the productivity of developers in this exciting area.
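To recall the flavor of these higher-level interfaces, the sketch below shows a small Thrust fragment in which the library generates and launches the underlying kernels; the function name and data are merely illustrative.

    // Sketch: sorting and reducing on the device through Thrust's generic interface.
    #include <vector>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>

    float sortedSum(const std::vector<float> &h) {
        thrust::device_vector<float> d(h.begin(), h.end());  // copy the data to the device
        thrust::sort(d.begin(), d.end());                    // library-generated sort kernels
        return thrust::reduce(d.begin(), d.end(), 0.0f);     // library-generated reduction
    }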
The new CUDA 5.0 SDK and the new GPUs based on the Kepler architecture mark the beginning of the fourth generation of GPU computing that places real emphasis on support for developer productivity and modern software engineering practices. With the new capabilities, the range of applications that will be able to get reasonable performance at minimal development cost will expand significantly. We expect that developers will immediately notice the reduction in application development, porting, and maintenance cost compared to previous CUDA systems. The existing applications developed with Thrust and similar high-level tools that automatically generate CUDA code will also likely get an immediate boost in their performance. While the benefit of hardware enhancements in memory architecture, kernel execution control, and compute core performance will be visible in the associated SDK release, the true potential of these enhancements may take years to be fully exploited in the SDKs and runtimes. For example, the true potential of the hardware virtual memory capability will likely be fully achieved only when a shared global address space runtime that supports direct GPU I/O and peer-to-peer data transfer for multi-GPU systems becomes widely available. We predict an exciting time for innovations from both industry and academia in programming tools and runtime environments for massively parallel computing in the next few years.
Enjoy the ride!
1. Gelado, I., Navarro, N., Stone, J., Patel, S., & Hwu, W. W. (2009). An asymmetric distributed shared memory model for heterogeneous parallel systems, Technical Report, IMPACT Group, University of Illinois, Urbana-Champaign.
2. Mahlke, S. A., Hank, R. E., McCormick, J. E., August, D. I., & Hwu, W. W. (June 1995). A comparison of full and partial predicated execution support for ILP processors, Proceedings of the 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, pp. 138–150.
3. Stone, J. E., & Hwu, W. W. (2009). WorkForce: A Lightweight Framework for Managing Multi-GPU Computations, Technical Report, IMPACT Group, University of Illinois, Urbana-Champaign.
4. Satish, N., Harris, M., & Garland, M. (May 2009). Designing efficient sorting algorithms for manycore GPUs, Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, Rome, Italy, pp. 177–187.
5. Sengupta, S., Harris, M., Zhang, Y., & Owens, J. D. (Aug. 2007). Scan Primitives for GPU computing, Proceedings of Graphics Hardware 2007, San Diego, California, pp. 97–106.
6. Stone, J. E., Saam, J., Hardy, D. J., Vandivort, K. L., Hwu, W. W., & Schulten, K. (March 2009). High performance computation and interactive display of molecular orbitals on GPUs and multi-core CPUs, the Second GPGPU Workshop, held with the ACM/IEEE Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 9–18.