Title Page
Copyright Page
Dedication Page
Contents
Preface
Acknowledgments
About the Author
Part I
Chapter 1. Background
1.1. Our Approach
1.2. Code
1.3. Administrative Items
1.4. Road Map
Chapter 2. Hardware Architecture
2.1. CPU Configurations
2.2. Integrated GPUs
2.3. Multiple GPUs
2.4. Address Spaces in CUDA
2.5. CPU/GPU Interactions
2.6. GPU Architecture
2.7. Further Reading
Chapter 3. Software Architecture
3.1. Software Layers
3.2. Devices and Initialization
3.3. Contexts
3.4. Modules and Functions
3.5. Kernels (Functions)
3.6. Device Memory
3.7. Streams and Events
3.8. Host Memory
3.9. CUDA Arrays and Texturing
3.10. Graphics Interoperability
3.11. The CUDA Runtime and CUDA Driver API
Chapter 4. Software Environment
4.1. nvcc—CUDA Compiler Driver
4.2. ptxas—the PTX Assembler
4.3. cuobjdump
4.4. nvidia-smi
4.5. Amazon Web Services
Part II
Chapter 5. Memory
5.1. Host Memory
5.2. Global Memory
5.3. Constant Memory
5.4. Local Memory
5.5. Texture Memory
5.6. Shared Memory
5.7. Memory Copy
Chapter 6. Streams and Events
6.1. CPU/GPU Concurrency: Covering Driver Overhead
6.2. Asynchronous Memcpy
6.3. CUDA Events: CPU/GPU Synchronization
6.4. CUDA Events: Timing
6.5. Concurrent Copying and Kernel Processing
6.6. Mapped Pinned Memory
6.7. Concurrent Kernel Processing
6.8. GPU/GPU Synchronization: cudaStreamWaitEvent()
6.9. Source Code Reference
Chapter 7. Kernel Execution
7.1. Overview
7.2. Syntax
7.3. Blocks, Threads, Warps, and Lanes
7.4. Occupancy
7.5. Dynamic Parallelism
Chapter 8. Streaming Multiprocessors
8.1. Memory
8.2. Integer Support
8.3. Floating-Point Support
8.4. Conditional Code
8.5. Textures and Surfaces
8.6. Miscellaneous Instructions
8.7. Instruction Sets
Chapter 9. Multiple GPUs
9.1. Overview
9.2. Peer-to-Peer
9.3. UVA: Inferring Device from Address
9.4. Inter-GPU Synchronization
9.5. Single-Threaded Multi-GPU
9.6. Multithreaded Multi-GPU
Chapter 10. Texturing
10.1. Overview
10.2. Texture Memory
10.3. 1D Texturing
10.4. Texture as a Read Path
10.5. Texturing with Unnormalized Coordinates
10.6. Texturing with Normalized Coordinates
10.7. 1D Surface Read/Write
10.8. 2D Texturing
10.9. 2D Texturing: Copy Avoidance
10.10. 3D Texturing
10.11. Layered Textures
10.12. Optimal Block Sizing and Performance
10.13. Texturing Quick References
Part III
Chapter 11. Streaming Workloads
11.1. Device Memory
11.2. Asynchronous Memcpy
11.3. Streams
11.4. Mapped Pinned Memory
11.5. Performance and Summary
Chapter 12. Reduction
12.1. Overview
12.2. Two-Pass Reduction
12.3. Single-Pass Reduction
12.4. Reduction with Atomics
12.5. Arbitrary Block Sizes
12.6. Reduction Using Arbitrary Data Types
12.7. Predicate Reduction
12.8. Warp Reduction with Shuffle
Chapter 13. Scan
13.1. Definition and Variations
13.2. Overview
13.3. Scan and Circuit Design
13.4. CUDA Implementations
13.5. Warp Scans
13.6. Stream Compaction
13.7. References (Parallel Scan Algorithms)
13.8. Further Reading (Parallel Prefix Sum Circuits)
Chapter 14. N-Body
14.1. Introduction
14.2. Naïve Implementation
14.3. Shared Memory
14.4. Constant Memory
14.5. Warp Shuffle
14.6. Multiple GPUs and Scalability
14.7. CPU Optimizations
14.8. Conclusion
14.9. References and Further Reading
Chapter 15. Image Processing: Normalized Correlation
15.1. Overview
15.2. Naïve Texture-Texture Implementation
15.3. Template in Constant Memory
15.4. Image in Shared Memory
15.5. Further Optimizations
15.6. Source Code
15.7. Performance and Further Reading
15.8. Further Reading
Appendix A. The CUDA Handbook Library
A.1. Timing
A.2. Threading
A.3. Driver API Facilities
A.4. Shmoos
A.5. Command Line Parsing
A.6. Error Handling
Glossary / TLA Decoder
Index