Title Page
Copyright Page
Dedication Page
Contents
Preface
Acknowledgments
About the Author
Part I
Chapter 1. Background
1.1. Our Approach
1.2. Code
1.3. Administrative Items
1.4. Road Map
Chapter 2. Hardware Architecture
2.1. CPU Configurations
2.2. Integrated GPUs
2.3. Multiple GPUs
2.4. Address Spaces in CUDA
2.5. CPU/GPU Interactions
2.6. GPU Architecture
2.7. Further Reading
Chapter 3. Software Architecture
3.1. Software Layers
3.2. Devices and Initialization
3.3. Contexts
3.4. Modules and Functions
3.5. Kernels (Functions)
3.6. Device Memory
3.7. Streams and Events
3.8. Host Memory
3.9. CUDA Arrays and Texturing
3.10. Graphics Interoperability
3.11. The CUDA Runtime and CUDA Driver API
Chapter 4. Software Environment
4.1. nvcc—CUDA Compiler Driver
4.2. ptxas—the PTX Assembler
4.3. cuobjdump
4.4. nvidia-smi
4.5. Amazon Web Services
Part II
Chapter 5. Memory
5.1. Host Memory
5.2. Global Memory
5.3. Constant Memory
5.4. Local Memory
5.5. Texture Memory
5.6. Shared Memory
5.7. Memory Copy
Chapter 6. Streams and Events
6.1. CPU/GPU Concurrency: Covering Driver Overhead
6.2. Asynchronous Memcpy
6.3. CUDA Events: CPU/GPU Synchronization
6.4. CUDA Events: Timing
6.5. Concurrent Copying and Kernel Processing
6.6. Mapped Pinned Memory
6.7. Concurrent Kernel Processing
6.8. GPU/GPU Synchronization: cudaStreamWaitEvent()
6.9. Source Code Reference
Chapter 7. Kernel Execution
7.1. Overview
7.2. Syntax
7.3. Blocks, Threads, Warps, and Lanes
7.4. Occupancy
7.5. Dynamic Parallelism
Chapter 8. Streaming Multiprocessors
8.1. Memory
8.2. Integer Support
8.3. Floating-Point Support
8.4. Conditional Code
8.5. Textures and Surfaces
8.6. Miscellaneous Instructions
8.7. Instruction Sets
Chapter 9. Multiple GPUs
9.1. Overview
9.2. Peer-to-Peer
9.3. UVA: Inferring Device from Address
9.4. Inter-GPU Synchronization
9.5. Single-Threaded Multi-GPU
9.6. Multithreaded Multi-GPU
Chapter 10. Texturing
10.1. Overview
10.2. Texture Memory
10.3. 1D Texturing
10.4. Texture as a Read Path
10.5. Texturing with Unnormalized Coordinates
10.6. Texturing with Normalized Coordinates
10.7. 1D Surface Read/Write
10.8. 2D Texturing
10.9. 2D Texturing: Copy Avoidance
10.10. 3D Texturing
10.11. Layered Textures
10.12. Optimal Block Sizing and Performance
10.13. Texturing Quick References
Part III
Chapter 11. Streaming Workloads
11.1. Device Memory
11.2. Asynchronous Memcpy
11.3. Streams
11.4. Mapped Pinned Memory
11.5. Performance and Summary
Chapter 12. Reduction
12.1. Overview
12.2. Two-Pass Reduction
12.3. Single-Pass Reduction
12.4. Reduction with Atomics
12.5. Arbitrary Block Sizes
12.6. Reduction Using Arbitrary Data Types
12.7. Predicate Reduction
12.8. Warp Reduction with Shuffle
Chapter 13. Scan
13.1. Definition and Variations
13.2. Overview
13.3. Scan and Circuit Design
13.4. CUDA Implementations
13.5. Warp Scans
13.6. Stream Compaction
13.7. References (Parallel Scan Algorithms)
13.8. Further Reading (Parallel Prefix Sum Circuits)
Chapter 14. N-Body
14.1. Introduction
14.2. Naïve Implementation
14.3. Shared Memory
14.4. Constant Memory
14.5. Warp Shuffle
14.6. Multiple GPUs and Scalability
14.7. CPU Optimizations
14.8. Conclusion
14.9. References and Further Reading
Chapter 15. Image Processing: Normalized Correlation
15.1. Overview
15.2. Naïve Texture-Texture Implementation
15.3. Template in Constant Memory
15.4. Image in Shared Memory
15.5. Further Optimizations
15.6. Source Code
15.7. Performance and Further Reading
15.8. Further Reading
Appendix A. The CUDA Handbook Library
A.1. Timing
A.2. Threading
A.3. Driver API Facilities
A.4. Shmoos
A.5. Command Line Parsing
A.6. Error Handling
Glossary / TLA Decoder
Index