
Index
Title Page
Copyright Page
Dedication Page
Contents
Preface
Acknowledgments
About the Author

Part I
Chapter 1. Background
1.1. Our Approach
1.2. Code
1.3. Administrative Items
1.4. Road Map
Chapter 2. Hardware Architecture
2.1. CPU Configurations
2.2. Integrated GPUs
2.3. Multiple GPUs
2.4. Address Spaces in CUDA
2.5. CPU/GPU Interactions
2.6. GPU Architecture
2.7. Further Reading
Chapter 3. Software Architecture
3.1. Software Layers
3.2. Devices and Initialization
3.3. Contexts
3.4. Modules and Functions
3.5. Kernels (Functions)
3.6. Device Memory
3.7. Streams and Events
3.8. Host Memory
3.9. CUDA Arrays and Texturing
3.10. Graphics Interoperability
3.11. The CUDA Runtime and CUDA Driver API
Chapter 4. Software Environment
4.1. nvcc—CUDA Compiler Driver
4.2. ptxas—the PTX Assembler
4.3. cuobjdump
4.4. nvidia-smi
4.5. Amazon Web Services
Part II
Chapter 5. Memory
5.1. Host Memory
5.2. Global Memory
5.3. Constant Memory
5.4. Local Memory
5.5. Texture Memory
5.6. Shared Memory
5.7. Memory Copy
Chapter 6. Streams and Events
6.1. CPU/GPU Concurrency: Covering Driver Overhead
6.2. Asynchronous Memcpy
6.3. CUDA Events: CPU/GPU Synchronization
6.4. CUDA Events: Timing
6.5. Concurrent Copying and Kernel Processing
6.6. Mapped Pinned Memory
6.7. Concurrent Kernel Processing
6.8. GPU/GPU Synchronization: cudaStreamWaitEvent()
6.9. Source Code Reference
Chapter 7. Kernel Execution
7.1. Overview
7.2. Syntax
7.3. Blocks, Threads, Warps, and Lanes
7.4. Occupancy
7.5. Dynamic Parallelism
Chapter 8. Streaming Multiprocessors
8.1. Memory
8.2. Integer Support
8.3. Floating-Point Support
8.4. Conditional Code
8.5. Textures and Surfaces
8.6. Miscellaneous Instructions
8.7. Instruction Sets
Chapter 9. Multiple GPUs
9.1. Overview
9.2. Peer-to-Peer
9.3. UVA: Inferring Device from Address
9.4. Inter-GPU Synchronization
9.5. Single-Threaded Multi-GPU
9.6. Multithreaded Multi-GPU
Chapter 10. Texturing
10.1. Overview
10.2. Texture Memory
10.3. 1D Texturing
10.4. Texture as a Read Path
10.5. Texturing with Unnormalized Coordinates
10.6. Texturing with Normalized Coordinates
10.7. 1D Surface Read/Write
10.8. 2D Texturing
10.9. 2D Texturing: Copy Avoidance
10.10. 3D Texturing
10.11. Layered Textures
10.12. Optimal Block Sizing and Performance
10.13. Texturing Quick References
Part III
Chapter 11. Streaming Workloads
11.1. Device Memory
11.2. Asynchronous Memcpy
11.3. Streams
11.4. Mapped Pinned Memory
11.5. Performance and Summary
Chapter 12. Reduction
12.1. Overview
12.2. Two-Pass Reduction
12.3. Single-Pass Reduction
12.4. Reduction with Atomics
12.5. Arbitrary Block Sizes
12.6. Reduction Using Arbitrary Data Types
12.7. Predicate Reduction
12.8. Warp Reduction with Shuffle
Chapter 13. Scan
13.1. Definition and Variations
13.2. Overview
13.3. Scan and Circuit Design
13.4. CUDA Implementations
13.5. Warp Scans
13.6. Stream Compaction
13.7. References (Parallel Scan Algorithms)
13.8. Further Reading (Parallel Prefix Sum Circuits)
Chapter 14. N-Body
14.1. Introduction
14.2. Naïve Implementation
14.3. Shared Memory
14.4. Constant Memory
14.5. Warp Shuffle
14.6. Multiple GPUs and Scalability
14.7. CPU Optimizations
14.8. Conclusion
14.9. References and Further Reading
Chapter 15. Image Processing: Normalized Correlation
15.1. Overview
15.2. Naïve Texture-Texture Implementation
15.3. Template in Constant Memory
15.4. Image in Shared Memory
15.5. Further Optimizations
15.6. Source Code
15.7. Performance and Further Reading
15.8. Further Reading
Appendix A. The CUDA Handbook Library
A.1. Timing
A.2. Threading
A.3. Driver API Facilities
A.4. Shmoos
A.5. Command Line Parsing
A.6. Error Handling
Glossary / TLA Decoder
Index
