Back End Sheet

Translation between the GPU terms used in this book and the official NVIDIA/CUDA and OpenCL terms.

Program Abstractions

| More Descriptive Name used in this Book | Official CUDA/NVIDIA Term | Book Definition and OpenCL Terms | Official CUDA/NVIDIA Definition |
| --- | --- | --- | --- |
| Vectorizable Loop | Grid | A vectorizable loop, executed on the GPU, made up of one or more "Thread Blocks" (bodies of the vectorized loop) that can execute in parallel. The OpenCL name is "NDRange." | A Grid is an array of Thread Blocks that can execute concurrently, sequentially, or a mixture. |
| Body of Vectorized Loop | Thread Block | A vectorized loop executed on a "Streaming Multiprocessor" (multithreaded SIMD processor), made up of one or more "Warps" (threads of SIMD instructions). These "Warps" (SIMD Threads) can communicate via "Shared Memory" (Local Memory). OpenCL calls a Thread Block a "work group." | A Thread Block is an array of CUDA Threads that execute concurrently and can cooperate and communicate via Shared Memory and barrier synchronization. A Thread Block has a Thread Block ID within its Grid. |
| Sequence of SIMD Lane Operations | CUDA Thread | A vertical cut of a "Warp" (thread of SIMD instructions) corresponding to one element executed by one "Thread Processor" (SIMD lane). The result is stored depending on the mask. OpenCL calls a CUDA Thread a "work item." | A CUDA Thread is a lightweight thread that executes a sequential program and can cooperate with other CUDA Threads executing in the same Thread Block. A CUDA Thread has a thread ID within its Thread Block. |
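
Taken together, these three program abstractions map onto one CUDA kernel launch. Below is a minimal sketch (the kernel name `add` and the sizes are illustrative, not from the book): the launch creates a Grid (the vectorizable loop) of Thread Blocks (bodies of the vectorized loop), and each CUDA Thread (a sequence of SIMD Lane operations) handles one loop iteration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each CUDA Thread (one "SIMD Lane operation" per element) computes one y[i].
__global__ void add(int n, const float *x, float *y) {
    // Global index: which iteration of the vectorizable loop this thread owns.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // the bounds check acts like the per-element mask
        y[i] += x[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // Global Memory (GPU Memory)
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Launch a Grid of n/256 Thread Blocks, each of 256 CUDA Threads
    // (256 threads = 8 Warps of 32 "SIMD Lane operations").
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(n, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);   // expect 3.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```
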
Machine Object

| More Descriptive Name used in this Book | Official CUDA/NVIDIA Term | Book Definition and OpenCL Terms | Official CUDA/NVIDIA Definition |
| --- | --- | --- | --- |
| A Thread of SIMD Instructions | Warp | A traditional thread, but one that contains just SIMD instructions, executed on a "Streaming Multiprocessor" (multithreaded SIMD processor). Results are stored depending on a per-element mask. | A Warp is a set of parallel CUDA Threads (e.g., 32) that execute the same instruction together in a multithreaded SIMT/SIMD processor. |
| SIMD Instruction | PTX Instruction | A single SIMD instruction executed across the "Thread Processors" (SIMD lanes). | A PTX instruction specifies an instruction executed by a CUDA Thread. |
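
For a concrete view of the machine objects, the sketch below (the kernel name `warpSum` is made up) sums one value per CUDA Thread across a single Warp using the warp shuffle intrinsic `__shfl_down_sync`, which moves data between the SIMD Lanes of one Warp. Compiling with `nvcc -ptx` shows the PTX instructions (the book's SIMD instructions) the kernel becomes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sum one value per CUDA Thread across a single Warp (32 SIMD Lanes).
// The 0xffffffff mask tells __shfl_down_sync that all 32 lanes participate.
__global__ void warpSum(int *out) {
    int lane = threadIdx.x % 32;     // this thread's SIMD Lane within its Warp
    int val = lane + 1;              // 1 + 2 + ... + 32 = 528
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    if (lane == 0)                   // lane 0 ends up holding the Warp-wide sum
        *out = val;
}

int main() {
    int *out;
    cudaMallocManaged(&out, sizeof(int));
    warpSum<<<1, 32>>>(out);         // one Thread Block of exactly one Warp
    cudaDeviceSynchronize();
    printf("warp sum = %d\n", *out); // expect 528
    cudaFree(out);
    return 0;
}
```
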
Processing Hardware

| More Descriptive Name used in this Book | Official CUDA/NVIDIA Term | Book Definition and OpenCL Terms | Official CUDA/NVIDIA Definition |
| --- | --- | --- | --- |
| Multithreaded SIMD Processor | Streaming Multiprocessor | Multithreaded SIMD processor that executes "Warps" (threads of SIMD instructions), independent of other SIMD processors. OpenCL calls it a "Compute Unit." However, the CUDA programmer writes a program for one lane rather than for a "vector" of multiple SIMD lanes. | A Streaming Multiprocessor (SM) is a multithreaded SIMT/SIMD processor that executes Warps of CUDA Threads. A SIMT program specifies the execution of one CUDA Thread, rather than a vector of multiple SIMD lanes. |
| Thread Block Scheduler | Giga Thread Engine | Assigns multiple "Thread Blocks" (bodies of the vectorized loop) to "Streaming Multiprocessors" (multithreaded SIMD processors). | Distributes and schedules Thread Blocks of a Grid to Streaming Multiprocessors as resources become available. |
| SIMD Thread Scheduler | Warp Scheduler | Hardware unit that schedules and issues "Warps" (threads of SIMD instructions) when they are ready to execute; includes a scoreboard to track "Warp" (SIMD Thread) execution. | A Warp Scheduler in a Streaming Multiprocessor schedules Warps for execution when their next instruction is ready to execute. |
| SIMD Lane | Thread Processor | Hardware SIMD lane that executes the operations in a "Warp" (thread of SIMD instructions) on a single element. Results are stored depending on the mask. OpenCL calls it a "Processing Element." | A Thread Processor is the datapath and register file portion of a Streaming Multiprocessor that executes operations for one or more lanes of a Warp. |
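
The processing hardware can be inspected from the CUDA runtime. This sketch queries device 0 for the quantities the table's terms refer to; `cudaGetDeviceProperties` and the fields shown are standard CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Each Streaming Multiprocessor (multithreaded SIMD processor) runs the
    // Warps of whatever Thread Blocks the Thread Block Scheduler (Giga
    // Thread Engine) assigns to it.
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Warp size (CUDA Threads per Warp): %d\n", prop.warpSize);
    printf("Max CUDA Threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```
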
Memory Hardware

| More Descriptive Name used in this Book | Official CUDA/NVIDIA Term | Book Definition and OpenCL Terms | Official CUDA/NVIDIA Definition |
| --- | --- | --- | --- |
| GPU Memory | Global Memory | DRAM memory accessible by all "Streaming Multiprocessors" (multithreaded SIMD processors) in a GPU. OpenCL calls it "Global Memory." | Global Memory is accessible by all CUDA Threads in any Thread Block in any Grid. Implemented as a region of DRAM, and may be cached. |
| Private Memory | Local Memory | Portion of DRAM memory private to each "Thread Processor" (SIMD lane). OpenCL calls it "Private Memory." | Private "thread-local" memory for a CUDA Thread. Implemented as a cached region of DRAM. |
| Local Memory | Shared Memory | Fast local SRAM for one "Streaming Multiprocessor" (multithreaded SIMD processor), unavailable to other Streaming Multiprocessors. OpenCL calls it "Local Memory." | Fast SRAM memory shared by the CUDA Threads composing a Thread Block, and private to that Thread Block. Used for communication among CUDA Threads in a Thread Block at barrier synchronization points. |
| SIMD Lane Registers | Registers | Registers in a single "Thread Processor" (SIMD lane) allocated across the full "Thread Block" (body of the vectorized loop). | Private registers for a CUDA Thread. Implemented as a multithreaded register file for certain lanes of several Warps for each Thread Processor. |
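
These memory spaces typically appear together in one kernel. The sketch below (kernel and variable names are illustrative) sums one Thread Block's elements: the inputs live in Global Memory (GPU Memory), the tile lives in Shared Memory (the book's Local Memory), per-thread scalars live in SIMD Lane Registers, and `__syncthreads()` is the barrier synchronization the table mentions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Block-wide sum illustrating the memory spaces in the table above.
__global__ void blockSum(const float *in, float *out) {  // in/out: Global Memory
    __shared__ float tile[256];      // Shared Memory: per-Thread-Block SRAM
    int t = threadIdx.x;             // t lives in a SIMD Lane Register
    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                 // barrier: the Block's Threads wait here

    // Tree reduction in Shared Memory (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            tile[t] += tile[t + stride];
        __syncthreads();
    }
    if (t == 0)
        out[blockIdx.x] = tile[0];   // one result per Thread Block
}

int main() {
    const int n = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<1, n>>>(in, out);     // one Thread Block of 256 CUDA Threads
    cudaDeviceSynchronize();
    printf("block sum = %f\n", *out);   // expect 256.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```
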
