Glossary
Abstraction:
In the case of TBB, abstraction serves to separate the work into work appropriate for a programmer and work best left to a runtime. The goal of such an abstraction is to deliver scalable high performance on a variety of multicore and many-core systems, and even heterogeneous platforms, without requiring rewriting of the code. This careful division of responsibilities leaves the programmer to expose opportunities for parallelism, and the runtime responsible for mapping the opportunities to the hardware. Code written to the abstraction will be free of parameterization for cache sizes, number of cores, and even consistency of performance from processing unit to processing unit.
Affinity:
The specification of methods to associate a particular software thread with a particular hardware thread, usually with the objective of getting better or more predictable performance. Affinity specifications include the concept of spreading threads maximally apart to reduce contention (scatter), or packing them tightly together (compact) to minimize communication distances. OpenMP supports a rich set of affinity controls at various levels, from abstract to full manual control. Fortran 2008 does not specify controls, but Intel reuses the OpenMP controls for “do concurrent.” Threading Building Blocks (TBB) provides an abstract loop-to-loop affinity biasing capability.
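As a sketch of TBB’s loop-to-loop affinity biasing (the function name, array, and loop body here are illustrative, not from the text), a single tbb::affinity_partitioner object can be kept alive across repeated parallel loops so the runtime can replay iterations onto the cores that touched the same data previously:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>
    #include <vector>

    void relax(std::vector<float>& a) {
      // One partitioner object, kept alive across calls, remembers where
      // each chunk of the range ran last time and biases it there again.
      static tbb::affinity_partitioner ap;
      tbb::parallel_for(
          tbb::blocked_range<size_t>(0, a.size()),
          [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
              a[i] *= 0.5f;
          },
          ap);
    }

Calling relax repeatedly on the same data lets cache reuse between loop invocations benefit from the biasing.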
Algorithm
is a term that TBB has used in association with general, reusable solutions to common parallel programming problems. TBB, and this book, therefore use the term ‘parallel algorithm’ when ‘parallel pattern’ would also be an appropriate description.
Amdahl’s Law:
Speedup is limited by the nonparallelizable serial portion of the work. For example, if two thirds of a program can be run in parallel but one third cannot be sped up by parallelism, then speedup can only approach 3× and never exceed it, assuming the same work is done. If scaling the problem size places more demands on the parallel portions of the program, then Amdahl’s Law is not as bad as it may seem. See
Gustafson-Barsis’ law
.
Atom
is touted as a hackable text editor for the twenty-first century, and it is open source. Rafa says “I also love emacs, but now Atom is winning this space on my Mac.” Compare to the vi and emacs editors.
Atomic operation
is an operation that is guaranteed to appear as if it occurred indivisibly, without interference from other threads. For example, a processor might provide a memory increment operation. This operation needs to read a value from memory, increment it, and write it back to memory. An atomic increment guarantees that the final memory value is the same as would have occurred if no other operations on that memory location were allowed to happen between the read and the write.
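A minimal sketch in standard C++ (the counter name and thread count are arbitrary): fetch_add performs the read, increment, and write as one indivisible operation, so no updates are lost even when several threads increment concurrently.

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
      std::atomic<int> counter{0};
      std::vector<std::thread> workers;
      for (int t = 0; t < 4; ++t)
        workers.emplace_back([&] {
          for (int i = 0; i < 100000; ++i)
            counter.fetch_add(1);     // indivisible read-increment-write
        });
      for (auto& w : workers) w.join();
      std::cout << counter << "\n";   // always prints 400000
      return 0;
    }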
Barrier:
When a computation is broken into phases, it is often necessary to ensure that all threads complete all the work in one phase before any thread moves onto another phase. A barrier is a form of synchronization that ensures this: threads arriving at a barrier wait there until the last thread arrives, then all threads continue. A barrier can be implemented using atomic operations. For example, all threads might try to increment a shared variable, then block if the value of that variable does not equal the number of threads that need to synchronize at the barrier. The last thread to arrive can then reset the barrier to zero and release all the blocked threads.
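A minimal sketch (not taken from any library) of a single-use barrier built from an atomic counter, along the lines just described; a reusable barrier needs additional care (e.g., sense reversal), and C++20 offers std::barrier directly.

    #include <atomic>

    class SpinBarrier {
      const int n_threads;
      std::atomic<int> arrived{0};
    public:
      explicit SpinBarrier(int n) : n_threads(n) {}
      void wait() {
        arrived.fetch_add(1);                // announce arrival
        while (arrived.load() < n_threads) {
          // spin until the last thread has arrived
        }
      }
    };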
Block
can be used in two senses: (1) a state in which a thread is unable to proceed while it waits for some synchronization event and (2) a region of memory. The second meaning is also used in the sense of dividing a loop into a set of parallel tasks of a suitable granularity.
Cache
is a part of the memory system that temporarily stores copies of data in fast memory so that future uses of that data can be handled more quickly than if it had to be fetched again from more distant storage. Caches are generally automatic and are designed to enhance programs with temporal locality and/or spatial locality. Caching systems in modern computers are usually multileveled.
Cache friendly
is a characteristic of an application in which performance increases as problem size increases but then levels off as the bandwidth limit is reached.
Cache lines
are the units in which data is retrieved and held by a cache; in order to exploit spatial locality, they are generally larger than a word. The general trend is toward increasing cache line sizes, which are generally large enough to hold at least two double-precision floating-point numbers, but unlikely to hold more than eight on any current design. Larger cache lines allow for more efficient bulk transfers from main memory but worsen certain issues, including false sharing, which generally degrades performance.
Cache oblivious algorithm
is any algorithm that performs well, without modification, on machines with different memory organizations, such as different levels of cache with different sizes. Since such algorithms are carefully designed to exhibit compact memory reuse, it seems like it would have made more sense to call such algorithms cache agnostic. The term
oblivious
is a reference to the fact that such algorithms are not aware of the parameters of the memory subsystem, such as the cache sizes or relative speeds. This is in contrast with earlier efforts to carefully block algorithms for specific cache hardware.
Cache unfriendly
is a characteristic of an application whose performance depends on the workload’s memory footprint staying near an optimal size. In this case, performance stays constant or increases as the problem size approaches the optimal size and then decreases as the problem size grows further. In these workloads, there is a definite “sweet spot.”
Clusters
are a set of computers with distributed memory communicating over a high-speed interconnect. The individual computers are often called
nodes
. TBB is used at the node level within a cluster, although multiple nodes are commonly programmed with TBB and then connected (usually with MPI).
Communication:
Any exchange of data or
synchronization
between software tasks or threads. Understanding that communication costs are often a limiting factor in scaling is a critical concept for parallel programming.
Composability:
The ability to use two components in concert with each other without causing failure or unreasonable conflict (ideally no conflict). Limitations on composability, if they exist, are best diagnosed completely at build time rather than requiring any testing; composability problems that manifest only at runtime are the biggest problem with non-composable systems. Composability can refer to system features, programming models, or software components.
Concurrent
means that things are logically happening simultaneously. Two tasks that are both logically active at some point in time are considered to be concurrent. This is in contrast with
parallel
.
Core:
A separate subprocessor on a multicore processor. A core should be able to support (at least one) separate and divergent flow of control from other cores on the same processor.
Data parallelism
is an approach to parallelism that attempts to be oriented around data rather than tasks. However, in reality, successful strategies in parallel algorithm development tend to focus on exploiting the parallelism in data, because data decomposition (generating tasks for different units of data) scales, but functional decomposition (generation of heterogeneous tasks for different functions) does not. See Amdahl’s Law and Gustafson-Barsis’ law.
Deadlock
is a programming error. Deadlock occurs when at least two tasks wait for each other and neither will resume until the other proceeds. This happens easily when code requires locking multiple mutexes: for example, each task can be holding a mutex required by the other task.
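A hedged sketch of the classic lock-ordering deadlock: each task acquires one mutex and then waits for the mutex the other task holds. Acquiring the mutexes in a consistent order everywhere (or locking both at once with std::scoped_lock) avoids the problem.

    #include <mutex>
    #include <thread>

    std::mutex a, b;

    void task1() {
      std::lock_guard<std::mutex> hold_a(a);  // holds a ...
      std::lock_guard<std::mutex> hold_b(b);  // ... then waits for b
    }

    void task2() {
      std::lock_guard<std::mutex> hold_b(b);  // holds b ...
      std::lock_guard<std::mutex> hold_a(a);  // ... then waits for a: deadlock possible
    }

    int main() {
      std::thread t1(task1), t2(task2);
      t1.join();
      t2.join();    // may never be reached if the two tasks deadlock
      return 0;
    }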
Deterministic
refers to an algorithm that behaves predictably: given a particular input, a deterministic algorithm will always produce the same output. The definition of what is the “same” may be important due to limited precision in mathematical operations and the likelihood that optimizations, including parallelization, will rearrange the order of operations. These are often referred to as “rounding” differences, which result when the order of mathematical operations to compute the answer differs between the original program and the final concurrent program. Concurrency is not the only factor that can lead to
nondeterministic
algorithms, but in practice it is often the cause. Using programming models with sequential semantics and eliminating data races with proper access controls will generally eliminate the major effects of concurrency other than the “rounding” differences.
Distributed memory
is memory that is physically located in separate computers. An indirect interface, such as message passing, is required to access memory on remote computers, while local memory can be accessed directly. Distributed memory is typically supported by clusters which, for purposes of this definition, we are considering to be a collection of computers. Since the memory on attached coprocessors also cannot typically be addressed directly from the host, it can be considered, for functional purposes, to be a form of distributed memory.
DSP
(Digital Signal Processor) is a computing device designed specifically for digital signal processing tasks such as those associated with radio communications, including filters, FFTs, and analog-to-digital conversion. The computational capabilities of DSPs alongside CPUs gave rise to some of the earliest examples of heterogeneous platforms and to various programming language extensions to control and interact with a DSP. OpenCL is a programming model that can help harness the compute aspects of DSPs. See also, heterogeneous platforms.
EFLOP/s
(ExaFLOP/s) = 10^18 Floating-Point Operations per second.
EFLOPs
(ExaFLOPs) = 10^18 Floating-Point Operations.
emacs
is the best text editor in the world (according to James), and it is open source. Compare to the vi editor. “emacs” is the first package James installs on any computer that he uses.
Embarrassing parallelism
is a description of an algorithm if it can be decomposed into a large number of independent tasks with little or no synchronization or communication required.
False sharing:
Two separate tasks in two separate cores may write to separate locations in memory, but if those memory locations happened to be allocated in the same cache line, the cache coherence hardware will attempt to keep the cache lines coherent, resulting in extra interprocessor communication and reduced performance, even though the tasks are not actually sharing data.
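A brief sketch of the problem and one common fix; the 64-byte cache line size assumed here is typical but hardware-dependent. In the first struct, the two per-thread counters will usually land in one cache line, so writes by different cores ping-pong the line even though no data is logically shared; in the second, each counter gets its own line.

    #include <atomic>

    struct Counters {                    // both fields likely share one cache line
      std::atomic<long> hits_thread0{0};
      std::atomic<long> hits_thread1{0};
    };

    struct PaddedCounters {              // each field aligned to its own cache line
      alignas(64) std::atomic<long> hits_thread0{0};
      alignas(64) std::atomic<long> hits_thread1{0};
    };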
Far memory:
In a NUMA system, memory that has longer access times than the near memory. The view of which parts of memory are near vs. far depends on the process from which code is running. We also refer to this memory as nonlocal memory (in contrast to local memory) in Chapter
20
.
Floating-point number
is a format for numbers in computers that trades a wider range for reduced precision by splitting the available bits between a number (mantissa) and a shift count (exponent) that places the point to the left or right of a fixed position. In contrast, fixed-point representations lack an explicit exponent, thereby allowing all bits to be used for the number (mantissa).
Floating-point operation
includes add, multiply, subtract, and more, done to floating-point numbers.
FLOP/s
= Floating-Point Operations per second.
FLOPs
= Floating-Point Operations.
Flow Graph
is a way to describe an algorithm using graph notation. Use of graph notation means that a flow graph consists of computational nodes connected by edges denoting possible control flow.
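A minimal sketch of a TBB flow graph with two nodes connected by an edge; the doubling and printing bodies are illustrative only.

    #include <tbb/flow_graph.h>
    #include <iostream>

    int main() {
      tbb::flow::graph g;
      tbb::flow::function_node<int, int> doubler(
          g, tbb::flow::unlimited, [](int x) { return 2 * x; });
      tbb::flow::function_node<int, int> printer(
          g, tbb::flow::serial, [](int x) { std::cout << x << "\n"; return x; });
      tbb::flow::make_edge(doubler, printer);   // edge: doubler -> printer
      for (int i = 0; i < 4; ++i)
        doubler.try_put(i);                     // feed messages into the graph
      g.wait_for_all();                         // wait until the graph drains
      return 0;
    }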
Forward scaling
is the concept of having a program or algorithm scalable already in threads and/or vectors so as to be ready to take advantage of growth of parallelism in future hardware with a simple recompile with a new compiler or relink to a new library. Using the right abstractions to express parallelism is normally a key to enabling forward scaling when writing a parallel program.
FPGA
(Field Programmable Gate Array) is a device that integrates a large number of gates (and often higher-level constructs such as DSPs, floating-point units, or network controllers) that remain unconnected to each other until the device is programmed. Programming was originally a whole-chip process intended to be done once when a system was started, but modern FPGAs support partial reconfigurability and are often dynamically loaded with new programs as a matter of course. Traditionally, FPGAs were viewed as a way to consolidate a large number of discrete chips in a design into a single FPGA – usually saving board space, power, and overall cost. As such, FPGAs were programmed using tools similar to those used to design circuitry at a board or chip level – called hardware description languages (e.g., VHDL or Verilog). More recent use of FPGAs as compute engines has relied on the OpenCL programming model.
future-proofed:
A computer program written in a manner so it will survive future computer architecture changes without requiring significant changes to the program itself. Generally, the more abstract a programming method is, the more
future-proof
that program is. Lower level programming methods that in some way mirror computer architectural details will be less able to survive the future without change. Writing in an abstract, more
future-proof
fashion may involve tradeoffs in efficiency, however.
GFLOP/s
(GigaFLOP/s) = 10^9 Floating-Point Operations per second.
GFLOPs
(GigaFLOPs) = 10^9 Floating-Point Operations.
GPU
(Graphics Processing Unit) is a computing device designed to perform calculations associated with graphics such as lighting, transformations, clipping, and rendering. The computational capabilities of GPUs were originally designed solely for use in a “graphical pipeline” sitting between a general-purpose compute device (CPU) and displays. The emergence of programming support for using the computation without sending results to the display, and subsequent extensions to GPU designs, led to a more generalized compute capability being associated with many GPUs. OpenCL and CUDA are two popular programming models utilized to harness the compute aspects of GPUs. See also, heterogeneous platforms.
Grain
, as in
coarse-grained parallelism
or
fine-grained parallelism, or grain size
, all refer to the concept of “how much work” gets done before moving to a new task and/or potentially synchronizing. Programs scale best when grains are as large as possible (so threads can run independently) but small enough to keep every compute resource fully busy (load balancing). These two factors are somewhat at odds with each other, which creates the need to consider grain size. TBB works to automate partitioning, but a programmer can still help tune for the best performance based on knowledge of their algorithms.
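A hedged sketch of controlling grain size explicitly with TBB: the blocked_range is given a grain size of 1024 iterations (an arbitrary illustrative value), and simple_partitioner guarantees chunks are not subdivided below that size.

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>
    #include <vector>

    void scale(std::vector<double>& v) {
      tbb::parallel_for(
          tbb::blocked_range<size_t>(0, v.size(), /*grainsize=*/1024),
          [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
              v[i] *= 3.0;
          },
          tbb::simple_partitioner());  // never split chunks below the grain size
    }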
Gustafson-Barsis’ law
is a different view on
Amdahl’s Law
that factors in the fact that as problem sizes grow, the serial portion of computations tend to shrink as a percentage of the total work to be done.
Hardware thread
is a hardware implementation of a task with a separate flow of control. Multiple hardware threads can be implemented using multiple cores, or can run concurrently or simultaneously on one core in order to hide latency using methods such as hyper-threading of a processor core. In the latter case (hyperthreading or simultaneous multithreading, SMT), it is said that a physical core features several logical cores (or hardware threads).
Heterogeneous platforms
consist of a mixture of compute devices instead of a homogeneous collection of only CPUs. Heterogeneous computing is usually employed to provide specific acceleration via an attached device, such as a GPU, DSP, FPGA, and so on. See also
OpenCL
.
High-Performance Computing (HPC)
refers to the highest performance computing available at a point in time, which today generally means at least a petaFLOP/s of computational capability. The term HPC is occasionally used as a synonym for supercomputing, although supercomputing is probably more specific to even higher performance systems. While the use of HPC is spreading to more industries, it is generally associated with helping solve the most challenging problems in science and engineering. High-performance data analytics workloads, often using Artificial Intelligence (AI) and Machine Learning (ML) techniques, qualify as HPC workloads in their larger instantiations and often combine well with long standing (traditional) HPC workloads.
Hyper-threading
refers to multithreading on a single processor core with the purpose of more fully utilizing the functional units in an out-of-order core by bringing together more instructions than one software thread can supply. With hyper-threading, multiple hardware threads may run on one core and share resources, but some benefit is still obtained from parallelism or concurrency. Typically, each hyper-thread has, at least, its own register file and program counter, so that switching between hyper-threads is relatively lightweight. Hyper-threading is associated with Intel; see also simultaneous multithreading.
Latency
is the time it takes to complete a task; that is, the time between when the task begins and when it ends. Latency has units of time. The scale can be anywhere from nanoseconds to days. Lower latency is better in general.
Latency hiding
schedules computations on a processing element while other tasks using that core are waiting for long-latency operations to complete, such as memory or disk transfers. The latency is not actually hidden, since each task still takes the same time to complete, but more tasks can be completed in a given time since resources are shared more efficiently, so throughput is improved.
Load balancing
assigns tasks to resources while handling uneven sizes of tasks. Optimally, the goal of load balancing is to keep all compute devices busy with minimal waste due to overhead.
Locality
refers to utilizing memory locations that are closer, rather than further, apart. This will maximize reuse of cache lines, memory pages, and so on. Maintaining a high degree of locality of reference is a key to scaling.
Lock
is a mechanism for implementing
mutual exclusion
. Before entering a mutual exclusion region, a thread must first try to acquire a lock on that region. If the lock has already been acquired by another thread, the current thread must
block
, which it may do by either suspending operation or spinning. When the lock is released, then the current thread is free to acquire it. Locks can be implemented using
atomic operations
, which are themselves a form of mutual exclusion on basic operations, implemented in hardware.
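A minimal sketch of a lock protecting a mutual exclusion region in standard C++ (the variable names are arbitrary): std::scoped_lock acquires the mutex on entry, blocks if another thread holds it, and releases it automatically when the scope exits.

    #include <mutex>

    std::mutex m;
    long total = 0;

    void add(long x) {
      std::scoped_lock lock(m);  // blocks here if another thread holds m
      total += x;                // mutual exclusion region
    }                            // lock released when 'lock' is destroyed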
Loop-carried dependence
exists if the same data item (e.g., element [3] of an array) is written in one iteration of a loop and read in a different iteration of the loop. If there are no loop-carried dependencies, a loop can be vectorized or parallelized. If there is a loop-carried dependence, the direction (prior iteration vs. future iteration, also known as backward or forward) and the distance (the number of iterations separating the read and write) must be considered.
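A brief sketch contrasting the two cases (the function and array names are illustrative): the first loop has no loop-carried dependence and can be vectorized or parallelized; the second reads a value written by the previous iteration, a backward dependence of distance 1, so its iterations cannot simply run in parallel.

    // No loop-carried dependence: every iteration is independent.
    void add(int n, const float* a, const float* b, float* c) {
      for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
    }

    // Loop-carried dependence: iteration i reads a[i-1], which iteration
    // i-1 wrote (backward dependence, distance 1).
    void running_sum(int n, float* a, const float* b) {
      for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
    }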
Many-core processor
is a
multicore
processor with so many cores that in practice we do not enumerate them; there are just “lots.” The term has been generally used with processors with 32 or more cores, but there is no precise definition.
Megahertz era
is a historical period of time during which processors doubled clock rates at a rate similar to the doubling of transistors in a design, roughly every 2 years. Such rapid rise in processor clock speeds ceased at just under 4 GHz (four thousand megahertz) in 2004. Designs shifted toward adding more cores marking the shift to the
multicore era
.
Moore’s Law
is an observation that, over the history of semiconductors, the number of transistors in a dense integrated circuit has doubled approximately every 2 years.
Message Passing Interface (MPI)
is an industry-standard message-passing system designed to exchange data on a wide variety of parallel computers.
Multicore
is a processor with multiple subprocessors, each subprocessor (known as a
core
) supporting at least one hardware thread.
Multicore era
is the time after which processor designs shifted away from rapidly rising clock rates and shifted toward adding more cores. This era began roughly in 2005.
Node (in a cluster)
refers to a shared memory computer, often on a single board with multiple processors, that is connected with other nodes to form a
cluster
computer or supercomputer.
Nondeterministic:
Exhibiting a lack of deterministic behavior, so results can vary from run to run of an algorithm. Concurrency is not the only factor that can lead to nondeterministic algorithms, but in practice it is often the cause. See more in the definition for
deterministic
.
Non-Uniform Memory Access (NUMA):
Used to categorize memory design characteristics in a distributed memory architecture. NUMA = memory access latency is different for different memories. UMA = memory access latency is the same for all memory. Compare with UMA. See Chapter
20
.
Offload:
Placing part of a computation on an attached device such as an FPGA, GPU, or other accelerator.
OpenCL (Open Computing Language)
is a framework for writing programs that execute across heterogeneous platforms. OpenCL provides host APIs for controlling offloading and attached devices, and extensions to C/C++ to express code to run on the attached accelerator (GPUs, DSPs, FPGAs, etc.) with the ability to use the CPU as a fallback if the attached device is not present or available at runtime.
OpenMP
is an API that supports multiplatform shared memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems. It is made up of a set of compiler directives, library routines, and environment variables that influence runtime behavior. OpenMP is managed by the nonprofit technology consortium OpenMP Architecture Review Board and is jointly defined by a group of major computer hardware and software vendors (
http://openmp.org
).
Parallel
means actually happening simultaneously. Two tasks that are both actually doing work at some point in time are considered to be operating in parallel. When a distinction is made between concurrent and parallel, the key is whether work can ever be done simultaneously. Multiplexing of a single processor core, by multitasking operating systems, has allowed concurrency for decades even when simultaneous execution was impossible because there was only one processing core.
Parallelism
is doing more than one thing at a time. Attempts to classify types of parallelism are numerous.
Parallelization
is the act of transforming code to enable simultaneous activities. The parallelization of a program allows at least parts of it to execute in parallel.
Pattern
is a general, reusable solution to a common problem. Historically, TBB has used the term parallel algorithm when parallel pattern would also be an appropriate description.
PFLOP/s
(PetaFLOP/s) = 10^15 Floating-Point Operations per second.
PFLOPs
(PetaFLOPs) = 10^15 Floating-Point Operations.
Race conditions
are nondeterministic behaviors in a parallel program and are generally programming errors. A race condition occurs when concurrent tasks perform operations on the same memory location without proper synchronization, and one of the memory operations is a write. Code with a race may operate correctly sometimes and fail other times.
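A hedged sketch of a race: two threads increment the same plain int with no synchronization, so the read and write from different threads can interleave, updates can be lost, and the printed value varies from run to run. Declaring the counter std::atomic<int>, or guarding it with a mutex, removes the race.

    #include <iostream>
    #include <thread>

    int main() {
      int counter = 0;                   // shared and unsynchronized
      auto work = [&] {
        for (int i = 0; i < 100000; ++i)
          ++counter;                     // racy read-modify-write
      };
      std::thread t1(work), t2(work);
      t1.join();
      t2.join();
      std::cout << counter << "\n";      // often less than 200000
      return 0;
    }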
Recursion
is the act of a function being reentered while an instance of the function is still active in the same thread of execution. In the simplest and most common case, a function directly calls itself, although recursion can also occur between multiple functions. Recursion is supported by storing the state for the continuations of partially completed functions in dynamically allocated memory, such as on a stack, although if higher-order functions are supported a more complex memory allocation scheme may be required. Bounding the amount of recursion can be important to prevent excessive use of memory.
Scalability
is a measure of the increase in performance as a function of the availability of more hardware to use in parallel.
Scalable:
An application is
scalable
if its performance increases when additional parallel hardware resources are added. The term
strong-scaling
refers to scaling achieved on a fixed problem size as more compute resources are added.
Weak-scaling
refers to scaling achieved only when the problem size is scaled up as additional compute resources are added. See
scalability
.
Serial
means neither concurrent nor parallel.
Serialization
occurs when the tasks in a potentially parallel algorithm are executed in a specific serial order, typically due to resource constraints. The opposite of parallelization.
Shared memory:
When two units of parallel work can access data in the same location. Normally doing this safely requires synchronization. The units of parallel work, processes, threads, and tasks can all share data this way, if the physical memory system allows it. However, processes do not share memory by default and special calls to the operating system are required to set it up.
SIMD:
Single-instruction-multiple-data referring to the ability to process multiple pieces of data (such as elements of an array) with all the same operation. SIMD is a computer architecture within a widely used classification system known as Flynn’s taxonomy, first proposed in 1966.
Simultaneous multithreading
refers to multithreading on a single processor core. See also, hyper-threading.
Software thread
is a virtual hardware thread; in other words, a single flow of execution in software intended to map one for one to a hardware thread. An operating system typically enables many more software threads to exist than there are actual hardware threads, by mapping software threads to hardware threads as necessary. Having more software threads than hardware threads is known as
oversubscription
.
Spatial locality:
Nearby when measured in terms of distance (in memory address). Compare with temporal locality. Spatial locality refers to a program behavior where the use of one data element indicates that data nearby, often the next data element, will probably be used soon. Algorithms exhibiting good spatial locality in data usage can benefit from cache line structures and prefetching hardware, both common components in modern computers.
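A brief sketch using a row-major two-dimensional array (the array size is arbitrary): the first routine walks consecutive addresses and uses every element of each cache line it loads, while the second strides across rows and gets only one element per cache-line visit.

    const int N = 1024;
    float a[N][N];

    double sum_rowwise() {
      double sum = 0;
      // Good spatial locality: the inner loop touches consecutive addresses.
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          sum += a[i][j];
      return sum;
    }

    double sum_columnwise() {
      double sum = 0;
      // Poor spatial locality: consecutive accesses are N floats apart.
      for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
          sum += a[i][j];
      return sum;
    }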
Speedup
is the ratio between the latency for solving a problem with one processing unit vs. the latency for solving the same problem with multiple processing units in parallel.
SPMD:
Single-program-multiple-data refers to the ability to process multiple pieces of data (such as elements of an array) with the same program, in contrast with a more restrictive SIMD architecture. SPMD most often refers to message passing programming on distributed memory computer architectures. SPMD is a subcategory of MIMD computer architectures within a widely used classification system known as Flynn’s taxonomy, first proposed in 1966.
STL
(Standard Template Library) is a part of the C++ standard
.
Strangled scaling
refers to a programming error in which the performance of parallel code is poor due to high contention or overhead, so much so that it may underperform the nonparallel (serial) code.
Symmetric Multiprocessor (SMP)
is a multiprocessor system with shared memory and running a single operating system.
Synchronization:
The coordination of tasks or threads in order to obtain the desired runtime order. Commonly used to avoid undesired race conditions.
Task:
A lightweight unit of potential parallelism with its own control flow, generally implemented at a user-level as opposed to OS-managed threads. Unlike threads, tasks are usually serialized on a single core and run to completion. When contrasted with “thread” the distinction is made that tasks are pieces of work without any assumptions about where they will run, while threads have a one-to-one mapping of software threads to hardware threads. Threads are a mechanism for executing tasks in parallel, while tasks are units of work that merely provide the
opportunity
for parallel execution; tasks are not themselves a mechanism of parallel execution.
Task parallelism:
An attempt to classify parallelism as more oriented around tasks than data. We deliberately avoid this term, task parallelism, because its meaning varies. In particular, elsewhere “task parallelism” can refer to tasks generated by functional decomposition
or
to irregular tasks that are still generated by data decomposition. In this book, any parallelism generated by data decomposition, regular or irregular, is considered data parallelism.
TBB:
See Threading Building Blocks (TBB).
Temporal locality
means nearby when measured in terms of time. Compare with spatial locality. Temporal locality refers to a program behavior in which data is likely to be reused relatively soon. Algorithms exhibiting good temporal locality in data usage can benefit from data caching, which is common in modern computers. It is not unusual to be able to achieve both temporal and spatial locality in data usage. Computer systems are generally more likely to achieve optimal performance when both are achieved, hence the interest in designing algorithms to do so.
Thread
could refer to a
software
or
hardware
thread. In general, a “software thread” is any software unit of parallel work with an independent flow of control, and a “hardware thread” is any hardware unit capable of executing a single flow of control (in particular, a hardware unit that maintains a single program counter). When “thread” is compared with “task,” the distinction is made that tasks are pieces of work without any assumptions about where they will run, while threads have a one-to-one mapping of software threads to hardware threads. Threads are a mechanism for implementing tasks. A multitasking or multithreading operating system will multiplex multiple software threads onto a single hardware thread by interleaving execution via software-created time slices. A multicore or many-core processor consists of multiple cores, each able to execute at least one independent software thread through duplication of hardware. A multithreaded or hyper-threaded processor core will multiplex a single core to execute multiple software threads through interleaving of software threads via hardware mechanisms.
Thread parallelism
is a mechanism for implementing parallelism in hardware using a separate flow of control for each task.
Thread local storage
refers to data that is purposefully allocated with the intent that it be accessed from only a single thread, at least during concurrent computations. The goal is to avoid the need for synchronization during the most computationally intense parts of an algorithm. A classic example of thread local storage is creating partial sums when adding all the numbers in a large array: subregions are first summed in parallel into local partial sums (also known as privatized variables) that, by virtue of being local/private, require no global synchronization to accumulate into.
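A hedged sketch of the partial-sums example using tbb::combinable, one of the ways TBB exposes thread local storage (tbb::parallel_reduce would be the more direct tool for a sum, but the point here is the private per-thread copies): each thread accumulates into its own local() value with no locking, and the copies are merged once at the end.

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/combinable.h>
    #include <vector>

    double sum(const std::vector<double>& v) {
      tbb::combinable<double> partial([] { return 0.0; });  // one private sum per thread
      tbb::parallel_for(
          tbb::blocked_range<size_t>(0, v.size()),
          [&](const tbb::blocked_range<size_t>& r) {
            double& local = partial.local();                // this thread's copy
            for (size_t i = r.begin(); i != r.end(); ++i)
              local += v[i];                                // no synchronization needed
          });
      return partial.combine([](double x, double y) { return x + y; });
    }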
Threading Building Blocks (TBB)
is the most popular abstract solution for parallel programming in C++. TBB is an open source project created by Intel that has been ported to a very wide range of operating systems and processors from many vendors. Although TBB is more popular than OpenMP in terms of the number of developers using it, the two seldom compete for developers in reality: TBB is popular with C++ programmers, whereas OpenMP is most used by Fortran and C programmers.
Throughput
is the rate at which tasks are completed, given a set of tasks to be performed. Throughput measures the rate of computation, and it is given in units of tasks per unit time. See
bandwidth
and
latency
.
TFLOP/s
(TeraFLOP/s) = 10^12 Floating-Point Operations per second.
TFLOPs
(TeraFLOPs) = 10^12 Floating-Point Operations.
Tiling
is when you divide a loop into a set of parallel tasks of a suitable granularity. In general, tiling consists of applying multiple steps on a smaller part of a problem instead of running each step on the whole problem one after the other. The purpose of tiling is to increase reuse of data in caches. Tiling can lead to dramatic performance increases when a whole problem does not fit in cache. We prefer the term “tiling” instead of “blocking” and “tile” instead of “block”; tiling and tile have become the more common terms in recent times.
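A hedged sketch of tiling applied to a matrix transpose; the tile size of 64 is an arbitrary illustrative choice. Instead of striding across the whole matrix, the loops finish one tile at a time so the source and destination tiles both stay in cache.

    #include <algorithm>

    // Transpose an n x n row-major matrix 'a' into 'b', one tile at a time.
    void transpose_tiled(const float* a, float* b, int n) {
      const int TILE = 64;                                // illustrative tile size
      for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
          for (int i = ii; i < std::min(ii + TILE, n); ++i)
            for (int j = jj; j < std::min(jj + TILE, n); ++j)
              b[j * n + i] = a[i * n + j];
    }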
TLB
is an abbreviation for Translation Lookaside Buffer. A TLB is a specialized cache that is used to hold translations of virtual to physical page addresses. The number of elements in the TLB determines how many pages of memory can be accessed simultaneously with good efficiency. Accessing a page not in the TLB will cause a TLB miss. A TLB miss typically causes a trap to the operating system so that the page table can be referenced and the TLB updated.
Trip count
is the number of times a given loop will execute (“trip”); same thing as
iteration count
.
Uniform Memory Access (UMA):
Used to categorize memory design characteristics in a distributed memory architecture. UMA = memory access latency is the same for all memory. NUMA = memory access latency is different for different memories. Compare with NUMA. See Chapter
20
.
Vector operations
are low-level operations that can act on multiple data elements at once in SIMD fashion.
Vector parallelism
is a mechanism for implementing parallelism in hardware using the same flow of control on multiple data elements.
Vectorization
is the act of transforming code to enable simultaneous computations using vector hardware. SIMD instruction sets such as MMX, SSE, AVX, AVX2, and AVX-512 utilize vector hardware, but vector hardware outside of CPUs may come in other forms that are also targeted by vectorization. The vectorization of code tends to enhance performance because more data is processed per instruction than would be done otherwise. See also
vectorize
.
Vectorize
refers to converting a program from a scalar implementation to a vectorized implementation to utilize vector hardware such as SIMD instructions (e.g., MMX, SSE, AVX, AVX2, AVX-512). Vectorization is a specialized form of parallelism.
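A brief sketch of a loop that compilers commonly vectorize (the function is the familiar saxpy kernel; the OpenMP simd pragma is optional and simply asserts that iterations are independent so the compiler may use SIMD instructions such as AVX, which can process eight floats at a time).

    void saxpy(int n, float a, const float* x, float* y) {
      #pragma omp simd
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];    // each iteration is independent
    }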
vi
is a text-based editor that was shipped with most UNIX and BSD systems written by Bill Joy, popular only to those who have yet to discover emacs (according to James). Yes, it is open source. Compares unfavorably to emacs and Atom. Yes, Ron, James did look at the “vi” nutshell book you gave to him… he still insists on using vi just long enough to get emacs downloaded and installed.
Virtual memory
decouples the address used by software from the physical addresses of real memory. The translation from virtual addresses to physical addresses is done in hardware that is initialized and controlled by the operating system.