Copyright 2020 Manning Publications
Welcome
Brief contents
Part 1
1 Why parallel computing
1.1 Why should you learn about parallel computing?
1.1.1 What are the potential benefits of parallel computing?
1.1.2 Parallel computing cautions
1.2 The fundamental laws of parallel computing
1.2.1 The limit to parallel computing: Amdahl’s Law
1.2.2 Breaking through the parallel limit: Gustafson-Barsis’s Law
1.3 How does parallel computing work?
1.3.1 Walk through a sample application
1.3.2 A hardware model for today’s heterogeneous parallel systems
1.3.3 The application/software model for today’s heterogeneous parallel systems
1.4 Categorizing parallel approaches
1.5 Parallel strategies
1.6 Parallel speedup versus comparative speedups: two different measures
1.7 What will you learn in this book?
1.7.1 Exercises
1.8 Summary
2 Planning for parallel
2.1 Approaching a new project: the preparation
2.1.1 Version control: creating a safety vault for your parallel code
2.1.2 Test suites: the first step to creating a robust, reliable application
2.1.3 Finding and fixing memory issues
2.1.4 Improving code portability
2.2 Profiling step: probing the gap between system capabilities and application performance
2.3 Planning step: a foundation for success
2.3.1 Exploring with benchmarks and mini-apps
2.3.2 Design of the core data structures and code modularity
2.3.3 Algorithms: redesign for parallel
2.4 Implementation step: where it all happens
2.5 Commit step: wrapping it up with quality
2.6 Further explorations
2.6.1 Additional reading
2.6.2 Exercises
2.7 Summary
3 Performance limits and profiling
3.1 Know your application’s potential performance limits
3.2 Determine your hardware capabilities: benchmarking
3.2.1 Tools for gathering system characteristics
3.2.2 Calculating theoretical maximum FLOPS
3.2.3 The memory hierarchy and theoretical memory bandwidth
3.2.4 Empirical measurement of bandwidth and flops
3.2.5 Calculating the machine balance between flops and bandwidth
3.3 Characterizing your application: profiling
3.3.1 Profiling tools
3.3.2 Empirical measurement of processor clock frequency and energy consumption
3.3.3 Tracking memory during runtime
3.4 Further explorations
3.4.1 Additional reading
3.4.2 Exercises
3.5 Summary
4 Data design and performance models
4.1 Performance data structures: data-oriented design
4.1.1 Multidimensional arrays
4.1.2 Array of Structures (AOS) versus Structures of Arrays (SOA)
4.1.3 Array of Structure of Arrays (AOSOA)
4.2 Three C’s of cache misses: compulsory, capacity, conflict
4.3 Simple performance models: a case study
4.3.1 Full matrix data representations
4.3.2 Compressed sparse storage representations
4.4 Advanced performance models
4.5 Network messages
4.6 Further explorations
4.6.1 Additional reading
4.6.2 Exercises
4.7 Summary
5 Parallel algorithms and patterns
5.1 Algorithm analysis for parallel computing applications
5.2 Parallel algorithms: what are they?
5.3 What is a hash function?
5.4 Spatial hashing: a highly parallel algorithm
5.4.1 Using perfect hashing for spatial mesh operations
5.4.2 Using compact hashing for spatial mesh operations
5.5 Prefix sum (scan) pattern and its importance in parallel computing
5.5.1 Step-efficient parallel scan operation
5.5.2 Work-efficient parallel scan operation
5.5.3 Parallel scan operations for large arrays
5.6 Parallel global sum: addressing the problem of associativity
5.7 Future of parallel algorithm research
5.8 Further explorations
5.8.1 Additional reading
5.8.2 Exercises
5.9 Summary
Part 2
6 Vectorization: FLOPs for free
6.1 Vectorization and Single-Instruction, Multiple-Data (SIMD) overview
6.2 Hardware trends
6.3 Vectorization methods
6.3.1 Optimized libraries provide performance for little effort
6.3.2 Auto-vectorization: the easy way to vectorization speed-up (most of the time)
6.3.3 Teaching the compiler through hints: pragmas and directives
6.3.4 Crappy loops, we got them: use vector intrinsics
6.3.5 Not for the faint of heart: using assembler code for vectorization
6.4 Programming style for better vectorization
6.5 Compiler flags relevant for vectorization for various compilers
6.6 OpenMP SIMD directives for better portability
6.7 Further explorations
6.7.1 Additional reading
6.7.2 Exercises
6.8 Summary
7 OpenMP that performs
7.1 OpenMP introduction
7.1.1 OpenMP concepts
7.1.2 A very simple OpenMP program
7.2 Typical OpenMP use cases: loop-level, high-level, and MPI+OpenMP
7.2.1 Loop-level OpenMP for quick parallelization
7.2.2 High-level OpenMP for better parallel performance
7.2.3 MPI + OpenMP for extreme scalability
7.3 Examples of standard loop-level OpenMP
7.3.1 Loop-level OpenMP: Vector addition example
7.3.2 Stream triad example
7.3.3 Loop-level OpenMP: Stencil example
7.3.4 Performance of loop-level examples
7.3.5 Reduction example of a global sum using OpenMP threading
7.3.6 Potential loop-level OpenMP issues
7.4 Variable scope is critically important in OpenMP for correctness
7.5 Function-level OpenMP: making a whole function thread parallel
7.6 Improving parallel scalability with high-level OpenMP
7.6.1 How to implement high-level OpenMP
7.6.2 Example of implementing high-level OpenMP
7.7 Hybrid threading and vectorization with OpenMP
7.8 Advanced examples using OpenMP
7.8.1 Stencil example with a separate pass for the x and y directions
7.8.2 Kahan summation implementation with OpenMP threading
7.8.3 Threaded implementation of the prefix scan algorithm
7.9 Threading tools essential for robust implementations
7.9.1 Using Allinea MAP to get a quick high-level profile of your application
7.9.2 Finding your thread race conditions with Intel Inspector
7.10 Example of task-based support algorithm
7.11 Further explorations
7.11.1 Additional reading
7.11.2 Exercises
7.12 Summary
8 MPI: the parallel backbone
8.1 The basics for an MPI program
8.1.1 Basic MPI function calls for every MPI program
8.1.2 Compiler wrappers for simpler MPI programs
8.1.3 Using parallel startup commands
8.1.4 Minimum working example of an MPI program
8.2 The send and receive commands for process-to-process communication
8.3 Collective communication: a powerful component of MPI
8.3.1 Using a barrier to synchronize timers
8.3.2 Using the broadcast to handle small file input
8.3.3 Using a reduction to get a single value from across all processes
8.3.4 Using gather to put order in debug printouts
8.3.5 Using scatter and gather to send data out to processes for work
8.4 Data parallel examples
8.4.1 Stream triad to measure bandwidth on the node
8.4.2 Ghost cell exchanges in a two-dimensional mesh
8.4.3 Ghost cell exchanges in a three-dimensional stencil calculation
8.5 Advanced MPI functionality to simplify code and enable optimizations
8.5.1 Using custom MPI datatypes for performance and code simplification
8.5.2 Cartesian topology support in MPI
8.5.3 Performance tests of ghost cell exchange variants
8.6 Hybrid MPI+OpenMP for extreme scalability
8.6.1 Hybrid MPI+OpenMP benefits
8.6.2 MPI+OpenMP example
8.7 Further explorations
8.7.1 Additional reading
8.7.2 Exercises
8.8 Summary
Part 3
9 GPU architectures and concepts
9.1 The CPU-GPU system as an accelerated computational platform
9.1.1 Integrated GPUs: an underutilized option on commodity-based systems
9.1.2 Dedicated GPUs: the workhorse option
9.2 The GPU and the thread engine
9.2.1 The compute unit is the multiprocessor
9.2.2 Processing elements are the individual processors
9.2.3 Multiple data operations by each processing element
9.2.4 Calculating the peak theoretical flops for some leading GPUs
9.3 Characteristics of GPU memory spaces
9.3.1 Calculating theoretical peak memory bandwidth
9.3.2 Measuring the GPU stream benchmark
9.3.3 Roofline performance model for GPUs
9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
9.4 The PCI bus: CPU-GPU data transfer overhead
9.4.1 Theoretical bandwidth of the PCI bus
9.4.2 A benchmark application for PCI bandwidth
9.5 Multi-GPU platforms and MPI
9.5.1 Optimizing the data movement from one GPU across a network to another
9.5.2 A higher performance alternative to the PCI bus
9.6 Potential benefits of GPU accelerated platforms
9.6.1 Reducing time-to-solution
9.6.2 Reducing energy use with GPUs
9.6.3 Reduction in cloud computing costs with GPUs
9.7 When to use GPUs
9.8 Further explorations
9.8.1 Additional reading
9.8.2 Exercises
9.9 Summary
10 GPU programming model
10.1 GPU programming abstractions: a common framework
10.1.1 Data decomposition into independent units of work: an NDRange or grid
10.1.2 Work groups provide a right-sized chunk of work
10.1.3 Subgroups, warps, or wavefronts execute in lockstep
10.1.4 Work item: the basic unit of operation
10.1.5 SIMD or vector hardware
10.2 The code structure for the GPU programming model
10.2.1 Me programming: the concept of a parallel kernel
10.2.2 Thread indices: mapping the local tile to the global world
10.2.3 Index sets
10.2.4 How to address memory resources in your GPU programming model
10.3 Optimizing GPU resource usage
10.3.1 How many registers does my kernel use?
10.3.2 Occupancy: making more work available for work group scheduling
10.4 Reduction pattern requires synchronization across work groups
10.5 Asynchronous computing through queues (streams)
10.6 Developing a plan to parallelize an application for GPUs
10.6.1 Case 1: 3D atmospheric simulation
10.6.2 Case 2: Unstructured mesh application
10.7 Further explorations
10.7.1 Additional reading
10.7.2 Exercises
10.8 Summary
11 Directive-based GPU programming
11.1 Process to apply directives and pragmas for a GPU implementation
11.2 OpenACC: the easiest way to run on your GPU
11.2.1 Compiling OpenACC code
11.2.2 Parallel compute regions in OpenACC for accelerating computations
11.2.3 Using directives to reduce data movement between the CPU and the GPU
11.2.4 Optimizing the GPU kernels
11.2.5 Summary of performance results for stream triad
11.2.6 Advanced OpenACC techniques
11.3 OpenMP: the heavyweight champ enters the world of accelerators
11.3.1 Compiling OpenMP code
11.3.2 Generating parallel work on the GPU with OpenMP
11.3.3 Creating data regions to control data movement to the GPU with OpenMP
11.3.4 Optimizing OpenMP for GPUs
11.3.5 Advanced OpenMP for GPUs
11.4 Further explorations
11.4.1 Additional reading
11.4.2 Exercises
11.5 Summary
12 GPU languages: getting down to basics
12.1 Features of a native GPU programming language
12.2 CUDA and HIP GPU languages: the low-level performance option
12.2.1 Writing and building your first CUDA application
12.2.2 A reduction kernel in CUDA: life gets complicated
12.2.3 Hipifying the CUDA code
12.3 OpenCL for a portable open source GPU language
12.3.1 Writing and building your first OpenCL application
12.3.2 Reductions in OpenCL
12.4 SYCL: an experimental C++ implementation goes mainstream
12.5 Higher-level languages for performance portability
12.5.1 Kokkos: a performance portability ecosystem
12.5.2 RAJA for a more adaptable performance portability layer
12.6 Further explorations
12.6.1 Additional reading
12.6.2 Exercises
12.7 Summary
13 GPU profiling and tools
13.1 Profiling tools overview
13.2 How to select a good workflow
13.3 Example problem: shallow water simulation
13.4 A sample of a profiling workflow
13.4.1 Run the shallow water code
13.4.2 Profile the CPU code
13.4.3 Add OpenACC compute directives
13.4.4 Add data movement directives
13.4.5 The NVIDIA Nsight suite of tools can be a powerful development aid
13.4.6 CodeXL for the AMD GPU ecosystem
13.5 Don’t get lost in the swamp: focus on the important metrics
13.5.1 Occupancy
13.5.2 Issue efficiency
13.5.3 Achieved bandwidth
13.6 Containers and virtual machines provide alternate workflows
13.6.1 Docker containers as a workaround
13.6.2 Virtual machines using VirtualBox
13.7 Cloud options: a flexible and portable capability
13.8 Further explorations
13.8.1 Additional reading
13.8.2 Exercises
13.9 Summary
Appendix A: References
Appendix B: Solutions to Exercises
Appendix C: Glossary