Copyright 2020 Manning Publications
Welcome
Brief contents
Part 1
1 Why parallel computing
1.1 Why should you learn about parallel computing?
1.1.1 What are the potential benefits of parallel computing?
1.1.2 Parallel computing cautions
1.2 The fundamental laws of parallel computing
1.2.1 The limit to parallel computing: Amdahl’s Law
1.2.2 Breaking through the parallel limit: Gustafson-Barsis’s Law
1.3 How does parallel computing work?
1.3.1 Walk through a sample application
1.3.2 A hardware model for today’s heterogeneous parallel systems
1.3.3 The application/software model for today’s heterogeneous parallel systems
1.4 Categorizing parallel approaches
1.5 Parallel strategies
1.6 Parallel speedup versus comparative speedups: two different measures
1.7 What will you learn in this book?
1.7.1 Exercises
1.8 Summary
2 Planning for parallel
2.1 Approaching a new project: the preparation
2.1.1 Version control: creating a safety vault for your parallel code
2.1.2 Test suites: the first step to creating a robust, reliable application
2.1.3 Finding and fixing memory issues
2.1.4 Improving code portability
2.2 Profiling step: probing the gap between system capabilities and application performance
2.3 Planning step: a foundation for success
2.3.1 Exploring with benchmarks and mini-apps
2.3.2 Design of the core data structures and code modularity
2.3.3 Algorithms: redesign for parallel
2.4 Implementation step: where it all happens
2.5 Commit step: wrapping it up with quality
2.6 Further explorations
2.6.1 Additional reading
2.6.2 Exercises
2.7 Summary
3 Performance limits and profiling
3.1 Know your application’s potential performance limits
3.2 Determine your hardware capabilities: benchmarking
3.2.1 Tools for gathering system characteristics
3.2.2 Calculating theoretical maximum FLOPS
3.2.3 The memory hierarchy and theoretical memory bandwidth
3.2.4 Empirical measurement of bandwidth and flops
3.2.5 Calculating the machine balance between flops and bandwidth
3.3 Characterizing your application: profiling
3.3.1 Profiling tools
3.3.2 Empirical measurement of processor clock frequency and energy consumption
3.3.3 Tracking memory during runtime
3.4 Further explorations
3.4.1 Additional reading
3.4.2 Exercises
3.5 Summary
4 Data design and performance models
4.1 Performance data structures: data-oriented design
4.1.1 Multidimensional arrays
4.1.2 Array of Structures (AOS) versus Structures of Arrays (SOA)
4.1.3 Array of Structure of Arrays (AOSOA)
4.2 Three C’s of cache misses: compulsory, capacity, conflict
4.3 Simple performance models: a case study
4.3.1 Full matrix data representations
4.3.2 Compressed sparse storage representations
4.4 Advanced performance models
4.5 Network messages
4.6 Further explorations
4.6.1 Additional reading
4.6.2 Exercises
4.7 Summary
5 Parallel algorithms and patterns
5.1 Algorithm analysis for parallel computing applications
5.2 Parallel algorithms: what are they?
5.3 What is a hash function?
5.4 Spatial hashing: a highly parallel algorithm
5.4.1 Using perfect hashing for spatial mesh operations
5.4.2 Using compact hashing for spatial mesh operations
5.5 Prefix sum (scan) pattern and its importance in parallel computing
5.5.1 Step-efficient parallel scan operation
5.5.2 Work-efficient parallel scan operation
5.5.3 Parallel scan operations for large arrays
5.6 Parallel global sum: addressing the problem of associativity
5.7 Future of parallel algorithm research
5.8 Further explorations
5.8.1 Additional reading
5.8.2 Exercises
5.9 Summary
Part 2
6 Vectorization: FLOPs for free
6.1 Vectorization and Single-Instruction, Multiple-Data (SIMD) overview
6.2 Hardware trends
6.3 Vectorization methods
6.3.1 Optimized libraries provide performance for little effort
6.3.2 Auto-vectorization: the easy way to vectorization speed-up (most of the time)
6.3.3 Teaching the compiler through hints: pragmas and directives
6.3.4 Crappy loops, we got them: use vector intrinsics
6.3.5 Not for the faint of heart: using assembler code for vectorization
6.4 Programming style for better vectorization
6.5 Compiler flags relevant for vectorization for various compilers
6.6 OpenMP SIMD directives for better portability
6.7 Further explorations
6.7.1 Additional reading
6.7.2 Exercises
6.8 Summary
7 OpenMP that performs
7.1 OpenMP introduction
7.1.1 OpenMP concepts
7.1.2 A very simple OpenMP program
7.2 Typical OpenMP use cases: loop-level, high-level, and MPI+OpenMP
7.2.1 Loop-level OpenMP for quick parallelization
7.2.2 High-level OpenMP for better parallel performance
7.2.3 MPI + OpenMP for extreme scalability
7.3 Examples of standard loop-level OpenMP
7.3.1 Loop-level OpenMP: Vector addition example
7.3.2 Stream triad example
7.3.3 Loop-level OpenMP: Stencil example
7.3.4 Performance of loop-level examples
7.3.5 Reduction example of a global sum using OpenMP threading
7.3.6 Potential loop-level OpenMP issues
7.4 Variable scope is critically important in OpenMP for correctness
7.5 Function-level OpenMP: making a whole function thread parallel
7.6 Improving parallel scalability with high-level OpenMP
7.6.1 How to implement high-level OpenMP
7.6.2 Example of implementing high-level OpenMP
7.7 Hybrid threading and vectorization with OpenMP
7.8 Advanced examples using OpenMP
7.8.1 Stencil example with a separate pass for the x and y directions
7.8.2 Kahan summation implementation with OpenMP threading
7.8.3 Threaded implementation of the prefix scan algorithm
7.9 Threading tools essential for robust implementations
7.9.1 Using Allinea MAP to get a quick high-level profile of your application
7.9.2 Finding your thread race conditions with Intel Inspector
7.10 Example of task-based support algorithm
7.11 Further explorations
7.11.1 Additional reading
7.11.2 Exercises
7.12 Summary
8 MPI: the parallel backbone
8.1 The basics for an MPI program
8.1.1 Basic MPI function calls for every MPI program
8.1.2 Compiler wrappers for simpler MPI programs
8.1.3 Using parallel startup commands
8.1.4 Minimum working example of an MPI program
8.2 The send and receive commands for process-to-process communication
8.3 Collective communication: a powerful component of MPI
8.3.1 Using a barrier to synchronize timers
8.3.2 Using the broadcast to handle small file input
8.3.3 Using a reduction to get a single value from across all processes
8.3.4 Using gather to put order in debug printouts
8.3.5 Using scatter and gather to send data out to processes for work
8.4 Data parallel examples
8.4.1 Stream triad to measure bandwidth on the node
8.4.2 Ghost cell exchanges in a two-dimensional mesh
8.4.3 Ghost cell exchanges in a three-dimensional stencil calculation
8.5 Advanced MPI functionality to simplify code and enable optimizations
8.5.1 Using custom MPI datatypes for performance and code simplification
8.5.2 Cartesian topology support in MPI
8.5.3 Performance tests of ghost cell exchange variants
8.6 Hybrid MPI+OpenMP for extreme scalability
8.6.1 Hybrid MPI+OpenMP benefits
8.6.2 MPI+OpenMP example
8.7 Further explorations
8.7.1 Additional reading
8.7.2 Exercises
8.8 Summary
Part 3
9 GPU architectures and concepts
9.1 The CPU-GPU system as an accelerated computational platform
9.1.1 Integrated GPUs: an underutilized option on commodity-based systems
9.1.2 Dedicated GPUs: the workhorse option
9.2 The GPU and the thread engine
9.2.1 The compute unit is the multiprocessor
9.2.2 Processing elements are the individual processors
9.2.3 Multiple data operations by each processing element
9.2.4 Calculating the peak theoretical flops for some leading GPUs
9.3 Characteristics of GPU memory spaces
9.3.1 Calculating theoretical peak memory bandwidth
9.3.2 Measuring the GPU stream benchmark
9.3.3 Roofline performance model for GPUs
9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
9.4 The PCI bus: CPU-GPU data transfer overhead
9.4.1 Theoretical bandwidth of the PCI bus
9.4.2 A benchmark application for PCI bandwidth
9.5 Multi-GPU platforms and MPI
9.5.1 Optimizing the data movement from one GPU across a network to another
9.5.2 A higher performance alternative to the PCI bus
9.6 Potential benefits of GPU accelerated platforms
9.6.1 Reducing time-to-solution
9.6.2 Reducing energy use with GPUs
9.6.3 Reduction in cloud computing costs with GPUs
9.7 When to use GPUs
9.8 Further explorations
9.8.1 Additional reading
9.8.2 Exercises
9.9 Summary
10 GPU programming model
10.1 GPU programming abstractions: a common framework
10.1.1 Data decomposition into independent units of work: an NDRange or grid
10.1.2 Work groups provide a right-sized chunk of work
10.1.3 Subgroups, warps, or wavefronts execute in lockstep
10.1.4 Work item: the basic unit of operation
10.1.5 SIMD or vector hardware
10.2 The code structure for the GPU programming model
10.2.1 Me programming: the concept of a parallel kernel
10.2.2 Thread indices: mapping the local tile to the global world
10.2.3 Index sets
10.2.4 How to address memory resources in your GPU programming model
10.3 Optimizing GPU resource usage
10.3.1 How many registers does my kernel use?
10.3.2 Occupancy: making more work available for work group scheduling
10.4 Reduction pattern requires synchronization across work groups
10.5 Asynchronous computing through queues (streams)
10.6 Developing a plan to parallelize an application for GPUs
10.6.1 Case 1: 3D atmospheric simulation
10.6.2 Case 2: Unstructured mesh application
10.7 Further explorations
10.7.1 Additional reading
10.7.2 Exercises
10.8 Summary
11 Directive-based GPU programming
11.1 Process to apply directives and pragmas for a GPU implementation
11.2 OpenACC: the easiest way to run on your GPU
11.2.1 Compiling OpenACC code
11.2.2 Parallel compute regions in OpenACC for accelerating computations
11.2.3 Using directives to reduce data movement between the CPU and the GPU
11.2.4 Optimizing the GPU kernels
11.2.5 Summary of performance results for stream triad
11.2.6 Advanced OpenACC techniques
11.3 OpenMP: the heavyweight champ enters the world of accelerators
11.3.1 Compiling OpenMP code
11.3.2 Generating parallel work on the GPU with OpenMP
11.3.3 Creating data regions to control data movement to the GPU with OpenMP
11.3.4 Optimizing OpenMP for GPUs
11.3.5 Advanced OpenMP for GPUs
11.4 Further explorations
11.4.1 Additional reading
11.4.2 Exercises
11.5 Summary
12 GPU languages: getting down to basics
12.1 Features of a native GPU programming language
12.2 CUDA and HIP GPU languages: the low-level performance option
12.2.1 Writing and building your first CUDA application
12.2.2 A reduction kernel in CUDA: life gets complicated
12.2.3 Hipifying the CUDA code
12.3 OpenCL for a portable open source GPU language
12.3.1 Writing and building your first OpenCL application
12.3.2 Reductions in OpenCL
12.4 SYCL: an experimental C++ implementation goes mainstream
12.5 Higher-level languages for performance portability
12.5.1 Kokkos: a performance portability ecosystem
12.5.2 RAJA for a more adaptable performance portability layer
12.6 Further explorations
12.6.1 Additional reading
12.6.2 Exercises
12.7 Summary
13 GPU profiling and tools
13.1 Profiling tools overview
13.2 How to select a good workflow
13.3 Example problem: shallow water simulation
13.4 A sample of a profiling workflow
13.4.1 Run the shallow water code
13.4.2 Profile the CPU code
13.4.3 Add OpenACC compute directives
13.4.4 Add data movement directives
13.4.5 The NVIDIA Nsight suite of tools can be a powerful development aid
13.4.6 CodeXL for the AMD GPU ecosystem
13.5 Don’t get lost in the swamp: focus on the important metrics
13.5.1 Occupancy
13.5.2 Issue efficiency
13.5.3 Achieved bandwidth
13.6 Containers and virtual machines provide alternate workflows
13.6.1 Docker containers as a workaround
13.6.2 Virtual machines using VirtualBox
13.7 Cloud options: a flexible and portable capability
13.8 Further explorations
13.8.1 Additional reading
13.8.2 Exercises
13.9 Summary
Appendix A: References
Appendix B: Solutions to Exercises
Appendix C: Glossary