Index
Copyright 2020 Manning Publications
Welcome
Brief contents
Part 1
1 Why parallel computing
1.1 Why should you learn about parallel computing?
1.1.1 What are the potential benefits of parallel computing?
1.1.2 Parallel computing cautions
1.2 The fundamental laws of parallel computing
1.2.1 The limit to parallel computing: Amdahl’s Law
1.2.2 Breaking through the parallel limit: Gustafson-Barsis’s Law
1.3 How does parallel computing work?
1.3.1 Walk through a sample application
1.3.2 A hardware model for today’s heterogeneous parallel systems
1.3.3 The application/software model for today’s heterogeneous parallel systems
1.4 Categorizing parallel approaches
1.5 Parallel strategies
1.6 Parallel speedup vs comparative speedups: two different measures
1.7 What will you learn in this book?
1.7.1 Exercises
1.8 Summary
2 Planning for parallel
2.1 Approaching a new project: the preparation
2.1.1 Version control: creating a safety vault for your parallel code
2.1.2 Test suites: the first step to creating a robust, reliable application
2.1.3 Finding and fixing memory issues
2.1.4 Improving code portability
2.2 Profiling step: probing the gap between system capabilities and application performance
2.3 Planning step: a foundation for success
2.3.1 Exploring with benchmarks and mini-apps
2.3.2 Design of the core data structures and code modularity
2.3.3 Algorithms: redesign for parallel
2.4 Implementation step: where it all happens
2.5 Commit step: wrapping it up with quality
2.6 Further explorations
2.6.1 Additional reading
2.6.2 Exercises
2.7 Summary
3 Performance limits and profiling
3.1 Know your application’s potential performance limits
3.2 Determine your hardware capabilities: benchmarking
3.2.1 Tools for gathering system characteristics
3.2.2 Calculating theoretical maximum FLOPs
3.2.3 The memory hierarchy and theoretical memory bandwidth
3.2.4 Empirical measurement of bandwidth and FLOPs
3.2.5 Calculating the machine balance between FLOPs and bandwidth
3.3 Characterizing your application: profiling
3.3.1 Profiling tools
3.3.2 Empirical measurement of processor clock frequency and energy consumption
3.3.3 Tracking memory during runtime
3.4 Further explorations
3.4.1 Additional reading
3.4.2 Exercises
3.5 Summary
4 Data design and performance models
4.1 Performance data structures: data-oriented design
4.1.1 Multidimensional arrays
4.1.2 Array of Structures (AOS) versus Structures of Arrays (SOA)
4.1.3 Array of Structures of Arrays (AOSOA)
4.2 Three C’s of cache misses: compulsory, capacity, conflict
4.3 Simple performance models: a case study
4.3.1 Full matrix data representations
4.3.2 Compressed sparse storage representations
4.4 Advanced performance models
4.5 Network messages
4.6 Further explorations
4.6.1 Additional reading
4.6.2 Exercises
4.7 Summary
5 Parallel algorithms and patterns
5.1 Algorithm analysis for parallel computing applications
5.2 Parallel algorithms: what are they?
5.3 What is a hash function?
5.4 Spatial hashing: a highly parallel algorithm
5.4.1 Using perfect hashing for spatial mesh operations
5.4.2 Using compact hashing for spatial mesh operations
5.5 Prefix sum (scan) pattern and its importance in parallel computing
5.5.1 Step-efficient parallel scan operation
5.5.2 Work-efficient parallel scan operation
5.5.3 Parallel scan operations for large arrays
5.6 Parallel global sum: addressing the problem of associativity
5.7 Future of parallel algorithm research
5.8 Further explorations
5.8.1 Additional reading
5.8.2 Exercises
5.9 Summary
Part 2
6 Vectorization: FLOPs for free
6.1 Vectorization and Single Instruction, Multiple Data (SIMD) overview
6.2 Hardware trends
6.3 Vectorization methods
6.3.1 Optimized libraries provide performance for little effort
6.3.2 Auto-vectorization: the easy way to vectorization speed-up (most of the time)
6.3.3 Teaching the compiler through hints: pragmas and directives
6.3.4 Crappy loops, we got them: use vector intrinsics
6.3.5 Not for the faint of heart: using assembler code for vectorization
6.4 Programming style for better vectorization
6.5 Compiler flags relevant for vectorization for various compilers
6.6 OpenMP SIMD directives for better portability
6.7 Further explorations
6.7.1 Additional reading
6.7.2 Exercises
6.8 Summary
7 OpenMP that performs
7.1 OpenMP introduction
7.1.1 OpenMP concepts
7.1.2 A very simple OpenMP program
7.2 Typical OpenMP use cases: Loop-level, High-level, and MPI+OpenMP
7.2.1 Loop-level OpenMP for quick parallelization
7.2.2 High-level OpenMP for better parallel performance
7.2.3 MPI + OpenMP for extreme scalability
7.3 Examples of standard loop-level OpenMP
7.3.1 Loop-level OpenMP: Vector addition example
7.3.2 Stream triad example
7.3.3 Loop-level OpenMP: Stencil example
7.3.4 Performance of loop-level examples
7.3.5 Reduction example of a global sum using OpenMP threading
7.3.6 Potential loop-level OpenMP issues
7.4 Variable scope is critically important in OpenMP for correctness
7.5 Function-level OpenMP: making a whole function thread parallel
7.6 Improving parallel scalability with high-level OpenMP
7.6.1 How to implement high-level OpenMP
7.6.2 Example of implementing high-level OpenMP
7.7 Hybrid threading and vectorization with OpenMP
7.8 Advanced examples using OpenMP
7.8.1 Stencil example with a separate pass for the x and y directions
7.8.2 Kahan summation implementation with OpenMP threading
7.8.3 Threaded implementation of the prefix scan algorithm
7.9 Threading tools essential for robust implementations
7.9.1 Using Allinea MAP to get a quick high-level profile of your application
7.9.2 Finding your thread race conditions with Intel Inspector
7.10 Example of task-based support algorithm
7.11 Further explorations
7.11.1 Additional reading
7.11.2 Exercises
7.12 Summary
8 MPI: the parallel backbone
8.1 The basics for an MPI program
8.1.1 Basic MPI function calls for every MPI program
8.1.2 Compiler wrappers for simpler MPI programs
8.1.3 Using parallel startup commands
8.1.4 Minimum working example of an MPI program
8.2 The send and receive commands for process-to-process communication
8.3 Collective communication: a powerful component of MPI
8.3.1 Using a barrier to synchronize timers
8.3.2 Using the broadcast to handle small file input
8.3.3 Using a reduction to get a single value from across all processes
8.3.4 Using gather to put order in debug printouts
8.3.5 Using scatter and gather to send data out to processes for work
8.4 Data parallel examples
8.4.1 Stream triad to measure bandwidth on the node
8.4.2 Ghost cell exchanges in a two-dimensional mesh
8.4.3 Ghost cell exchanges in a three-dimensional stencil calculation
8.5 Advanced MPI functionality to simplify code and enable optimizations
8.5.1 Using custom MPI datatypes for performance and code simplification
8.5.2 Cartesian topology support in MPI
8.5.3 Performance tests of ghost cell exchange variants
8.6 Hybrid MPI+OpenMP for extreme scalability
8.6.1 Hybrid MPI+OpenMP benefits
8.6.2 MPI+OpenMP example
8.7 Further explorations
8.7.1 Additional reading
8.7.2 Exercises
8.8 Summary
Part 3
9 GPU architectures and concepts
9.1 The CPU-GPU system as an accelerated computational platform
9.1.1 Integrated GPUs: an underutilized option on commodity-based systems
9.1.2 Dedicated GPUs: the workhorse option
9.2 The GPU and the thread engine
9.2.1 The compute unit is the multiprocessor
9.2.2 Processing elements are the individual processors
9.2.3 Multiple data operations by each processing element
9.2.4 Calculating the peak theoretical FLOPs for some leading GPUs
9.3 Characteristics of GPU memory spaces
9.3.1 Calculating theoretical peak memory bandwidth
9.3.2 Measuring the GPU stream benchmark
9.3.3 Roofline performance model for GPUs
9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
9.4 The PCI bus: CPU-GPU data transfer overhead
9.4.1 Theoretical bandwidth of the PCI bus
9.4.2 A benchmark application for PCI bandwidth
9.5 Multi-GPU platforms and MPI
9.5.1 Optimizing the data movement from one GPU across a network to another
9.5.2 A higher performance alternative to the PCI bus
9.6 Potential benefits of GPU accelerated platforms
9.6.1 Reducing time-to-solution
9.6.2 Reducing energy use with GPUs
9.6.3 Reduction in cloud computing costs with GPUs
9.7 When to use GPUs
9.8 Further explorations
9.8.1 Additional reading
9.8.2 Exercises
9.9 Summary
10 GPU programming model
10.1 GPU programming abstractions: a common framework
10.1.1 Data decomposition into independent units of work: an NDRange or grid
10.1.2 Work groups provide a right-sized chunk of work
10.1.3 Subgroups, warps, or wavefronts execute in lockstep
10.1.4 Work item: the basic unit of operation
10.1.5 SIMD or vector hardware
10.2 The code structure for the GPU programming model
10.2.1 “Me” programming: the concept of a parallel kernel
10.2.2 Thread indices: mapping the local tile to the global world
10.2.3 Index sets
10.2.4 How to address memory resources in your GPU programming model
10.3 Optimizing GPU resource usage
10.3.1 How many registers does my kernel use?
10.3.2 Occupancy: making more work available for work group scheduling
10.4 Reduction pattern requires synchronization across work groups
10.5 Asynchronous computing through queues (streams)
10.6 Developing a plan to parallelize an application for GPUs
10.6.1 Case 1: 3D atmospheric simulation
10.6.2 Case 2: Unstructured mesh application
10.7 Further explorations
10.7.1 Additional reading
10.7.2 Exercises
10.8 Summary
11 Directive-based GPU programming
11.1 Process to apply directives and pragmas for a GPU implementation
11.2 OpenACC: the easiest way to run on your GPU
11.2.1 Compiling OpenACC code
11.2.2 Parallel compute regions in OpenACC for accelerating computations
11.2.3 Using directives to reduce data movement between the CPU and the GPU
11.2.4 Optimizing the GPU kernels
11.2.5 Summary of performance results for stream triad
11.2.6 Advanced OpenACC techniques
11.3 OpenMP: the heavyweight champ enters the world of accelerators
11.3.1 Compiling OpenMP code
11.3.2 Generating parallel work on the GPU with OpenMP
11.3.3 Creating data regions to control data movement to the GPU with OpenMP
11.3.4 Optimizing OpenMP for GPUs
11.3.5 Advanced OpenMP for GPUs
11.4 Further explorations
11.4.1 Additional reading
11.4.2 Exercises
11.5 Summary
12 GPU languages: getting down to basics
12.1 Features of a native GPU programming language
12.2 CUDA and HIP GPU languages: the low-level performance option
12.2.1 Writing and building your first CUDA application
12.2.2 A reduction kernel in CUDA: life gets complicated
12.2.3 Hipifying the CUDA code
12.3 OpenCL for a portable open source GPU language
12.3.1 Writing and building your first OpenCL application
12.3.2 Reductions in OpenCL
12.4 SYCL: an experimental C++ implementation goes mainstream
12.5 Higher-level languages for performance portability
12.5.1 Kokkos: a performance portability ecosystem
12.5.2 RAJA for a more adaptable performance portability layer
12.6 Further explorations
12.6.1 Additional reading
12.6.2 Exercises
12.7 Summary
13 GPU profiling and tools
13.1 Profiling tools overview
13.2 How to select a good workflow
13.3 Example problem: shallow water simulation
13.4 A sample of a profiling workflow
13.4.1 Run the shallow water code
13.4.2 Profile the CPU code
13.4.3 Add OpenACC compute directives
13.4.4 Add data movement directives
13.4.5 The Nvidia Nsight suite of tools can be a powerful development aid
13.4.6 CodeXL for the AMD GPU ecosystem
13.5 Don’t get lost in the swamp: focus on the important metrics
13.5.1 Occupancy
13.5.2 Issue efficiency
13.5.3 Achieved bandwidth
13.6 Containers and Virtual Machines provide alternate workflows
13.6.1 Docker containers as a workaround
13.6.2 Virtual machines using VirtualBox
13.7 Cloud options: a flexible and portable capability
13.8 Further explorations
13.8.1 Additional reading
13.8.2 Exercises
13.9 Summary
Appendix A: References
Appendix B: Solutions to Exercises
Appendix C: Glossary