Intel® Xeon Phi™ Coprocessor High-Performance Programming by Jeffers, Jim

Index

Cover image
Title page
Table of Contents
Copyright
Foreword
Preface

Organization
Lots-of-cores.com

Acknowledgements

Chapter 1. Introduction

Trend: more parallelism
Why Intel® Xeon Phi™ coprocessors are needed
Platforms with coprocessors
The first Intel® Xeon Phi™ coprocessor
Keeping the “Ninja Gap” under control
Transforming-and-tuning double advantage
When to use an Intel® Xeon Phi™ coprocessor
Maximizing performance on processors first
Why scaling past one hundred threads is so important
Maximizing parallel program performance
Measuring readiness for highly parallel execution
What about GPUs?
Beyond the ease of porting to increased performance
Transformation for performance
Hyper-threading versus multithreading
Coprocessor major usage model: MPI versus offload
Compiler and programming models
Cache optimizations
Examples, then details
For more information

Chapter 2. High Performance Closed Track Test Drive!

Chapter 3. A Friendly Country Road Race

Preparing for our country road trip: chapter focus
Getting a feel for the road: the 9-point stencil algorithm
At the starting line: the baseline 9-point stencil implementation
Rough road ahead: running the baseline stencil code
Cobblestone street ride: vectors but not yet scaling
Open road all-out race: vectors plus scaling
Some grease and wrenches!: a bit of tuning
Summary
For more information

Chapter 4. Driving Around Town: Optimizing A Real-World Code Example

Chapter 5. Lots of Data (Vectors)

Why vectorize?
How to vectorize
Five approaches to achieving vectorization
Six step vectorization methodology
Streaming through caches: data layout, alignment, prefetching, and so on
Compiler tips
Compiler options
Compiler directives
Use array sections to encourage vectorization
Look at what the compiler created: assembly code inspection
Numerical result variations with vectorization
Summary
For more information

Chapter 6. Lots of Tasks (not Threads)

OpenMP, Fortran 2008, Intel® TBB, Intel® Cilk™ Plus, Intel® MKL
OpenMP
Fortran 2008
Intel® TBB
Cilk Plus
Summary
For more information

Chapter 7. Offload

Two offload models
Choosing offload vs. native execution
Language extensions for offload
Using pragma/directive offload
Using offload with shared virtual memory
About asynchronous computation
About asynchronous data transfer
Applying the target attribute to multiple declarations
Performing file I/O on the coprocessor
Logging stdout and stderr from offloaded code
Summary
For more information

Chapter 8. Coprocessor Architecture

The Intel® Xeon Phi™ coprocessor family
Coprocessor card design
Intel® Xeon Phi™ coprocessor silicon overview
Individual coprocessor core architecture
Instruction and multithread processing
Cache organization and memory access considerations
Prefetching
Vector processing unit architecture
Coprocessor PCIe system interface and DMA
Coprocessor power management capabilities
Reliability, availability, and serviceability (RAS)
Coprocessor system management controller (SMC)
Benchmarks
Summary
For more information

Chapter 9. Coprocessor System Software

Coprocessor software architecture overview
Coprocessor programming models and options
Coprocessor software architecture components
Intel® manycore platform software stack
Linux support for Intel® Xeon Phi™ coprocessors
Tuning memory allocation performance
Summary
For more information

Chapter 10. Linux on the Coprocessor

Coprocessor Linux baseline
Introduction to coprocessor Linux bootstrap and configuration
Default coprocessor Linux configuration
Changing coprocessor configuration
The micctrl utility
Adding software
Coprocessor Linux boot process
Coprocessors in a Linux cluster
Summary
For more information

Chapter 11. Math Library

Intel Math Kernel Library overview
Intel MKL and Intel compiler
Coprocessor support overview
Using the coprocessor in native mode
Using automatic offload mode
Using compiler-assisted offload
Precision choices and variations
Summary
For more information

Chapter 12. MPI

MPI overview
Using MPI on Intel® Xeon Phi™ coprocessors
Prerequisites (batteries not included)
Offload from an MPI rank
Using MPI natively on the coprocessor
Summary
For more information

Chapter 13. Profiling and Timing

Event monitoring registers on the coprocessor
Efficiency metrics
Potential performance issues
Intel® VTune™ Amplifier XE product
Performance application programming interface
MPI analysis: Intel Trace Analyzer and Collector
Timing
Summary
For more information

Chapter 14. Summary

Advice
Additional resources
Another book coming?
Feedback appreciated

Glossary
Index