Intel® Xeon Phi(TM) Architecture and Tools · The Guide for Application Developers by Rahman, Rezaur

Index

Cover Title Copyright About ApressOpen Dedication Contents at a Glance Contents About the Author About the Technical Reviewer Acknowledgments Introduction

Part 1: Hardware Foundation: Intel Xeon Phi Architecture

Chapter 1: Introduction to Xeon Phi Architecture

History of Intel Xeon Phi Development

Evolution from Von Neumann Architecture to Cache Subsystem Architecture Improvements in the Core and Memory Interconnect and Cache Improvements

Intel Xeon Phi Coprocessor Chip Architecture Applicability of the Intel Xeon Phi Coprocessor Summary

Chapter 2: Programming Xeon Phi

Intel Xeon Phi Execution Models Development Tools for Intel Xeon Phi Architecture

Intel Composer XE

Setting Up an Intel Xeon Phi System

Install the MPSS Stack Install the Development Tools

Code Generation for Intel Xeon Phi Architecture

Native Execution Mode

Language Extensions to Support Offload Computation on Intel Xeon Phi

Heterogeneous Computing Model and Offload Pragmas Language Extensions and Execution Model Runtime Library Routines Offload Example

Summary

Chapter 3: Xeon Phi Vector Architecture and Instruction Set

Xeon Phi Vector Microarchitecture

The VPU Pipeline Vector Registers Vector Mask Registers Extended Math Unit

Xeon Phi Vector Instruction Set Architecture

Data Types Vector Nomenclature Vector Instruction Syntax Xeon Phi Vector ISA by Categories

Summary

Chapter 4: Xeon Phi Core Microarchitecture

Intel Xeon Phi Cores Core Pipeline Stages Cache and TLB Structure L2 Cache Structure Multithreading

Performance Considerations Probing the Core

Summary

Chapter 5: Xeon Phi Cache and Memory Subsystem

The Interconnect Topologies for Manycore Processors

Bidirectional Ring Topology Two-Dimensional Mesh Topology Two-Dimensional Torus Topology Other Topologies

The Ring Interconnect Architecture in Intel Xeon Phi L2 Cache

Tag Directory Data Transactions The Cache Coherency Protocol Hardware Prefetcher

Memory Transactions Flow

Cacheable Memory Read Transaction Managing Cache Hierarchy in Software

Probing the Memory Subsystem

Measuring the Memory Bandwidth on Intel Xeon Phi

Summary

Chapter 6: Xeon Phi PCIe Bus Data Transfer and Power Management

DMA Engine

Measuring the Data Transfer Bandwidth over the PCIe Bus

Reading Data from the Coprocessor Low-Level Data Transfer APIs for Intel Xeon Phi Placement of PCIe Cards for Optimal Data Transfer BW Power Management and Reliability

Idle State Management Reliability Availability and Serviceability Features in the Intel Xeon Phi Coprocessor

Summary

Part 2: Software Foundation: Intel Xeon Phi System Software and Tools

Chapter 7: Xeon Phi System Software

System Software Component Ring 0 Driver Layer Components of the MPSS

System Boot Process Coprocessor OS Creating a Third-Party Coprocessor OS mic0: Transition from State Booting to Online Host Driver Linux Virtual File System (Sysfs and Procfs) Networking on Xeon Phi Network File System Open Fabrics Enterprise Distribution and Message Passing Interface Support System Software Application Components

Summary

Chapter 8: Xeon Phi Application Development Tools

The Application Development Tools

Intel C/C++ Composer XE OpenMP 4.0 and Language Extensions Pragmas Asynchronous Data Transfer Over PCI Express

Keywords

Using Shared Virtual Memory Valid Use of the Keywords

Macros Intrinsics

C++ Class Libraries

Application Programming Interfaces

Environment Variables Compiler Options Creating Offload Libraries

Intel Fortran Composer XE

Directives Macros Application Programming Interfaces

Environment Variables, Compiler Options, and Creating Static Libraries

Third-Party Compilers Supporting Xeon Phi CAPS Compiler Debugging Xeon Phi Applications Intel Debugger Third-Party Debuggers

Optimization Tool: Intel VTune Amplifier XE Libraries

Native or Symmetric Execution Compiler-Assisted Offload Using the Automatic Offload Version of the MKL Library Third-Party Math Libraries

Intel Cluster Tools

Third-Party Cluster Tools

Summary

Part 3: Applications: Technical Computing Software Development on Intel Xeon Phi

Chapter 9: Xeon Phi Application Design and Implementation Considerations

Workload-Related Considerations

Gustafson’s Law Scaled Speedup

Effect of Grid Shape on Performance

Algorithm Considerations Data Structure Offload Overhead Load Balancing

Implementation Considerations

Memory Management Mixed-Precision Arithmetic Optimizing Memory Transfer Bandwidth over the PCIe Bus Data Alignment Considerations Communication

Summary

Chapter 10: Application Performance Tuning on Xeon Phi

Getting Baseline Data Timing Applications Detecting Application Execution Bottlenecks

Some Basic Performance Events

Setting Target Performance Optimizing Code

Compiler-Driven Optimizations Data Alignment Removing Pointer Aliasing Streaming Store Using Large Pages Using Intel Cilk Plus Array Notation Vectorization with Intel Compiler

Using the Math Kernel Library Cluster-Level Tuning Summary

Chapter 11: Algorithm and Data Structures for Xeon Phi

Algorithm and Data Structure Design Rules for Xeon Phi General Matrix-Matrix Multiply Algorithm (GEMM)

Rules 1 and 3: Scalable Parallelization and Optimal Cache Reuse Rule 2: Efficient Vectorization

Molecular Dynamics

Rule 1: Scalable Parallelization Rules 2 and 3: Efficient Vectorization and Optimal Cache Reuse

Stencil Operation

Rule 1: Scalable Parallelization Rule 2: Efficient Vectorization Rule 3: Optimal Cache Reuse

European Option Pricing Using Monte Carlo Simulation in Financial Applications

Rule 1: Scalable Parallelization Rule 2: Efficient Vectorization Rule 3: Optimal Cache Reuse

Summary

Chapter 12: Xeon Phi Application Development on Windows OS

MPSS

MPSS Tools

Development Tools

Language Extensions for the Xeon Phi Coprocessor Offload Environment Variables

Debugging Offload Execution

Logging into Xeon Phi Console Using PuTTY

Using VTune Amplifier XE to Profile Offload Code on Windows Building and Running Xeon Phi Native Applications from the Windows Host Summary

Appendix A: OpenCL on Xeon Phi

Installation Building and Running OpenCL Application Performance Optimization

Appendix B: Virtual Shared Memory Programming on Xeon Phi

Placing Data on the Virtual Shared Memory Region Shared Functions Synchronizing Between the Host and the Coprocessors

Index
