Index
Cover
Title
Copyright
About ApressOpen
Dedication
Contents at a Glance
Contents
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Part 1: Hardware Foundation: Intel Xeon Phi Architecture
Chapter 1: Introduction to Xeon Phi Architecture
History of Intel Xeon Phi Development
Evolution from Von Neumann Architecture to Cache Subsystem Architecture
Improvements in the Core and Memory
Interconnect and Cache Improvements
Intel Xeon Phi Coprocessor Chip Architecture
Applicability of the Intel Xeon Phi Coprocessor
Summary
Chapter 2: Programming Xeon Phi
Intel Xeon Phi Execution Models
Development Tools for Intel Xeon Phi Architecture
Intel Composer XE
Setting Up an Intel Xeon Phi System
Install the MPSS Stack
Install the Development Tools
Code Generation for Intel Xeon Phi Architecture
Native Execution Mode
Language Extensions to Support Offload Computation on Intel Xeon Phi
Heterogeneous Computing Model and Offload Pragmas
Language Extensions and Execution Model
Runtime Library Routines
Offload Example
Summary
Chapter 3: Xeon Phi Vector Architecture and Instruction Set
Xeon Phi Vector Microarchitecture
The VPU Pipeline
Vector Registers
Vector Mask Registers
Extended Math Unit
Xeon Phi Vector Instruction Set Architecture
Data Types
Vector Nomenclature
Vector Instruction Syntax
Xeon Phi Vector ISA by Categories
Summary
Chapter 4: Xeon Phi Core Microarchitecture
Intel Xeon Phi Cores
Core Pipeline Stages
Cache and TLB Structure
L2 Cache Structure
Multithreading
Performance Considerations
Probing the Core
Summary
Chapter 5: Xeon Phi Cache and Memory Subsystem
The Interconnect Topologies for Manycore Processors
Bidirectional Ring Topology
Two-Dimensional Mesh Topology
Two-Dimensional Torus Topology
Other Topologies
The Ring Interconnect Architecture in Intel Xeon Phi
L2 Cache
Tag Directory
Data Transactions
The Cache Coherency Protocol
Hardware Prefetcher
Memory Transactions Flow
Cacheable Memory Read Transaction
Managing Cache Hierarchy in Software
Probing the Memory Subsystem
Measuring the Memory Bandwidth on Intel Xeon Phi
Summary
Chapter 6: Xeon Phi PCIe Bus Data Transfer and Power Management
DMA Engine
Measuring the Data Transfer Bandwidth over the PCIe Bus
Reading Data from the Coprocessor
Low-Level Data Transfer APIs for Intel Xeon Phi
Placement of PCIe Cards for Optimal Data Transfer BW
Power Management and Reliability
Idle State Management
Reliability Availability and Serviceability Features in the Intel Xeon Phi Coprocessor
Summary
Part 2: Software Foundation: Intel Xeon Phi System Software and Tools
Chapter 7: Xeon Phi System Software
System Software Component
Ring 0 Driver Layer Components of the MPSS
System Boot Process
Coprocessor OS
Creating a Third-Party Coprocessor OS
mic0: Transition from State Booting to Online Host Driver
Linux Virtual File System (Sysfs and Procfs)
Networking on Xeon Phi
Network File System
Open Fabrics Enterprise Distribution and Message Passing Interface Support
System Software Application Components
Summary
Chapter 8: Xeon Phi Application Development Tools
The Application Development Tools
Intel C/C++ Composer XE
OpenMP 4.0 and Language Extensions
Pragmas
Asynchronous Data Transfer Over PCI Express
Keywords
Using Shared Virtual Memory
Valid Use of the Keywords
Macros
Intrinsics
C++ Class Libraries
Application Programming Interfaces
Environment Variables
Compiler Options
Creating Offload Libraries
Intel Fortran Composer XE
Directives
Macros
Application Programming Interfaces
Environment Variables, Compiler Options, and Creating Static Libraries
Third-Party Compilers Supporting Xeon Phi
CAPS Compiler
Debugging Xeon Phi Applications
Intel Debugger
Third-Party Debuggers
Optimization Tool: Intel VTune Amplifier XE
Libraries
Native or Symmetric Execution
Compiler-Assisted Offload
Using the Automatic Offload Version of the MKL Library
Third-Party Math Libraries
Intel Cluster Tools
Third-Party Cluster Tools
Summary
Part 3: Applications: Technical Computing Software Development on Intel Xeon Phi
Chapter 9: Xeon Phi Application Design and Implementation Considerations
Workload-Related Considerations
Gustafson’s Law
Scaled Speedup
Effect of Grid Shape on Performance
Algorithm Considerations
Data Structure
Offload Overhead
Load Balancing
Implementation Considerations
Memory Management
Mixed-Precision Arithmetic
Optimizing Memory Transfer Bandwidth over the PCIe Bus
Data Alignment Considerations
Communication
Summary
Chapter 10: Application Performance Tuning on Xeon Phi
Getting Baseline Data
Timing Applications
Detecting Application Execution Bottlenecks
Some Basic Performance Events
Setting Target Performance
Optimizing Code
Compiler-Driven Optimizations
Data Alignment
Removing Pointer Aliasing
Streaming Store
Using Large Pages
Using Intel Cilk Plus Array Notation
Vectorization with Intel Compiler
Using the Math Kernel Library
Cluster-Level Tuning
Summary
Chapter 11: Algorithm and Data Structures for Xeon Phi
Algorithm and Data Structure Design Rules for Xeon Phi
General Matrix-Matrix Multiply Algorithm (GEMM)
Rules 1 and 3: Scalable Parallelization and Optimal Cache Reuse
Rule 2: Efficient Vectorization
Molecular Dynamics
Rule 1: Scalable Parallelization
Rules 2 and 3: Efficient Vectorization and Optimal Cache Reuse
Stencil Operation
Rule 1: Scalable Parallelization
Rule 2: Efficient Vectorization
Rule 3: Optimal Cache Reuse
European Option Pricing Using Monte Carlo Simulation in Financial Applications
Rule 1: Scalable Parallelization
Rule 2: Efficient Vectorization
Rule 3: Optimal Cache Reuse
Summary
Chapter 12: Xeon Phi Application Development on Windows OS
MPSS
MPSS Tools
Development Tools
Language Extensions for the Xeon Phi Coprocessor
Offload Environment Variables
Debugging Offload Execution
Logging into Xeon Phi Console using PuTTY
Using VTune Amplifier XE to Profile Offload Code on Windows
Building and Running Xeon Phi Native Applications from the Windows Host
Summary
Appendix A: OpenCL on Xeon Phi
Installation
Building and Running OpenCL Application
Performance Optimization
Appendix B: Virtual Shared Memory Programming on Xeon Phi
Placing Data on the Virtual Shared Memory Region
Shared Functions
Synchronizing Between the Host and the Coprocessors
Index