In Praise of Computer Architecture: A Quantitative Approach Sixth Edition
1. Fundamentals of Quantitative Design and Analysis
1.3 Defining Computer Architecture
1.5 Trends in Power and Energy in Integrated Circuits
1.8 Measuring, Reporting, and Summarizing Performance
1.9 Quantitative Principles of Computer Design
1.10 Putting It All Together: Performance, Price, and Power
1.13 Historical Perspectives and References
Case Studies and Exercises by Diana Franklin
2. Memory Hierarchy Design
2.2 Memory Technology and Optimizations
2.3 Ten Advanced Optimizations of Cache Performance
2.4 Virtual Memory and Virtual Machines
2.5 Cross-Cutting Issues: The Design of Memory Hierarchies
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A53 and Intel Core i7 6700
2.8 Concluding Remarks: Looking Ahead
2.9 Historical Perspectives and References
3. Instruction-Level Parallelism and Its Exploitation
3.1 Instruction-Level Parallelism: Concepts and Challenges
3.2 Basic Compiler Techniques for Exposing ILP
3.3 Reducing Branch Costs With Advanced Branch Prediction
3.4 Overcoming Data Hazards With Dynamic Scheduling
3.5 Dynamic Scheduling: Examples and the Algorithm
3.6 Hardware-Based Speculation
3.7 Exploiting ILP Using Multiple Issue and Static Scheduling
3.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation
3.9 Advanced Techniques for Instruction Delivery and Speculation
3.11 Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
3.12 Putting It All Together: The Intel Core i7 6700 and ARM Cortex-A53
3.14 Concluding Remarks: What's Ahead?
3.15 Historical Perspective and References
Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell
4. Data-Level Parallelism in Vector, SIMD, and GPU Architectures
4.3 SIMD Instruction Set Extensions for Multimedia
4.5 Detecting and Enhancing Loop-Level Parallelism
4.7 Putting It All Together: Embedded Versus Server GPUs and Tesla Versus Core i7
4.10 Historical Perspective and References
Case Study and Exercises by Jason D. Bakos
5. Thread-Level Parallelism
5.2 Centralized Shared-Memory Architectures
5.3 Performance of Symmetric Shared-Memory Multiprocessors
5.4 Distributed Shared-Memory and Directory-Based Coherence
5.5 Synchronization: The Basics
5.6 Models of Memory Consistency: An Introduction
5.8 Putting It All Together: Multicore Processors and Their Performance
5.10 The Future of Multicore Scaling
5.12 Historical Perspectives and References
Case Studies and Exercises by Amr Zaky and David A. Wood
6. Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
6.2 Programming Models and Workloads for Warehouse-Scale Computers
6.3 Computer Architecture of Warehouse-Scale Computers
6.4 The Efficiency and Cost of Warehouse-Scale Computers
6.5 Cloud Computing: The Return of Utility Computing
6.7 Putting It All Together: A Google Warehouse-Scale Computer
6.10 Historical Perspectives and References
Case Studies and Exercises by Parthasarathy Ranganathan
7. Domain-Specific Architectures
7.3 Example Domain: Deep Neural Networks
7.4 Google’s Tensor Processing Unit, an Inference Data Center Accelerator
7.5 Microsoft Catapult, a Flexible Data Center Accelerator
7.6 Intel Crest, a Data Center Accelerator for Training
7.7 Pixel Visual Core, a Personal Mobile Device Image Processing Unit
7.9 Putting It All Together: CPUs Versus GPUs Versus DNN Accelerators
7.12 Historical Perspectives and References
Appendix A. Instruction Set Principles
A.2 Classifying Instruction Set Architectures
A.5 Operations in the Instruction Set
A.6 Instructions for Control Flow
A.7 Encoding an Instruction Set
A.8 Cross-Cutting Issues: The Role of Compilers
A.9 Putting It All Together: The RISC-V Architecture
Appendix B. Review of Memory Hierarchy
B.3 Six Basic Cache Optimizations
B.5 Protection and Examples of Virtual Memory
B.8 Historical Perspective and References
Appendix C. Pipelining: Basic and Intermediate Concepts
C.2 The Major Hurdle of Pipelining—Pipeline Hazards
C.3 How Is Pipelining Implemented?
C.4 What Makes Pipelining Hard to Implement?
C.5 Extending the RISC-V Integer Pipeline to Handle Multicycle Operations
C.6 Putting It All Together: The MIPS R4000 Pipeline
C.10 Historical Perspective and References
Updated Exercises by Diana Franklin
Appendix D. Storage Systems
D.2 Advanced Topics in Disk Storage
D.3 Definition and Examples of Real Faults and Failures
D.4 I/O Performance, Reliability Measures, and Benchmarks
D.7 Designing and Evaluating an I/O System—The Internet Archive Cluster
D.8 Putting It All Together: NetApp FAS6000 Filer
D.11 Historical Perspective and References
Case Studies with Exercises by Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau
Appendix E. Embedded Systems
E.2 Signal Processing and Embedded Applications: The Digital Signal Processor
E.5 Case Study: The Emotion Engine of the Sony PlayStation 2
E.6 Case Study: Sanyo VPC-SX500 Digital Camera
E.7 Case Study: Inside a Cell Phone
Appendix F. Interconnection Networks
F.2 Interconnecting Two Devices
F.3 Connecting More than Two Devices
F.5 Network Routing, Arbitration, and Switching
F.7 Practical Issues for Commercial Interconnection Networks
F.8 Examples of Interconnection Networks
F.10 Cross-Cutting Issues for Interconnection Networks
F.13 Historical Perspective and References
Appendix G. Vector Processors in More Depth
G.2 Vector Performance in More Depth
G.3 Vector Memory Systems in More Depth
G.4 Enhancing Vector Performance
G.5 Effectiveness of Compiler Vectorization
G.6 Putting It All Together: Performance of Vector Processors
G.7 A Modern Vector Supercomputer: The Cray X1
G.9 Historical Perspective and References
Appendix H. Hardware and Software for VLIW and EPIC
H.1 Introduction: Exploiting Instruction-Level Parallelism Statically
H.2 Detecting and Enhancing Loop-Level Parallelism
H.3 Scheduling and Structuring Code for Parallelism
H.4 Hardware Support for Exposing Parallelism: Predicated Instructions
H.5 Hardware Support for Compiler Speculation
H.6 The Intel IA-64 Architecture and Itanium Processor
Appendix I. Large-Scale Multiprocessors and Scientific Applications
I.2 Interprocessor Communication: The Critical Performance Issue
I.3 Characteristics of Scientific Applications
I.4 Synchronization: Scaling Up
I.5 Performance of Scientific Applications on Shared-Memory Multiprocessors
I.6 Performance Measurement of Parallel Processors with Scientific Applications
I.7 Implementing Cache Coherence
I.8 The Custom Cluster Approach: Blue Gene/L
Appendix J. Computer Arithmetic
J.2 Basic Techniques of Integer Arithmetic
J.4 Floating-Point Multiplication
J.7 More on Floating-Point Arithmetic
J.8 Speeding Up Integer Addition
J.9 Speeding Up Integer Multiplication and Division
J.12 Historical Perspective and References
Appendix K. Survey of Instruction Set Architectures
K.2 A Survey of RISC Architectures for Desktop, Server, and Embedded Computers
K.5 The IBM 360/370 Architecture for Mainframe Computers
K.6 Historical Perspective and References
Appendix L. Advanced Concepts on Address Translation
Appendix M. Historical Perspectives and References
M.2 The Early Development of Computers (Chapter 1)
M.3 The Development of Memory Hierarchy and Protection (Chapter 2 and Appendix B)
M.4 The Evolution of Instruction Sets (Appendices A, J, and K)
M.7 The History of Multiprocessors and Parallel Processing (Chapter 5 and Appendices F, G, and I)
M.8 The Development of Clusters (Chapter 6)
M.9 Historical Perspectives and References
M.10 The History of Magnetic Storage, RAID, and I/O Buses (Appendix D)