About This eBook
Title Page
Copyright Page
Contents at a Glance
Contents
Preface
About This Book
Acknowledgments
Part I Design: Building It
Part II Operations: Running It
Part III Appendices
About the Authors
Introduction
Business Objectives
Ideal System Architecture
Ideal Release Process
Ideal Operations
Part I Design: Building It
Chapter 1. Designing in a Distributed World
1.1 Visibility at Scale
1.2 The Importance of Simplicity
1.3 Composition
1.3.1 Load Balancer with Multiple Backend Replicas
1.3.2 Server with Multiple Backends
1.3.3 Server Tree
1.4 Distributed State
1.5 The CAP Principle
1.5.1 Consistency
1.5.2 Availability
1.5.3 Partition Tolerance
1.6 Loosely Coupled Systems
1.7 Speed
1.8 Summary
Exercises
Chapter 2. Designing for Operations
2.1 Operational Requirements
2.1.1 Configuration
2.1.2 Startup and Shutdown
2.1.3 Queue Draining
2.1.4 Software Upgrades
2.1.5 Backups and Restores
2.1.6 Redundancy
2.1.7 Replicated Databases
2.1.8 Hot Swaps
2.1.9 Toggles for Individual Features
2.1.10 Graceful Degradation
2.1.11 Access Controls and Rate Limits
2.1.12 Data Import Controls
2.1.13 Monitoring
2.1.14 Auditing
2.1.15 Debug Instrumentation
2.1.16 Exception Collection
2.1.17 Documentation for Operations
2.2 Implementing Design for Operations
2.2.1 Build Features in from the Beginning
2.2.2 Request Features as They Are Identified
2.2.3 Write the Features Yourself
2.2.4 Work with a Third-Party Vendor
2.3 Improving the Model
2.4 Summary
Exercises
Chapter 3. Selecting a Service Platform
3.1 Level of Service Abstraction
3.1.1 Infrastructure as a Service
3.1.2 Platform as a Service
3.1.3 Software as a Service
3.2 Type of Machine
3.2.1 Physical Machines
3.2.2 Virtual Machines
3.2.3 Containers
3.3 Level of Resource Sharing
3.3.1 Compliance
3.3.2 Privacy
3.3.3 Cost
3.3.4 Control
3.4 Colocation
3.5 Selection Strategies
3.6 Summary
Exercises
Chapter 4. Application Architectures
4.1 Single-Machine Web Server
4.2 Three-Tier Web Service
4.2.1 Load Balancer Types
4.2.2 Load Balancing Methods
4.2.3 Load Balancing with Shared State
4.2.4 User Identity
4.2.5 Scaling
4.3 Four-Tier Web Service
4.3.1 Frontends
4.3.2 Application Servers
4.3.3 Configuration Options
4.4 Reverse Proxy Service
4.5 Cloud-Scale Service
4.5.1 Global Load Balancer
4.5.2 Global Load Balancing Methods
4.5.3 Global Load Balancing with User-Specific Data
4.5.4 Internal Backbone
4.6 Message Bus Architectures
4.6.1 Message Bus Designs
4.6.2 Message Bus Reliability
4.6.3 Example 1: Link-Shortening Site
4.6.4 Example 2: Employee Human Resources Data Updates
4.7 Service-Oriented Architecture
4.7.1 Flexibility
4.7.2 Support
4.7.3 Best Practices
4.8 Summary
Exercises
Chapter 5. Design Patterns for Scaling
5.1 General Strategy
5.1.1 Identify Bottlenecks
5.1.2 Reengineer Components
5.1.3 Measure Results
5.1.4 Be Proactive
5.2 Scaling Up
5.3 The AKF Scaling Cube
5.3.1 x: Horizontal Duplication
5.3.2 y: Functional or Service Splits
5.3.3 z: Lookup-Oriented Split
5.3.4 Combinations
5.4 Caching
5.4.1 Cache Effectiveness
5.4.2 Cache Placement
5.4.3 Cache Persistence
5.4.4 Cache Replacement Algorithms
5.4.5 Cache Entry Invalidation
5.4.6 Cache Size
5.5 Data Sharding
5.6 Threading
5.7 Queueing
5.7.1 Benefits
5.7.2 Variations
5.8 Content Delivery Networks
5.9 Summary
Exercises
Chapter 6. Design Patterns for Resiliency
6.1 Software Resiliency Beats Hardware Reliability
6.2 Everything Malfunctions Eventually
6.2.1 MTBF in Distributed Systems
6.2.2 The Traditional Approach
6.2.3 The Distributed Computing Approach
6.3 Resiliency through Spare Capacity
6.3.1 How Much Spare Capacity
6.3.2 Load Sharing versus Hot Spares
6.4 Failure Domains
6.5 Software Failures
6.5.1 Software Crashes
6.5.2 Software Hangs
6.5.3 Query of Death
6.6 Physical Failures
6.6.1 Parts and Components
6.6.2 Machines
6.6.3 Load Balancers
6.6.4 Racks
6.6.5 Datacenters
6.7 Overload Failures
6.7.1 Traffic Surges
6.7.2 DoS and DDoS Attacks
6.7.3 Scraping Attacks
6.8 Human Error
6.9 Summary
Exercises
Part II Operations: Running It
Chapter 7. Operations in a Distributed World
7.1 Distributed Systems Operations
7.1.1 SRE versus Traditional Enterprise IT
7.1.2 Change versus Stability
7.1.3 Defining SRE
7.1.4 Operations at Scale
7.2 Service Life Cycle
7.2.1 Service Launches
7.2.2 Service Decommissioning
7.3 Organizing Strategy for Operational Teams
7.3.1 Team Member Day Types
7.3.2 Other Strategies
7.4 Virtual Office
7.4.1 Communication Mechanisms
7.4.2 Communication Policies
7.5 Summary
Exercises
Chapter 8. DevOps Culture
8.1 What Is DevOps?
8.1.1 The Traditional Approach
8.1.2 The DevOps Approach
8.2 The Three Ways of DevOps
8.2.1 The First Way: Workflow
8.2.2 The Second Way: Improve Feedback
8.2.3 The Third Way: Continual Experimentation and Learning
8.2.4 Small Batches Are Better
8.2.5 Adopting the Strategies
8.3 History of DevOps
8.3.1 Evolution
8.3.2 Site Reliability Engineering
8.4 DevOps Values and Principles
8.4.1 Relationships
8.4.2 Integration
8.4.3 Automation
8.4.4 Continuous Improvement
8.4.5 Common Nontechnical DevOps Practices
8.4.6 Common Technical DevOps Practices
8.4.7 Release Engineering DevOps Practices
8.5 Converting to DevOps
8.5.1 Getting Started
8.5.2 DevOps at the Business Level
8.6 Agile and Continuous Delivery
8.6.1 What Is Agile?
8.6.2 What Is Continuous Delivery?
8.7 Summary
Exercises
Chapter 9. Service Delivery: The Build Phase
9.1 Service Delivery Strategies
9.1.1 Pattern: Modern DevOps Methodology
9.1.2 Anti-pattern: Waterfall Methodology
9.2 The Virtuous Cycle of Quality
9.3 Build-Phase Steps
9.3.1 Develop
9.3.2 Commit
9.3.3 Build
9.3.4 Package
9.3.5 Register
9.4 Build Console
9.5 Continuous Integration
9.6 Packages as Handoff Interface
9.7 Summary
Exercises
Chapter 10. Service Delivery: The Deployment Phase
10.1 Deployment-Phase Steps
10.1.1 Promotion
10.1.2 Installation
10.1.3 Configuration
10.2 Testing and Approval
10.2.1 Testing
10.2.2 Approval
10.3 Operations Console
10.4 Infrastructure Automation Strategies
10.4.1 Preparing Physical Machines
10.4.2 Preparing Virtual Machines
10.4.3 Installing OS and Services
10.5 Continuous Delivery
10.6 Infrastructure as Code
10.7 Other Platform Services
10.8 Summary
Exercises
Chapter 11. Upgrading Live Services
11.1 Taking the Service Down for Upgrading
11.2 Rolling Upgrades
11.3 Canary
11.4 Phased Roll-outs
11.5 Proportional Shedding
11.6 Blue-Green Deployment
11.7 Toggling Features
11.8 Live Schema Changes
11.9 Live Code Changes
11.10 Continuous Deployment
11.11 Dealing with Failed Code Pushes
11.12 Release Atomicity
11.13 Summary
Exercises
Chapter 12. Automation
12.1 Approaches to Automation
12.1.1 The Left-Over Principle
12.1.2 The Compensatory Principle
12.1.3 The Complementarity Principle
12.1.4 Automation for System Administration
12.1.5 Lessons Learned
12.2 Tool Building versus Automation
12.2.1 Example: Auto Manufacturing
12.2.2 Example: Machine Configuration
12.2.3 Example: Account Creation
12.2.4 Tools Are Good, But Automation Is Better
12.3 Goals of Automation
12.4 Creating Automation
12.4.1 Making Time to Automate
12.4.2 Reducing Toil
12.4.3 Determining What to Automate First
12.5 How to Automate
12.6 Language Tools
12.6.1 Shell Scripting Languages
12.6.2 Scripting Languages
12.6.3 Compiled Languages
12.6.4 Configuration Management Languages
12.7 Software Engineering Tools and Techniques
12.7.1 Issue Tracking Systems
12.7.2 Version Control Systems
12.7.3 Software Packaging
12.7.4 Style Guides
12.7.5 Test-Driven Development
12.7.6 Code Reviews
12.7.7 Writing Just Enough Code
12.8 Multitenant Systems
12.9 Summary
Exercises
Chapter 13. Design Documents
13.1 Design Documents Overview
13.1.1 Documenting Changes and Rationale
13.1.2 Documentation as a Repository of Past Decisions
13.2 Design Document Anatomy
13.3 Template
13.4 Document Archive
13.5 Review Workflows
13.5.1 Reviewers and Approvers
13.5.2 Achieving Sign-off
13.6 Adopting Design Documents
13.7 Summary
Exercises
Chapter 14. Oncall
14.1 Designing Oncall
14.1.1 Start with the SLA
14.1.2 Oncall Roster
14.1.3 Onduty
14.1.4 Oncall Schedule Design
14.1.5 The Oncall Calendar
14.1.6 Oncall Frequency
14.1.7 Types of Notifications
14.1.8 After-Hours Maintenance Coordination
14.2 Being Oncall
14.2.1 Pre-shift Responsibilities
14.2.2 Regular Oncall Responsibilities
14.2.3 Alert Responsibilities
14.2.4 Observe, Orient, Decide, Act (OODA)
14.2.5 Oncall Playbook
14.2.6 Third-Party Escalation
14.2.7 End-of-Shift Responsibilities
14.3 Between Oncall Shifts
14.3.1 Long-Term Fixes
14.3.2 Postmortems
14.4 Periodic Review of Alerts
14.5 Being Paged Too Much
14.6 Summary
Exercises
Chapter 15. Disaster Preparedness
15.1 Mindset
15.1.1 Antifragile Systems
15.1.2 Reducing Risk
15.2 Individual Training: Wheel of Misfortune
15.3 Team Training: Fire Drills
15.3.1 Service Testing
15.3.2 Random Testing
15.4 Training for Organizations: Game Day/DiRT
15.4.1 Getting Started
15.4.2 Increasing Scope
15.4.3 Implementation and Logistics
15.4.4 Experiencing a DiRT Test
15.5 Incident Command System
15.5.1 How It Works: Public Safety Arena
15.5.2 How It Works: IT Operations Arena
15.5.3 Incident Action Plan
15.5.4 Best Practices
15.5.5 ICS Example
15.6 Summary
Exercises
Chapter 16. Monitoring Fundamentals
16.1 Overview
16.1.1 Uses of Monitoring
16.1.2 Service Management
16.2 Consumers of Monitoring Information
16.3 What to Monitor
16.4 Retention
16.5 Meta-monitoring
16.6 Logs
16.6.1 Approach
16.6.2 Timestamps
16.7 Summary
Exercises
Chapter 17. Monitoring Architecture and Practice
17.1 Sensing and Measurement
17.1.1 Blackbox versus Whitebox Monitoring
17.1.2 Direct versus Synthesized Measurements
17.1.3 Rate versus Capability Monitoring
17.1.4 Gauges versus Counters
17.2 Collection
17.2.1 Push versus Pull
17.2.2 Protocol Selection
17.2.3 Server Component versus Agent versus Poller
17.2.4 Central versus Regional Collectors
17.3 Analysis and Computation
17.4 Alerting and Escalation Manager
17.4.1 Alerting, Escalation, and Acknowledgments
17.4.2 Silence versus Inhibit
17.5 Visualization
17.5.1 Percentiles
17.5.2 Stack Ranking
17.5.3 Histograms
17.6 Storage
17.7 Configuration
17.8 Summary
Exercises
Chapter 18. Capacity Planning
18.1 Standard Capacity Planning
18.1.1 Current Usage
18.1.2 Normal Growth
18.1.3 Planned Growth
18.1.4 Headroom
18.1.5 Resiliency
18.1.6 Timetable
18.2 Advanced Capacity Planning
18.2.1 Identifying Your Primary Resources
18.2.2 Knowing Your Capacity Limits
18.2.3 Identifying Your Core Drivers
18.2.4 Measuring Engagement
18.2.5 Analyzing the Data
18.2.6 Monitoring the Key Indicators
18.2.7 Delegating Capacity Planning
18.3 Resource Regression
18.4 Launching New Services
18.5 Reduce Provisioning Time
18.6 Summary
Exercises
Chapter 19. Creating KPIs
19.1 What Is a KPI?
19.2 Creating KPIs
19.2.1 Step 1: Envision the Ideal
19.2.2 Step 2: Quantify Distance to the Ideal
19.2.3 Step 3: Imagine How Behavior Will Change
19.2.4 Step 4: Revise and Select
19.2.5 Step 5: Deploy the KPI
19.3 Example KPI: Machine Allocation
19.3.1 The First Pass
19.3.2 The Second Pass
19.3.3 Evaluating the KPI
19.4 Case Study: Error Budget
19.4.1 Conflicting Goals
19.4.2 A Unified Goal
19.4.3 Everyone Benefits
19.5 Summary
Exercises
Chapter 20. Operational Excellence
20.1 What Does Operational Excellence Look Like?
20.2 How to Measure Greatness
20.3 Assessment Methodology
20.3.1 Operational Responsibilities
20.3.2 Assessment Levels
20.3.3 Assessment Questions and Look-For’s
20.4 Service Assessments
20.4.1 Identifying What to Assess
20.4.2 Assessing Each Service
20.4.3 Comparing Results across Services
20.4.4 Acting on the Results
20.4.5 Assessment and Project Planning Frequencies
20.5 Organizational Assessments
20.6 Levels of Improvement
20.7 Getting Started
20.8 Summary
Exercises
Epilogue
Part III Appendices
Appendix A. Assessments
A.1 Regular Tasks (RT)
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.2 Emergency Response (ER)
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.3 Monitoring and Metrics (MM)
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.4 Capacity Planning (CP)
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.5 Change Management (CM)
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.6 New Product Introduction and Removal (NPI/NPR)
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.7 Service Deployment and Decommissioning (SDD)
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.8 Performance and Efficiency (PE)
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.9 Service Delivery: The Build Phase
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.10 Service Delivery: The Deployment Phase
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.11 Toil Reduction
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
A.12 Disaster Preparedness
Sample Assessment Questions
Level 1: Initial
Level 2: Repeatable
Level 3: Defined
Level 4: Managed
Level 5: Optimizing
Appendix B. The Origins and Future of Distributed Computing and Clouds
B.1 The Pre-Web Era (1985–1994)
Availability Requirements
Technology
Scaling
High Availability
Costs
B.2 The First Web Era: The Bubble (1995–2000)
Availability Requirements
Technology
Scaling
High Availability
N + 1 Configurations
N + 2 Configurations
Costs
B.3 The Dot-Bomb Era (2000–2003)
Availability Requirements
Technology
High Availability
Scaling
Data Scaling
Applicability
Costs
B.4 The Second Web Era (2003–2010)
Availability Requirements
Technology
High Availability
Scaling
Costs
B.5 The Cloud Computing Era (2010–present)
Availability Requirements
Costs
Scaling and High Availability
Technology
B.6 Conclusion
Exercises
Appendix C. Scaling Terminology and Concepts
C.1 Constant, Linear, and Exponential Scaling
C.2 Big O Notation
C.3 Limitations of Big O Notation
Appendix D. Templates and Examples
D.1 Design Document Template
D.2 Design Document Example
D.3 Sample Postmortem Template
Appendix E. Recommended Reading
DevOps:
ITIL:
Theory:
Classic Google Papers:
Classic Facebook Papers:
Scalability:
UNIX Internals:
UNIX Systems Programming:
Network Protocols:
Bibliography
Index