Foreword
Preface
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
I. Introduction
1. Introduction
The Sysadmin Approach to Service Management
Google’s Approach to Service Management: Site Reliability Engineering
Tenets of SRE
Ensuring a Durable Focus on Engineering
Pursuing Maximum Change Velocity Without Violating a Service’s SLO
Monitoring
Emergency Response
Change Management
Demand Forecasting and Capacity Planning
Provisioning
Efficiency and Performance
The End of the Beginning
2. The Production Environment at Google, from the Viewpoint of an SRE
Hardware
System Software That “Organizes” the Hardware
Managing Machines
Storage
Networking
Other System Software
Lock Service
Monitoring and Alerting
Our Software Infrastructure
Our Development Environment
Shakespeare: A Sample Service
Life of a Request
Job and Data Organization
II. Principles
3. Embracing Risk
Managing Risk
Measuring Service Risk
Risk Tolerance of Services
Identifying the Risk Tolerance of Consumer Services
Target level of availability
Types of failures
Cost
Other service metrics
Identifying the Risk Tolerance of Infrastructure Services
Target level of availability
Types of failures
Cost
Example: Frontend infrastructure
Motivation for Error Budgets
Forming Your Error Budget
Benefits
4. Service Level Objectives
Service Level Terminology
Indicators
Objectives
Agreements
Indicators in Practice
What Do You and Your Users Care About?
Collecting Indicators
Aggregation
Standardize Indicators
Objectives in Practice
Defining Objectives
Choosing Targets
Control Measures
SLOs Set Expectations
Agreements in Practice
5. Eliminating Toil
Toil Defined
Why Less Toil Is Better
What Qualifies as Engineering?
Is Toil Always Bad?
Conclusion
6. Monitoring Distributed Systems
Definitions
Why Monitor?
Setting Reasonable Expectations for Monitoring
Symptoms Versus Causes
Black-Box Versus White-Box
The Four Golden Signals
Worrying About Your Tail (or, Instrumentation and Performance)
Choosing an Appropriate Resolution for Measurements
As Simple as Possible, No Simpler
Tying These Principles Together
Monitoring for the Long Term
Bigtable SRE: A Tale of Over-Alerting
Gmail: Predictable, Scriptable Responses from Humans
The Long Run
Conclusion
7. The Evolution of Automation at Google
The Value of Automation
Consistency
A Platform
Faster Repairs
Faster Action
Time Saving
The Value for Google SRE
The Use Cases for Automation
Google SRE’s Use Cases for Automation
A Hierarchy of Automation Classes
Automate Yourself Out of a Job: Automate ALL the Things!
Soothing the Pain: Applying Automation to Cluster Turnups
Detecting Inconsistencies with Prodtest
Resolving Inconsistencies Idempotently
The Inclination to Specialize
Service-Oriented Cluster-Turnup
Borg: Birth of the Warehouse-Scale Computer
Reliability Is the Fundamental Feature
Recommendations
8. Release Engineering
The Role of a Release Engineer
Philosophy
Self-Service Model
High Velocity
Hermetic Builds
Enforcement of Policies and Procedures
Continuous Build and Deployment
Building
Branching
Testing
Packaging
Rapid
Deployment
Configuration Management
Conclusions
It’s Not Just for Googlers
Start Release Engineering at the Beginning
9. Simplicity
System Stability Versus Agility
The Virtue of Boring
I Won’t Give Up My Code!
The “Negative Lines of Code” Metric
Minimal APIs
Modularity
Release Simplicity
A Simple Conclusion
III. Practices
10. Practical Alerting from Time-Series Data
The Rise of Borgmon
Instrumentation of Applications
Collection of Exported Data
Storage in the Time-Series Arena
Labels and Vectors
Rule Evaluation
Alerting
Sharding the Monitoring Topology
Black-Box Monitoring
Maintaining the Configuration
Ten Years On…
11. Being On-Call
Introduction
Life of an On-Call Engineer
Balanced On-Call
Balance in Quantity
Balance in Quality
Compensation
Feeling Safe
Avoiding Inappropriate Operational Load
Operational Overload
A Treacherous Enemy: Operational Underload
Conclusions
12. Effective Troubleshooting
Theory
In Practice
Problem Report
Triage
Examine
Diagnose
Simplify and reduce
Ask “what,” “where,” and “why”
What touched it last
Specific diagnoses
Test and Treat
Negative Results Are Magic
Cure
Case Study
Making Troubleshooting Easier
Conclusion
13. Emergency Response
What to Do When Systems Break
Test-Induced Emergency
Details
Response
Findings
What went well
What we learned
Change-Induced Emergency
Details
Response
Findings
What went well
What we learned
Process-Induced Emergency
Details
Response
Findings
What went well
What we learned
All Problems Have Solutions
Learn from the Past. Don’t Repeat It.
Keep a History of Outages
Ask the Big, Even Improbable, Questions: What If…?
Encourage Proactive Testing
Conclusion
14. Managing Incidents
Unmanaged Incidents
The Anatomy of an Unmanaged Incident
Sharp Focus on the Technical Problem
Poor Communication
Freelancing
Elements of Incident Management Process
Recursive Separation of Responsibilities
A Recognized Command Post
Live Incident State Document
Clear, Live Handoff
A Managed Incident
When to Declare an Incident
In Summary
15. Postmortem Culture: Learning from Failure
Google’s Postmortem Philosophy
Collaborate and Share Knowledge
Introducing a Postmortem Culture
Conclusion and Ongoing Improvements
16. Tracking Outages
Escalator
Outalator
Aggregation
Tagging
Analysis
Reporting and communication
Unexpected Benefits
17. Testing for Reliability
Types of Software Testing
Traditional Tests
Unit tests
Integration tests
System tests
Production Tests
Configuration test
Stress test
Canary test
Creating a Test and Build Environment
Testing at Scale
Testing Scalable Tools
Testing Disaster
The Need for Speed
Pushing to Production
Expect Testing to Fail
Integration
Production Probes
Conclusion
18. Software Engineering in SRE
Why Is Software Engineering Within SRE Important?
Auxon Case Study: Project Background and Problem Space
Traditional Capacity Planning
Brittle by nature
Laborious and imprecise
Our Solution: Intent-Based Capacity Planning
Intent-Based Capacity Planning
Precursors to Intent
Dependencies
Performance metrics
Prioritization
Introduction to Auxon
Requirements and Implementation: Successes and Lessons Learned
Approximation
Raising Awareness and Driving Adoption
Set expectations
Identify appropriate customers
Customer service
Designing at the right level
Team Dynamics
Fostering Software Engineering in SRE
Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time
Getting There
Conclusions
19. Load Balancing at the Frontend
Power Isn’t the Answer
Load Balancing Using DNS
Load Balancing at the Virtual IP Address
20. Load Balancing in the Datacenter
The Ideal Case
Identifying Bad Tasks: Flow Control and Lame Ducks
A Simple Approach to Unhealthy Tasks: Flow Control
A Robust Approach to Unhealthy Tasks: Lame Duck State
Limiting the Connections Pool with Subsetting
Picking the Right Subset
A Subset Selection Algorithm: Random Subsetting
A Subset Selection Algorithm: Deterministic Subsetting
Load Balancing Policies
Simple Round Robin
Small subsetting
Varying query costs
Machine diversity
Unpredictable performance factors
Least-Loaded Round Robin
Weighted Round Robin
21. Handling Overload
The Pitfalls of “Queries per Second”
Per-Customer Limits
Client-Side Throttling
Criticality
Utilization Signals
Handling Overload Errors
Deciding to Retry
Load from Connections
Conclusions
22. Addressing Cascading Failures
Causes of Cascading Failures and Designing to Avoid Them
Server Overload
Resource Exhaustion
CPU
Memory
Threads
File descriptors
Dependencies among resources
Service Unavailability
Preventing Server Overload
Queue Management
Load Shedding and Graceful Degradation
Retries
Latency and Deadlines
Picking a deadline
Missing deadlines
Deadline propagation
Bimodal latency
Slow Startup and Cold Caching
Always Go Downward in the Stack
Triggering Conditions for Cascading Failures
Process Death
Process Updates
New Rollouts
Organic Growth
Planned Changes, Drains, or Turndowns
Request profile changes
Resource limits
Testing for Cascading Failures
Test Until Failure and Beyond
Test Popular Clients
Test Noncritical Backends
Immediate Steps to Address Cascading Failures
Increase Resources
Stop Health Check Failures/Deaths
Restart Servers
Drop Traffic
Enter Degraded Modes
Eliminate Batch Load
Eliminate Bad Traffic
Closing Remarks
23. Managing Critical State: Distributed Consensus for Reliability
Motivating the Use of Consensus: Distributed Systems Coordination Failure
Case Study 1: The Split-Brain Problem
Case Study 2: Failover Requires Human Intervention
Case Study 3: Faulty Group-Membership Algorithms
How Distributed Consensus Works
Paxos Overview: An Example Protocol
System Architecture Patterns for Distributed Consensus
Reliable Replicated State Machines
Reliable Replicated Datastores and Configuration Stores
Highly Available Processing Using Leader Election
Distributed Coordination and Locking Services
Reliable Distributed Queuing and Messaging
Distributed Consensus Performance
Multi-Paxos: Detailed Message Flow
Scaling Read-Heavy Workloads
Quorum Leases
Distributed Consensus Performance and Network Latency
Reasoning About Performance: Fast Paxos
Stable Leaders
Batching
Disk Access
Deploying Distributed Consensus-Based Systems
Number of Replicas
Location of Replicas
Capacity and Load Balancing
Quorum composition
Monitoring Distributed Consensus Systems
Conclusion
24. Distributed Periodic Scheduling with Cron
Cron
Introduction
Reliability Perspective
Cron Jobs and Idempotency
Cron at Large Scale
Extended Infrastructure
Extended Requirements
Building Cron at Google
Tracking the State of Cron Jobs
The Use of Paxos
The Roles of the Leader and the Follower
The leader
The follower
Resolving partial failures
Storing the State
Running Large Cron
Summary
25. Data Processing Pipelines
Origin of the Pipeline Design Pattern
Initial Effect of Big Data on the Simple Pipeline Pattern
Challenges with the Periodic Pipeline Pattern
Trouble Caused By Uneven Work Distribution
Drawbacks of Periodic Pipelines in Distributed Environments
Monitoring Problems in Periodic Pipelines
“Thundering Herd” Problems
Moiré Load Pattern
Introduction to Google Workflow
Workflow as Model-View-Controller Pattern
Stages of Execution in Workflow
Workflow Correctness Guarantees
Ensuring Business Continuity
Summary and Concluding Remarks
26. Data Integrity: What You Read Is What You Wrote
Data Integrity’s Strict Requirements
Choosing a Strategy for Superior Data Integrity
Backups Versus Archives
Requirements of the Cloud Environment in Perspective
Google SRE Objectives in Maintaining Data Integrity and Availability
Data Integrity Is the Means; Data Availability Is the Goal
Delivering a Recovery System, Rather Than a Backup System
Types of Failures That Lead to Data Loss
Challenges of Maintaining Data Integrity Deep and Wide
Scaling issues: Fulls, incrementals, and the competing forces of backups and restores
Retention
How Google SRE Faces the Challenges of Data Integrity
The 24 Combinations of Data Integrity Failure Modes
First Layer: Soft Deletion
Second Layer: Backups and Their Related Recovery Methods
Overarching Layer: Replication
1T Versus 1E: Not “Just” a Bigger Backup
Third Layer: Early Detection
Challenges faced by cloud developers
Out-of-band data validation
Knowing That Data Recovery Will Work
Case Studies
Gmail—February, 2011: Restore from GTape
Sunday, February 27, 2011, late in the evening
Google Music—March 2012: Runaway Deletion Detection
Tuesday, March 6th, 2012, mid-afternoon
Discovering the problem
Assessing the damage
Resolving the issue
Parallel bug identification and recovery efforts
First wave of recovery
Second wave of recovery
Addressing the root cause
General Principles of SRE as Applied to Data Integrity
Beginner’s Mind
Trust but Verify
Hope Is Not a Strategy
Defense in Depth
Conclusion
27. Reliable Product Launches at Scale
Launch Coordination Engineering
The Role of the Launch Coordination Engineer
Setting Up a Launch Process
The Launch Checklist
Driving Convergence and Simplification
Launching the Unexpected
Developing a Launch Checklist
Architecture and Dependencies
Example checklist questions
Example action items
Integration
Example action items
Capacity Planning
Example checklist questions
Failure Modes
Example checklist questions
Example action items
Client Behavior
Example checklist question
Example action items
Processes and Automation
Example checklist question
Example action items
Development Process
Example action items
External Dependencies
Example checklist questions
Rollout Planning
Example action items
Selected Techniques for Reliable Launches
Gradual and Staged Rollouts
Feature Flag Frameworks
Dealing with Abusive Client Behavior
Overload Behavior and Load Tests
Development of LCE
Evolution of the LCE Checklist
Problems LCE Didn’t Solve
Scalability changes
Growing operational load
Infrastructure churn
Conclusion
IV. Management
28. Accelerating SREs to On-Call and Beyond
You’ve Hired Your Next SRE(s), Now What?
Initial Learning Experiences: The Case for Structure Over Chaos
Learning Paths That Are Cumulative and Orderly
Targeted Project Work, Not Menial Work
Creating Stellar Reverse Engineers and Improvisational Thinkers
Reverse Engineers: Figuring Out How Things Work
Statistical and Comparative Thinkers: Stewards of the Scientific Method Under Pressure
Improv Artists: When the Unexpected Happens
Tying This Together: Reverse Engineering a Production Service
Five Practices for Aspiring On-Callers
A Hunger for Failure: Reading and Sharing Postmortems
Disaster Role Playing
Break Real Things, Fix Real Things
Documentation as Apprenticeship
Shadow On-Call Early and Often
On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
Closing Thoughts
29. Dealing with Interrupts
Managing Operational Load
Factors in Determining How Interrupts Are Handled
Imperfect Machines
Cognitive Flow State
Cognitive flow state: Creative and engaged
Cognitive flow state: Angry Birds
Do One Thing Well
Distractibility
Polarizing time
Seriously, Tell Me What to Do
General suggestions
On-call
Tickets
Ongoing responsibilities
Be on interrupts, or don’t be
Reducing Interrupts
Actually analyze tickets
Respect yourself, as well as your customers
30. Embedding an SRE to Recover from Operational Overload
Phase 1: Learn the Service and Get Context
Identify the Largest Sources of Stress
Identify Kindling
Phase 2: Sharing Context
Write a Good Postmortem for the Team
Sort Fires According to Type
Phase 3: Driving Change
Start with the Basics
Get Help Clearing Kindling
Explain Your Reasoning
Ask Leading Questions
Conclusion
31. Communication and Collaboration in SRE
Communications: Production Meetings
Agenda
Attendance
Collaboration within SRE
Team Composition
Techniques for Working Effectively
Case Study of Collaboration in SRE: Viceroy
The Coming of the Viceroy
Challenges
Recommendations
Collaboration Outside SRE
Case Study: Migrating DFP to F1
Conclusion
32. The Evolving SRE Engagement Model
SRE Engagement: What, How, and Why
The PRR Model
The SRE Engagement Model
Alternative Support
Documentation
Consultation
Production Readiness Reviews: Simple PRR Model
Engagement
Analysis
Improvements and Refactoring
Training
Onboarding
Continuous Improvement
Evolving the Simple PRR Model: Early Engagement
Candidates for Early Engagement
Benefits of the Early Engagement Model
Design phase
Build and implementation
Launch
Post-launch
Disengaging from a service
Evolving Services Development: Frameworks and SRE Platform
Lessons Learned
External Factors Affecting SRE
Toward a Structural Solution: Frameworks
New Service and Management Benefits
Significantly lower operational overhead
Universal support by design
Faster, lower overhead engagements
A new engagement model based on shared responsibility
Conclusion
V. Conclusions
33. Lessons Learned from Other Industries
Meet Our Industry Veterans
Preparedness and Disaster Testing
Relentless Organizational Focus on Safety
Attention to Detail
Swing Capacity
Simulations and Live Drills
Training and Certification
Focus on Detailed Requirements Gathering and Design
Defense in Depth and Breadth
Postmortem Culture
Automating Away Repetitive Work and Operational Overhead
Structured and Rational Decision Making
Conclusions
34. Conclusion
A. Availability Table
B. A Collection of Best Practices for Production Services
Fail Sanely
Progressive Rollouts
Define SLOs Like a User
Error Budgets
Monitoring
Postmortems
Capacity Planning
Overloads and Failure
SRE Teams
C. Example Incident State Document
D. Example Postmortem
Lessons Learned
What went well
What went wrong
Where we got lucky
Timeline
Supporting information:
E. Launch Coordination Checklist
F. Example Production Meeting Minutes
Bibliography
Index