Foreword I
Foreword II
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
1. How SRE Relates to DevOps
Background on DevOps
No More Silos
Accidents Are Normal
Change Should Be Gradual
Tooling and Culture Are Interrelated
Measurement Is Crucial
Background on SRE
Operations Is a Software Problem
Manage by Service Level Objectives (SLOs)
Work to Minimize Toil
Automate This Year’s Job Away
Move Fast by Reducing the Cost of Failure
Share Ownership with Developers
Use the Same Tooling, Regardless of Function or Job Title
Compare and Contrast
Organizational Context and Fostering Successful Adoption
Narrow, Rigid Incentives Narrow Your Success
It’s Better to Fix It Yourself; Don’t Blame Someone Else
Consider Reliability Work as a Specialized Role
When Can Substitute for Whether
Strive for Parity of Esteem: Career and Financial
Conclusion
I. Foundations
2. Implementing SLOs
Why SREs Need SLOs
Getting Started
Reliability Targets and Error Budgets
What to Measure: Using SLIs
Types of components
A Worked Example
Moving from SLI Specification to SLI Implementation
API and HTTP server availability and latency
Pipeline freshness, coverage, and correctness
Measuring the SLIs
Load balancer metrics
Calculating the SLIs
Using the SLIs to Calculate Starter SLOs
Choosing an Appropriate Time Window
Getting Stakeholder Agreement
Establishing an Error Budget Policy
Documenting the SLO and Error Budget Policy
Dashboards and Reports
Continuous Improvement of SLO Targets
Improving the Quality of Your SLO
Decision Making Using SLOs and Error Budgets
Advanced Topics
Modeling User Journeys
Grading Interaction Importance
Modeling Dependencies
Experimenting with Relaxing Your SLOs
Conclusion
3. SLO Engineering Case Studies
Evernote’s SLO Story
Why Did Evernote Adopt the SRE Model?
Introduction of SLOs: A Journey in Progress
Breaking Down the SLO Wall Between Customer and Cloud Provider
Current State
The Home Depot’s SLO Story
The SLO Culture Project
Our First Set of SLOs
Availability and latency for API calls
Infrastructure utilization
Traffic volume
Latency
Errors
Tickets
VALET
Evangelizing SLOs
Automating VALET Data Collection
TPS Reports
VALET service
VALET Dashboard
The Proliferation of SLOs
Applying VALET to Batch Applications
Using VALET in Testing
Future Aspirations
Summary
Conclusion
4. Monitoring
Desirable Features of a Monitoring Strategy
Speed
Calculations
Interfaces
Alerts
Sources of Monitoring Data
Examples
Move information from logs to metrics
Problem
Proposed solution
Outcome
Improve both logs and metrics
Problem
Proposed solution
Outcome
Keep logs as the data source
Problem
Proposed solution
Outcome
Managing Your Monitoring System
Treat Your Configuration as Code
Encourage Consistency
Prefer Loose Coupling
Metrics with Purpose
Intended Changes
Dependencies
Saturation
Status of Served Traffic
Implementing Purposeful Metrics
Testing Alerting Logic
Conclusion
5. Alerting on SLOs
Alerting Considerations
Ways to Alert on Significant Events
1: Target Error Rate ≥ SLO Threshold
2: Increased Alert Window
3: Incrementing Alert Duration
4: Alert on Burn Rate
5: Multiple Burn Rate Alerts
6: Multiwindow, Multi-Burn-Rate Alerts
Low-Traffic Services and Error Budget Alerting
Generating Artificial Traffic
Combining Services
Making Service and Infrastructure Changes
Lowering the SLO or Increasing the Window
Extreme Availability Goals
Alerting at Scale
Conclusion
6. Eliminating Toil
What Is Toil?
Measuring Toil
Toil Taxonomy
Business Processes
Production Interrupts
Release Shepherding
Migrations
Cost Engineering and Capacity Planning
Troubleshooting for Opaque Architectures
Toil Management Strategies
Identify and Measure Toil
Engineer Toil Out of the System
Reject the Toil
Use SLOs to Reduce Toil
Start with Human-Backed Interfaces
Provide Self-Service Methods
Get Support from Management and Colleagues
Promote Toil Reduction as a Feature
Start Small and Then Improve
Increase Uniformity
Assess Risk Within Automation
Automate Toil Response
Use Open Source and Third-Party Tools
Use Feedback to Improve
Case Studies
Case Study 1: Reducing Toil in the Datacenter with Automation
Background
Problem Statement
What We Decided to Do
Design First Effort: Saturn Line-Card Repair
Implementation
Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
Implementation
Lessons Learned
UIs should not introduce overhead or complexity
Don’t rely on human expertise
Design reusable components
Don’t overthink the problem
Sometimes imperfect automation is good enough
Repair automation is not fire and forget
Build in risk assessment and defense in depth
Get a failure budget and manager support
Think holistically
Case Study 2: Decommissioning Filer-Backed Home Directories
Background
Problem Statement
What We Decided to Do
Design and Implementation
Key Components
Moonwalk
Moira Portal
Archiving and migration automation
Lessons Learned
Challenge assumptions and retire expensive business processes
Build self-service interfaces
Start with human-backed interfaces
Melt snowflakes
Employ organizational nudges
Conclusion
7. Simplicity
Measuring Complexity
Simplicity Is End-to-End, and SREs Are Good for That
Case Study 1: End-to-End API Simplicity
Background
Lessons learned
Case Study 2: Project Lifecycle Complexity
Background
What we decided to do
Lessons learned
Regaining Simplicity
Case Study 3: Simplification of the Display Ads Spiderweb
Background
What we decided to do
Lessons learned
Case Study 4: Running Hundreds of Microservices on a Shared Platform
Background
What we decided to do
Design
Outcomes
Lessons learned
Case Study 5: pDNS No Longer Depends on Itself
Background
Problem statement
What we decided to do
Lessons learned
Conclusion
II. Practices
8. On-Call
Recap of “Being On-Call” Chapter of First SRE Book
Example On-Call Setups Within Google and Outside Google
Google: Forming a New Team
Initial scenario
Training roadmap
Afterword
Evernote: Finding Our Feet in the Cloud
Moving our on-prem infrastructure to the cloud
Adjusting our on-call policies and processes
Restructuring our monitoring and metrics
Tracking our performance over time
Engaging with CRE
Sustaining a self-perpetuating cycle
Practical Implementation Details
Anatomy of Pager Load
Scenario: A team in overload
Pager load inputs
Preexisting bugs
New bugs
Identification delay
Mitigation delay
Alerting
Rigor of follow-up
Data quality
Vigilance
On-Call Flexibility
Scenario: A change in personal circumstances
Automate on-call scheduling
Plan for short-term swaps
Plan for long-term breaks
Plan for part-time work schedules
On-Call Team Dynamics
Scenario: A culture of “survive the week”
Proposal one: Empower your ops engineers
Proposal two: Improve team relations
Conclusion
9. Incident Response
Incident Management at Google
Incident Command System
Main Roles in Incident Response
Case Studies
Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home
Context
Incident
Review
Case Study 2: Service Fault—Cache Me If You Can
Context
Incident
Review
What went well?
What could have been handled better?
Case Study 3: Power Outage—Lightning Never Strikes Twice…Until It Does
Context
Incident
Review
Case Study 4: Incident Response at PagerDuty
Major incident response at PagerDuty
Tools used for incident response
Putting Best Practices into Practice
Incident Response Training
Prepare Beforehand
Decide on a communication channel
Keep your audience informed
Prepare a list of contacts
Establish criteria for an incident
Drills
Conclusion
10. Postmortem Culture: Learning from Failure
Case Study
Bad Postmortem
Why Is This Postmortem Bad?
Missing context
Key details omitted
Key action item characteristics missing
Counterproductive finger pointing
Animated language
Missing ownership
Limited audience
Delayed publication
Good Postmortem
Why Is This Postmortem Better?
Clarity
Concrete action items
Blamelessness
Depth
Promptness
Conciseness
Organizational Incentives
Model and Enforce Blameless Behavior
Use blameless language
Include all incident participants in postmortem authoring
Gather feedback
Reward Postmortem Outcomes
Reward action item closeout
Reward positive organizational change
Highlight improved reliability
Hold up postmortem owners as leaders
Gamification
Share Postmortems Openly
Share announcements across the organization
Conduct cross-team reviews
Hold training exercises
Report incidents and outages weekly
Respond to Postmortem Culture Failures
Avoiding association
Failing to reinforce the culture
Lacking time to write postmortems
Repeating incidents
Tools and Templates
Postmortem Templates
Google’s template
Other industry templates
Postmortem Tooling
Postmortem creation
Postmortem checklist
Postmortem storage
Postmortem follow-up
Postmortem analysis
Other industry tools
Conclusion
11. Managing Load
Google Cloud Load Balancing
Anycast
Stabilized anycast
Maglev
Global Software Load Balancer
Google Front End
GCLB: Low Latency
GCLB: High Availability
Case Study 1: Pokémon GO on GCLB
Migrating to GCLB
Resolving the issue
Future-proofing
Autoscaling
Handling Unhealthy Machines
Working with Stateful Systems
Configuring Conservatively
Setting Constraints
Including Kill Switches and Manual Overrides
Avoiding Overloading Backends
Avoiding Traffic Imbalance
Combining Strategies to Manage Load
Case Study 2: When Load Shedding Attacks
What was happening?
What went wrong?
Lessons learned
Conclusion
12. Introducing Non-Abstract Large System Design
What Is NALSD?
Why “Non-Abstract”?
AdWords Example
Design Process
Initial Requirements
One Machine
Calculations
Evaluation
Distributed System
MapReduce
Evaluation
LogJoiner
Calculations
Sharded LogJoiner
Evaluation
Multidatacenter
Calculations
Evaluation
Conclusion
13. Data Processing Pipelines
Pipeline Applications
Event Processing/Data Transformation to Order or Structure Data
Data Analytics
Machine Learning
Pipeline Best Practices
Define and Measure Service Level Objectives
Data freshness
Data correctness
Data isolation/load balancing
End-to-end measurement
Plan for Dependency Failure
Create and Maintain Pipeline Documentation
System diagrams
Process documentation
Playbook entries
Map Your Development Lifecycle
Prototyping
Testing with a 1% dry run
Staging
Canarying
Performing a partial deployment
Deploying to production
Reduce Hotspotting and Workload Patterns
Implement Autoscaling and Resource Planning
Adhere to Access Control and Security Policies
Plan Escalation Paths
Pipeline Requirements and Design
What Features Do You Need?
Idempotent and Two-Phase Mutations
Checkpointing
Code Patterns
Reusing code
Using the microservice approach to creating pipelines
Pipeline Production Readiness
Pipeline maturity matrix
Pipeline Failures: Prevention and Response
Potential Failure Modes
Delayed data
Corrupt data
Potential Causes
Pipeline dependencies
Pipeline application or configuration
Unexpected resource growth
Region-level outage
Case Study: Spotify
Event Delivery
Event Delivery System Design and Architecture
Data collection
Extract Transform Load
Data delivery
Event Delivery System Operation
Timeliness
Skewness
Completeness
Customer Integration and Support
Documentation
System monitoring
Capacity planning
Development process
Incident handling
Summary
Conclusion
14. Configuration Design and Best Practices
What Is Configuration?
Configuration and Reliability
Separating Philosophy and Mechanics
Configuration Philosophy
Configuration Asks Users Questions
Questions Should Be Close to User Goals
Mandatory and Optional Questions
Escaping Simplicity
Mechanics of Configuration
Separate Configuration and Resulting Data
Importance of Tooling
Semantic validation
Configuration syntax
Ownership and Change Tracking
Safe Configuration Change Application
Conclusion
15. Configuration Specifics
Configuration-Induced Toil
Reducing Configuration-Induced Toil
Critical Properties and Pitfalls of Configuration Systems
Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
Pitfall 2: Designing Accidental or Ad Hoc Language Features
Pitfall 3: Building Too Much Domain-Specific Optimization
Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects”
Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
Integrating a Configuration Language
Generating Config in Specific Formats
Driving Multiple Applications
Integrating an Existing Application: Kubernetes
What Kubernetes Provides
Example Kubernetes Config
Integrating the Configuration Language
Integrating Custom Applications (In-House Software)
Effectively Operating a Configuration System
Versioning
Source Control
Tooling
Testing
When to Evaluate Configuration
Very Early: Checking in the JSON
Pros
Cons
Middle of the Road: Evaluate at Build Time
Pros
Cons
Late: Evaluate at Runtime
Pros
Cons
Guarding Against Abusive Configuration
Conclusion
16. Canarying Releases
Release Engineering Principles
Balancing Release Velocity and Reliability
What Is Canarying?
Release Engineering and Canarying
Requirements of a Canary Process
Our Example Setup
A Roll Forward Deployment Versus a Simple Canary Deployment
Canary Implementation
Minimizing Risk to SLOs and the Error Budget
Choosing a Canary Population and Duration
Selecting and Evaluating Metrics
Metrics Should Indicate Problems
Metrics Should Be Representative and Attributable
Before/After Evaluation Is Risky
Use a Gradual Canary for Better Metric Selection
Dependencies and Isolation
Canarying in Noninteractive Systems
Requirements on Monitoring Data
Related Concepts
Blue/Green Deployment
Artificial Load Generation
Traffic Teeing
Conclusion
III. Processes
17. Identifying and Recovering from Overload
From Load to Overload
Case Study 1: Work Overload When Half a Team Leaves
Background
Problem Statement
What We Decided to Do
Implementation
Lessons Learned
Case Study 2: Perceived Overload After Organizational and Workload Changes
Background
Problem Statement
What We Decided to Do
Implementation
Short-term actions
Mid-term actions
Long-term actions
Effects
Lessons Learned
Strategies for Mitigating Overload
Recognizing the Symptoms of Overload
Reducing Overload and Restoring Team Health
Identify and alleviate psychosocial stressors
Prioritize and triage within one quarter
Protect yourself in the future
Conclusion
18. SRE Engagement Model
The Service Lifecycle
Phase 1: Architecture and Design
Phase 2: Active Development
Phase 3: Limited Availability
Phase 4: General Availability
Phase 5: Deprecation
Phase 6: Abandoned
Phase 7: Unsupported
Setting Up the Relationship
Communicating Business and Production Priorities
Identifying Risks
Aligning Goals
Setting Ground Rules
Planning and Executing
Sustaining an Effective Ongoing Relationship
Investing Time in Working Better Together
Maintaining an Open Line of Communication
Performing Regular Service Reviews
Reassessing When Ground Rules Start to Slip
Adjusting Priorities According to Your SLOs and Error Budget
Handling Mistakes Appropriately
Sleep on it
Meet in person (or as close to it as possible) to resolve issues
Be positive
Understand differences in communication
Scaling SRE to Larger Environments
Supporting Multiple Services with a Single SRE Team
Structuring a Multiple SRE Team Environment
Adapting SRE Team Structures to Changing Circumstances
Running Cohesive Distributed SRE Teams
Ending the Relationship
Case Study 1: Ares
Case Study 2: Data Analysis Pipeline
The pivot
Communication breakdown
Decommission
Conclusion
19. SRE: Reaching Beyond Your Walls
Truths We Hold to Be Self-Evident
Reliability Is the Most Important Feature
Your Users, Not Your Monitoring, Decide Your Reliability
If You Run a Platform, Then Reliability Is a Partnership
Everything Important Eventually Becomes a Platform
When Your Customers Have a Hard Time, You Have to Slow Down
You Will Need to Practice SRE with Your Customers
How to: SRE with Your Customers
Step 1: SLOs and SLIs Are How You Speak
Step 2: Audit the Monitoring and Build Shared Dashboards
Step 3: Measure and Renegotiate
Step 4: Design Reviews and Risk Analysis
Step 5: Practice, Practice, Practice
Be Thoughtful and Disciplined
Conclusion
20. SRE Team Lifecycles
SRE Practices Without SREs
Starting an SRE Role
Finding Your First SRE
Placing Your First SRE
Bootstrapping Your First SRE
Distributed SREs
Your First SRE Team
Forming
Creating a new team as part of a major project
Assembling a horizontal SRE team
Converting a team in place
Storming
Risks and mitigations
New team as part of a major project
Horizontal SRE team
A team converted in place
Norming
Performing
Partnering on architecture
Self-regulating workload
Making More SRE Teams
Service Complexity
Where to split
Pitfalls
SRE Rollout
Geographical Splits
Placement: How many time zones apart should the teams be?
People and projects: Seeding the team
Parity: Distributing Work Between Offices and Avoiding a “Night Shift”
Placement: What about having three shifts?
Timing: Should both halves of the team start at the same time?
Finance: Travel budget
Leadership: Joint ownership of a service
Suggested Practices for Running Many Teams
Mission Control
SRE Exchange
Training
Horizontal Projects
SRE Mobility
Travel
Launch Coordination Engineering Teams
Production Excellence
SRE Funding and Hiring
Conclusion
21. Organizational Change Management in SRE
SRE Embraces Change
Introduction to Change Management
Lewin’s Three-Stage Model
McKinsey’s 7-S Model
Kotter’s Eight-Step Process for Leading Change
The Prosci ADKAR Model
Emotion-Based Models
The Deming Cycle
How These Theories Apply to SRE
Case Study 1: Scaling Waze—From Ad Hoc to Planned Change
Background
The Messaging Queue: Replacing a System While Maintaining Reliability
The Next Cycle of Change: Improving the Deployment Process
Lessons Learned
Case Study 2: Common Tooling Adoption in SRE
Background
Problem Statement
What We Decided to Do
Design
Implementation: Monitoring
Lessons Learned
Conclusion
Conclusion
Onward…
The Future Belongs to the Past
SRE + <Insert Other Discipline>
Trickles, Streams, and Floods
SRE Belongs to All of Us
On Gratitude
A. Example SLO Document
Service Overview
SLIs and SLOs
Rationale
Error Budget
Clarifications and Caveats
B. Example Error Budget Policy
Service Overview
Goals
Non-Goals
SLO Miss Policy
Outage Policy
Escalation Policy
Background
C. Results of Postmortem Analysis
Index