Foreword I
Foreword II
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
1. How SRE Relates to DevOps
Background on DevOps
No More Silos
Accidents Are Normal
Change Should Be Gradual
Tooling and Culture Are Interrelated
Measurement Is Crucial
Background on SRE
Operations Is a Software Problem
Manage by Service Level Objectives (SLOs)
Work to Minimize Toil
Automate This Year’s Job Away
Move Fast by Reducing the Cost of Failure
Share Ownership with Developers
Use the Same Tooling, Regardless of Function or Job Title
Compare and Contrast
Organizational Context and Fostering Successful Adoption
Narrow, Rigid Incentives Narrow Your Success
It’s Better to Fix It Yourself; Don’t Blame Someone Else
Consider Reliability Work as a Specialized Role
When Can Substitute for Whether
Strive for Parity of Esteem: Career and Financial
Conclusion
I. Foundations
2. Implementing SLOs
Why SREs Need SLOs
Getting Started
Reliability Targets and Error Budgets
What to Measure: Using SLIs
Types of components
A Worked Example
Moving from SLI Specification to SLI Implementation
API and HTTP server availability and latency
Pipeline freshness, coverage, and correctness
Measuring the SLIs
Load balancer metrics
Calculating the SLIs
Using the SLIs to Calculate Starter SLOs
Choosing an Appropriate Time Window
Getting Stakeholder Agreement
Establishing an Error Budget Policy
Documenting the SLO and Error Budget Policy
Dashboards and Reports
Continuous Improvement of SLO Targets
Improving the Quality of Your SLO
Decision Making Using SLOs and Error Budgets
Advanced Topics
Modeling User Journeys
Grading Interaction Importance
Modeling Dependencies
Experimenting with Relaxing Your SLOs
Conclusion
3. SLO Engineering Case Studies
Evernote’s SLO Story
Why Did Evernote Adopt the SRE Model?
Introduction of SLOs: A Journey in Progress
Breaking Down the SLO Wall Between Customer and Cloud Provider
Current State
The Home Depot’s SLO Story
The SLO Culture Project
Our First Set of SLOs
Availability and latency for API calls
Infrastructure utilization
Traffic volume
Latency
Errors
Tickets
VALET
Evangelizing SLOs
Automating VALET Data Collection
TPS Reports
VALET service
VALET Dashboard
The Proliferation of SLOs
Applying VALET to Batch Applications
Using VALET in Testing
Future Aspirations
Summary
Conclusion
4. Monitoring
Desirable Features of a Monitoring Strategy
Speed
Calculations
Interfaces
Alerts
Sources of Monitoring Data
Examples
Move information from logs to metrics
Problem
Proposed solution
Outcome
Improve both logs and metrics
Problem
Proposed solution
Outcome
Keep logs as the data source
Problem
Proposed solution
Outcome
Managing Your Monitoring System
Treat Your Configuration as Code
Encourage Consistency
Prefer Loose Coupling
Metrics with Purpose
Intended Changes
Dependencies
Saturation
Status of Served Traffic
Implementing Purposeful Metrics
Testing Alerting Logic
Conclusion
5. Alerting on SLOs
Alerting Considerations
Ways to Alert on Significant Events
1: Target Error Rate ≥ SLO Threshold
2: Increased Alert Window
3: Incrementing Alert Duration
4: Alert on Burn Rate
5: Multiple Burn Rate Alerts
6: Multiwindow, Multi-Burn-Rate Alerts
Low-Traffic Services and Error Budget Alerting
Generating Artificial Traffic
Combining Services
Making Service and Infrastructure Changes
Lowering the SLO or Increasing the Window
Extreme Availability Goals
Alerting at Scale
Conclusion
6. Eliminating Toil
What Is Toil?
Measuring Toil
Toil Taxonomy
Business Processes
Production Interrupts
Release Shepherding
Migrations
Cost Engineering and Capacity Planning
Troubleshooting for Opaque Architectures
Toil Management Strategies
Identify and Measure Toil
Engineer Toil Out of the System
Reject the Toil
Use SLOs to Reduce Toil
Start with Human-Backed Interfaces
Provide Self-Service Methods
Get Support from Management and Colleagues
Promote Toil Reduction as a Feature
Start Small and Then Improve
Increase Uniformity
Assess Risk Within Automation
Automate Toil Response
Use Open Source and Third-Party Tools
Use Feedback to Improve
Case Studies
Case Study 1: Reducing Toil in the Datacenter with Automation
Background
Problem Statement
What We Decided to Do
Design First Effort: Saturn Line-Card Repair
Implementation
Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
Implementation
Lessons Learned
UIs should not introduce overhead or complexity
Don’t rely on human expertise
Design reusable components
Don’t overthink the problem
Sometimes imperfect automation is good enough
Repair automation is not fire and forget
Build in risk assessment and defense in depth
Get a failure budget and manager support
Think holistically
Case Study 2: Decommissioning Filer-Backed Home Directories
Background
Problem Statement
What We Decided to Do
Design and Implementation
Key Components
Moonwalk
Moira Portal
Archiving and migration automation
Lessons Learned
Challenge assumptions and retire expensive business processes
Build self-service interfaces
Start with human-backed interfaces
Melt snowflakes
Employ organizational nudges
Conclusion
7. Simplicity
Measuring Complexity
Simplicity Is End-to-End, and SREs Are Good for That
Case Study 1: End-to-End API Simplicity
Background
Lessons learned
Case Study 2: Project Lifecycle Complexity
Background
What we decided to do
Lessons learned
Regaining Simplicity
Case Study 3: Simplification of the Display Ads Spiderweb
Background
What we decided to do
Lessons learned
Case Study 4: Running Hundreds of Microservices on a Shared Platform
Background
What we decided to do
Design
Outcomes
Lessons learned
Case Study 5: pDNS No Longer Depends on Itself
Background
Problem statement
What we decided to do
Lessons learned
Conclusion
II. Practices
8. On-Call
Recap of “Being On-Call” Chapter of First SRE Book
Example On-Call Setups Within Google and Outside Google
Google: Forming a New Team
Initial scenario
Training roadmap
Afterword
Evernote: Finding Our Feet in the Cloud
Moving our on-prem infrastructure to the cloud
Adjusting our on-call policies and processes
Restructuring our monitoring and metrics
Tracking our performance over time
Engaging with CRE
Sustaining a self-perpetuating cycle
Practical Implementation Details
Anatomy of Pager Load
Scenario: A team in overload
Pager load inputs
Preexisting bugs
New bugs
Identification delay
Mitigation delay
Alerting
Rigor of follow-up
Data quality
Vigilance
On-Call Flexibility
Scenario: A change in personal circumstances
Automate on-call scheduling
Plan for short-term swaps
Plan for long-term breaks
Plan for part-time work schedules
On-Call Team Dynamics
Scenario: A culture of “survive the week”
Proposal one: Empower your ops engineers
Proposal two: Improve team relations
Conclusion
9. Incident Response
Incident Management at Google
Incident Command System
Main Roles in Incident Response
Case Studies
Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home
Context
Incident
Review
Case Study 2: Service Fault—Cache Me If You Can
Context
Incident
Review
What went well?
What could have been handled better?
Case Study 3: Power Outage—Lightning Never Strikes Twice…Until It Does
Context
Incident
Review
Case Study 4: Incident Response at PagerDuty
Major incident response at PagerDuty
Tools used for incident response
Putting Best Practices into Practice
Incident Response Training
Prepare Beforehand
Decide on a communication channel
Keep your audience informed
Prepare a list of contacts
Establish criteria for an incident
Drills
Conclusion
10. Postmortem Culture: Learning from Failure
Case Study
Bad Postmortem
Why Is This Postmortem Bad?
Missing context
Key details omitted
Key action item characteristics missing
Counterproductive finger pointing
Animated language
Missing ownership
Limited audience
Delayed publication
Good Postmortem
Why Is This Postmortem Better?
Clarity
Concrete action items
Blamelessness
Depth
Promptness
Conciseness
Organizational Incentives
Model and Enforce Blameless Behavior
Use blameless language
Include all incident participants in postmortem authoring
Gather feedback
Reward Postmortem Outcomes
Reward action item closeout
Reward positive organizational change
Highlight improved reliability
Hold up postmortem owners as leaders
Gamification
Share Postmortems Openly
Share announcements across the organization
Conduct cross-team reviews
Hold training exercises
Report incidents and outages weekly
Respond to Postmortem Culture Failures
Avoiding association
Failing to reinforce the culture
Lacking time to write postmortems
Repeating incidents
Tools and Templates
Postmortem Templates
Google’s template
Other industry templates
Postmortem Tooling
Postmortem creation
Postmortem checklist
Postmortem storage
Postmortem follow-up
Postmortem analysis
Other industry tools
Conclusion
11. Managing Load
Google Cloud Load Balancing
Anycast
Stabilized anycast
Maglev
Global Software Load Balancer
Google Front End
GCLB: Low Latency
GCLB: High Availability
Case Study 1: Pokémon GO on GCLB
Migrating to GCLB
Resolving the issue
Future-proofing
Autoscaling
Handling Unhealthy Machines
Working with Stateful Systems
Configuring Conservatively
Setting Constraints
Including Kill Switches and Manual Overrides
Avoiding Overloading Backends
Avoiding Traffic Imbalance
Combining Strategies to Manage Load
Case Study 2: When Load Shedding Attacks
What was happening?
What went wrong?
Lessons learned
Conclusion
12. Introducing Non-Abstract Large System Design
What Is NALSD?
Why “Non-Abstract”?
AdWords Example
Design Process
Initial Requirements
One Machine
Calculations
Evaluation
Distributed System
MapReduce
Evaluation
LogJoiner
Calculations
Sharded LogJoiner
Evaluation
Multidatacenter
Calculations
Evaluation
Conclusion
13. Data Processing Pipelines
Pipeline Applications
Event Processing/Data Transformation to Order or Structure Data
Data Analytics
Machine Learning
Pipeline Best Practices
Define and Measure Service Level Objectives
Data freshness
Data correctness
Data isolation/load balancing
End-to-end measurement
Plan for Dependency Failure
Create and Maintain Pipeline Documentation
System diagrams
Process documentation
Playbook entries
Map Your Development Lifecycle
Prototyping
Testing with a 1% dry run
Staging
Canarying
Performing a partial deployment
Deploying to production
Reduce Hotspotting and Workload Patterns
Implement Autoscaling and Resource Planning
Adhere to Access Control and Security Policies
Plan Escalation Paths
Pipeline Requirements and Design
What Features Do You Need?
Idempotent and Two-Phase Mutations
Checkpointing
Code Patterns
Reusing code
Using the microservice approach to creating pipelines
Pipeline Production Readiness
Pipeline maturity matrix
Pipeline Failures: Prevention and Response
Potential Failure Modes
Delayed data
Corrupt data
Potential Causes
Pipeline dependencies
Pipeline application or configuration
Unexpected resource growth
Region-level outage
Case Study: Spotify
Event Delivery
Event Delivery System Design and Architecture
Data collection
Extract Transform Load
Data delivery
Event Delivery System Operation
Timeliness
Skewness
Completeness
Customer Integration and Support
Documentation
System monitoring
Capacity planning
Development process
Incident handling
Summary
Conclusion
14. Configuration Design and Best Practices
What Is Configuration?
Configuration and Reliability
Separating Philosophy and Mechanics
Configuration Philosophy
Configuration Asks Users Questions
Questions Should Be Close to User Goals
Mandatory and Optional Questions
Escaping Simplicity
Mechanics of Configuration
Separate Configuration and Resulting Data
Importance of Tooling
Semantic validation
Configuration syntax
Ownership and Change Tracking
Safe Configuration Change Application
Conclusion
15. Configuration Specifics
Configuration-Induced Toil
Reducing Configuration-Induced Toil
Critical Properties and Pitfalls of Configuration Systems
Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
Pitfall 2: Designing Accidental or Ad Hoc Language Features
Pitfall 3: Building Too Much Domain-Specific Optimization
Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects”
Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
Integrating a Configuration Language
Generating Config in Specific Formats
Driving Multiple Applications
Integrating an Existing Application: Kubernetes
What Kubernetes Provides
Example Kubernetes Config
Integrating the Configuration Language
Integrating Custom Applications (In-House Software)
Effectively Operating a Configuration System
Versioning
Source Control
Tooling
Testing
When to Evaluate Configuration
Very Early: Checking in the JSON
Pros
Cons
Middle of the Road: Evaluate at Build Time
Pros
Cons
Late: Evaluate at Runtime
Pros
Cons
Guarding Against Abusive Configuration
Conclusion
16. Canarying Releases
Release Engineering Principles
Balancing Release Velocity and Reliability
What Is Canarying?
Release Engineering and Canarying
Requirements of a Canary Process
Our Example Setup
A Roll Forward Deployment Versus a Simple Canary Deployment
Canary Implementation
Minimizing Risk to SLOs and the Error Budget
Choosing a Canary Population and Duration
Selecting and Evaluating Metrics
Metrics Should Indicate Problems
Metrics Should Be Representative and Attributable
Before/After Evaluation Is Risky
Use a Gradual Canary for Better Metric Selection
Dependencies and Isolation
Canarying in Noninteractive Systems
Requirements on Monitoring Data
Related Concepts
Blue/Green Deployment
Artificial Load Generation
Traffic Teeing
Conclusion
III. Processes
17. Identifying and Recovering from Overload
From Load to Overload
Case Study 1: Work Overload When Half a Team Leaves
Background
Problem Statement
What We Decided to Do
Implementation
Lessons Learned
Case Study 2: Perceived Overload After Organizational and Workload Changes
Background
Problem Statement
What We Decided to Do
Implementation
Short-term actions
Mid-term actions
Long-term actions
Effects
Lessons Learned
Strategies for Mitigating Overload
Recognizing the Symptoms of Overload
Reducing Overload and Restoring Team Health
Identify and alleviate psychosocial stressors
Prioritize and triage within one quarter
Protect yourself in the future
Conclusion
18. SRE Engagement Model
The Service Lifecycle
Phase 1: Architecture and Design
Phase 2: Active Development
Phase 3: Limited Availability
Phase 4: General Availability
Phase 5: Deprecation
Phase 6: Abandoned
Phase 7: Unsupported
Setting Up the Relationship
Communicating Business and Production Priorities
Identifying Risks
Aligning Goals
Setting Ground Rules
Planning and Executing
Sustaining an Effective Ongoing Relationship
Investing Time in Working Better Together
Maintaining an Open Line of Communication
Performing Regular Service Reviews
Reassessing When Ground Rules Start to Slip
Adjusting Priorities According to Your SLOs and Error Budget
Handling Mistakes Appropriately
Sleep on it
Meet in person (or as close to it as possible) to resolve issues
Be positive
Understand differences in communication
Scaling SRE to Larger Environments
Supporting Multiple Services with a Single SRE Team
Structuring a Multiple SRE Team Environment
Adapting SRE Team Structures to Changing Circumstances
Running Cohesive Distributed SRE Teams
Ending the Relationship
Case Study 1: Ares
Case Study 2: Data Analysis Pipeline
The pivot
Communication breakdown
Decommission
Conclusion
19. SRE: Reaching Beyond Your Walls
Truths We Hold to Be Self-Evident
Reliability Is the Most Important Feature
Your Users, Not Your Monitoring, Decide Your Reliability
If You Run a Platform, Then Reliability Is a Partnership
Everything Important Eventually Becomes a Platform
When Your Customers Have a Hard Time, You Have to Slow Down
You Will Need to Practice SRE with Your Customers
How to: SRE with Your Customers
Step 1: SLOs and SLIs Are How You Speak
Step 2: Audit the Monitoring and Build Shared Dashboards
Step 3: Measure and Renegotiate
Step 4: Design Reviews and Risk Analysis
Step 5: Practice, Practice, Practice
Be Thoughtful and Disciplined
Conclusion
20. SRE Team Lifecycles
SRE Practices Without SREs
Starting an SRE Role
Finding Your First SRE
Placing Your First SRE
Bootstrapping Your First SRE
Distributed SREs
Your First SRE Team
Forming
Creating a new team as part of a major project
Assembling a horizontal SRE team
Converting a team in place
Storming
Risks and mitigations
New team as part of a major project
Horizontal SRE team
A team converted in place
Norming
Performing
Partnering on architecture
Self-regulating workload
Making More SRE Teams
Service Complexity
Where to split
Pitfalls
SRE Rollout
Geographical Splits
Placement: How many time zones apart should the teams be?
People and projects: Seeding the team
Parity: Distributing Work Between Offices and Avoiding a “Night Shift”
Placement: What about having three shifts?
Timing: Should both halves of the team start at the same time?
Finance: Travel budget
Leadership: Joint ownership of a service
Suggested Practices for Running Many Teams
Mission Control
SRE Exchange
Training
Horizontal Projects
SRE Mobility
Travel
Launch Coordination Engineering Teams
Production Excellence
SRE Funding and Hiring
Conclusion
21. Organizational Change Management in SRE
SRE Embraces Change
Introduction to Change Management
Lewin’s Three-Stage Model
McKinsey’s 7-S Model
Kotter’s Eight-Step Process for Leading Change
The Prosci ADKAR Model
Emotion-Based Models
The Deming Cycle
How These Theories Apply to SRE
Case Study 1: Scaling Waze—From Ad Hoc to Planned Change
Background
The Messaging Queue: Replacing a System While Maintaining Reliability
The Next Cycle of Change: Improving the Deployment Process
Lessons Learned
Case Study 2: Common Tooling Adoption in SRE
Background
Problem Statement
What We Decided to Do
Design
Implementation: Monitoring
Lessons Learned
Conclusion
Conclusion
Onward…
The Future Belongs to the Past
SRE + <Insert Other Discipline>
Trickles, Streams, and Floods
SRE Belongs to All of Us
On Gratitude
A. Example SLO Document
Service Overview
SLIs and SLOs
Rationale
Error Budget
Clarifications and Caveats
B. Example Error Budget Policy
Service Overview
Goals
Non-Goals
SLO Miss Policy
Outage Policy
Escalation Policy
Background
C. Results of Postmortem Analysis
Index