Index
Introduction
And So It Begins...
Origin Story
Voices
Forward in All Directions!
Acknowledgments
I. SRE Implementation
1. Context Versus Control in SRE
2. Interviewing Site Reliability Engineers
Interviewing 101
Who Is Involved
Industry Versus University
Biases
The Funnel
SRE Funnels
Phone Screens
Conducting a phone screen
The Onsite Interview
Coding and system questions
Deep dives and architecture questions
Cultural interviews
Take-Home Questions
Advice for Hiring Managers
Selling candidates
Walking away
Final Thoughts on Interviewing SREs
Further Reading
3. So, You Want to Build an SRE Team?
Choose SRE for the Right Reasons
Orienting to a Data-Driven Approach
Commitment to SRE
Making a Decision About SRE
4. Using Incident Metrics to Improve SRE at Scale
The Virtuous Cycle to the Rescue: If You Don’t Measure It…
Metrics Review: If a Metric Falls in the Forest…
Surrogate Metrics
Repair Debt
Virtual Repair Debt: Exorcising the Ghost in the Machine
Real-Time Dashboards: The Bread and Butter of SRE
Learnings: TL;DR
Further Reading
5. Working with Third Parties Shouldn’t Suck
Build, Buy, or Adopt?
Establish Importance
Identify Stakeholders
Make a Decision
Acknowledge Reality
Is this a core competency?
Integration timeline?
Project Operating Expense and Abandonment Expense
Third Parties as First-Class Citizens
When They’re Down, You’re Down
Direct impact
Indirect impact
Running the Black Box Like a Service
Service-Level Indicators, Service-Level Objectives, and SLAs
SLIs on black boxes
Polling API informs SLIs
Real-time data informs SLIs
Synthetic monitoring informs SLIs
RUM informs SLIs
SLOs
Negotiating SLAs with vendors
Playbook: From Staging to Production
Testing and staging
Monitoring
Uses for synthetic monitoring
Uses for RUM
Tooling
Automation
Logging
Disaster planning
Communication
Decommissioning
Closing Thoughts
6. How to Apply SRE Principles Without Dedicated SRE Teams
SREs to the Rescue! (and How They Failed)
A Matter of Scale in Terms of Headcount
The Embedded SRE
You Build It, You Run It
The Deployment Platform
Closing the Loop: Take Your Own Pager
Introducing Production Engineering
Some Implementation Details
Developers’ Productivity and Health Versus the Pager
Resolving Cross-Team Reliability Issues by Using Postmortems
Uniform Infrastructure and Tooling Versus Autonomy and Innovation
Getting Buy-In
Conclusion
Further Reading
7. SRE Without SRE: The Spotify Case Study
Tabula Rasa: 2006–2007
Prelude
Key Learnings
Beta and Release: 2008–2009
Prelude
Bringing Scalability and Reliability to the Forefront
Key Learnings
The Curse of Success: 2010
Prelude
A New Ownership Model
The dev owner role
The ops owner role
Formalizing Core Services
Blessed Deployment Time Slots
On-Call and Alerting
Not completely pain-free
Spawning Off Internal Office Support
Addressing the Remaining Top Concerns
Long lead times
Unintentional specialization and misalignment
Interruptions
Introducing the goalie role
Creating Detectives
Key Learnings
Pets and Cattle, and Agile: 2011
Prelude
Forming Bad Habits
Breaking Those Bad Habits
Key Learnings
A System That Didn’t Scale: 2012
Prelude
Manual Work Hits a Cliff
Key Learnings
Introducing Ops-in-Squads: 2013–2015
Prelude
Lightening the manual load
Building on Trust
Driving the Paradigm Shift
Key Learnings
Autonomy Versus Consistency: 2015–2017
Prelude
Benefits
Trade-Offs
Key Learnings
The Future: Speed at Scale, Safely
8. Introducing SRE in Large Enterprises
Background
Introducing SRE
Defining Current State
Start by defining the roles and responsibilities of traditional functions in the organization to understand the landscape
Prepare the business case: personalize and evaluate the cost of having engineering resources responsible for reliability
Prepare the business case: calculate cost of similar resources doing duplicate work
To establish a roadmap for what products SRE will be responsible for, survey the current infrastructure landscape
Identifying and Educating Stakeholders
Start having conversations with leaders and champions in the organization
Defining SRE
Presenting the Business Case
Implementing the SRE Team
Setting goals and defining metrics of success
Growing the team: insource or outsource?
Insourcing experienced talent: rotating engineering team members
SRE throughout the development cycle
Defining the role of supporting divisions
Lessons Learned
Sample Implementation Roadmap
Closing Thoughts
Further Reading
9. From SysAdmin to SRE in 8,963 Words
Clarifying Terminology
Service-Level Indicator
SLA
Service-Level Objective
Establishing SLAs for Internal Components
Understanding External Dependencies
Nontechnical Solutions
Tracking Availability Level
Dealing with Corner Cases
Conclusion
10. Clearing the Way for SRE in the Enterprise
Toil, the Enemy of SRE
Toil in the Enterprise
Silos, Queues, and Tickets
Silos Get in the Way
Ticket-Driven Request Queues Are Expensive
Take Action Now
Start by Leaning on Lean
Get Rid of as Many Handoffs as Possible
Replace Remaining Handoffs with Self-Service
Self-Service Is More Than a Button
Self-Service Helps SREs in Multiple Ways
Operations as a Service
Error Budgets, Toil Limits, and Other Tools for Empowering Humans
Error Budgets
Toil Limits
Leverage Existing Enthusiasm for DevOps
Unify Backlogs and Protect Capacity
Psychological Safety and Human Factors
Join the Movement
11. SRE Patterns Loved by DevOps People Everywhere
Pattern 1: Birth of Automated Testing at Google
Pattern 2: Launch and Handoff Readiness Review at Google
Pattern 3: Create a Shared Source Code Repository
Conclusion
Further Reading and Source Material
12. DevOps and SRE: Voices from the Community
Background
Method
Results
Replies
13. Production Engineering at Facebook
II. Near Edge SRE
14. In the Beginning, There Was Chaos
The Problem with Systems
Economic Pillars of Complexity
Beginning Chaos
Navigating Complexity for Safety
Chaos Goes Big
Formalization
Advanced Principles
Frequently Asked Questions
Conclusion
15. The Intersection of Reliability and Privacy
The Intersection of Reliability and Privacy
The General Landscape of Privacy Engineering
Privacy and SRE: Common Approaches
Reducing Toil
Automation
Default behavior for shared architectures
Frameworks
Efficient and Deliberate Problem Solving
Solve challenges once
Find and address root causes
Relationship Management
Early Intervention and Education Through Evangelism
Nuances, Differences, and Trade-Offs
Conclusion
Further Reading
16. Database Reliability Engineering
Guiding Principles of the Database Reliability Engineer
Protect the Data
Self-Service for Scale
Databases Are Not Special
A Culture of Database Reliability Engineering
Recoverability
Considerations for Recovery
Anatomy of a Recovery Strategy
Building Block 1: Detection
User error
Application errors
Infrastructure services
Operating system and hardware errors
Building Block 2: Diverse Storage
Online, high-performance storage
Online, low-performance storage
Offline storage
Object storage
Building Block 3: A Varied Toolbox
Full physical backups
Incremental physical backups
Full and incremental logical backups
Object stores
Building Block 4: Testing
Championing Recovery Reliability
Continuous Delivery: From Development to Production
Education and Collaboration
Architecture
Data model
Best practices and standards
Tools
Collaboration
Deployment
Migrations and Versioning
Impact Analysis
Migration Patterns
Migration testing
Rollback testing
Championing CD
Making the Case for DBRE
Further Reading
17. Engineering for Data Durability
Replication Is Table Stakes
Backups
Restoration
Freshness
Replication
Estimating durability
Real-World Durability
Isolation
Physical isolation
Logical isolation
Operational isolation
Protection
Testing
Safeguards
Recovery
Verification
The Power of Zero
Verification Coverage
Disk Scrubber
Index Scanner
Storage Watcher
Watching the Watchers
Automation
Window of Vulnerability
Operator Fatigue
Reliability
Conclusion
18. Introduction to Machine Learning for SRE
Why Use Machine Learning for SRE?
Why and How Should My Company Be Engaging in This?
Some SRE Problems Machine Learning Can Help Solve
The Awakening of Applied AI
What Is Machine Learning?
What Do We Mean by Learning?
From Chess to Go: How Deep Can We Dive?
Why Now? What Changed for Us?
What Are Neural Networks?
Neurons and Neural Networks
How and When Should We Apply Neural Networks?
What Kinds of Data Can We Use?
Practical Machine Learning
Popular Libraries for Neural Networks
Practical Machine Learning Examples
Installing Python, IPython, and Jupyter Notebook
Decision trees
A neural network from scratch
Using TensorFlow and TensorBoard
Time series: server requests waiting
Success Stories
Further Reading
My GitHub Repository
Recommended Books
III. SRE Best Practices and Technologies
19. Do Docs Better: Integrating Documentation into the Engineering Workflow
Defining Quality: What Do Good Docs Look Like?
Functional Requirements for SRE Documentation
Service overviews
Playbooks
Postmortems
Policies
SLAs
Defining success metrics
Integrating Docs into the Engineering Workflow
The Google Experience: g3doc and EngPlay
What We Learned
Where possible, documentation should live in source control, alongside its associated code
Pick the simplest markup language that supports your needs
Integrations are key to adoption
Doing Docs Better: Best Practices
Create Templates for Each Documentation Type
Better > Best: Set Realistic Standards for Quality
Require Docs as Part of Code Review
Ruthlessly Prune Your Docs
Recognize and Reward Documentation
Communicating the Value of Documentation
Further Reading
20. Active Teaching and Learning
Active Learning
Active Learning Example: Wheel of Misfortune
Active Learning Example: Incident Manager (a Card Game)
Active Learning Example: SRE Classroom
The Costs of Failing to Learn
Learning Habits of Effective SRE Teams
Production Meetings
Postmortems
A Call to Action: Ditch the Boring Slides
21. The Art and Science of the Service-Level Objective
Why Set Goals?
Availability
Time Quanta
Transactions
Transactions over Time Quanta
On Evaluating SLOs
Histograms
Where Percentiles Fall Down (and Histograms Step Up)
Parting Thought: Looking at SLOs Upside Down
Further Reading
22. SRE as a Success Culture
Where Did SRE Come From?
Key Values for SRE
Keeping the Site Up
Isolated failure domains
Redundant systems
Graduated degradation
Empowering Teams to “Do the Right Thing”
Approaching Operations as an Engineering Problem
Achieving Business Success Through Promises (Service Levels)
Progression in Service-Level Execution
Critical Enabling Functions of SRE
Monitoring, Metrics, and KPIs
Incident Management and Emergency Response
Capacity Planning and Demand Forecasting
Performance Analysis and Optimization
Provisioning, Change Management, and Velocity
Phases of SRE Execution
Phase 1: Firefighting/Reactive
Phase 2: Gatekeepers
Phase 3: Advocates/Partners
Phase 4: Catalytic
Complications of Differing Phases
Focus on the Details of Success
Further Reading
23. SRE Antipatterns
Antipattern 1: Site Reliability Operations
Antipattern 2: Humans Staring at Screens
Antipattern 3: Mob Incident Response
Antipattern 4: Root Cause = Human Error
Antipattern 5: Passing the Pager
Antipattern 6: Magic Smoke Jumping!
Antipattern 7: Alert Reliability Engineering
Antipattern 8: Hiring a Dog-Walker to Tend Your Pets
Antipattern 9: Speed-Bump Engineering
Antipattern 10: Design Chokepoints
Antipattern 11: Too Much Stick, Not Enough Carrot
Antipattern 12: Postponing Production
Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)
Antipattern 14: Dependency Hell
Antipattern 15: Ungainly Governance
Antipattern 16: Ill-Considered SLOh-Ohs
Antipattern 17: Tossing Your API Over the Firewall
Antipattern 18: Fixing the Ops Team
So, That’s It, Then?
24. Immutable Infrastructure and SRE
Scalability, Reliability, and Performance
Failure Recovery
Simpler Operations
Faster Startup Times
Known State
Continuous Integration/Continuous Deployment with Confidence
Security
Multiregion Operations
Release Engineering
Building the Base Image
Deploying Applications
Disadvantages
Conclusion
25. Scriptable Load Balancers
Scriptable Load Balancers: The New Kid on the Block
Why Scriptable Load Balancers?
Making the Difficult Easy
Shard-Aware Routing
Routing requests with DNS
Routing queries in the application
Routing requests in the application
Routing requests with a scriptable load balancer
Harnessing Potential
Case Study: Intermission
Service-Level Middleware
Middleware to the Rescue
APIs of Service-Level Middleware
Case Study: WAF/Bot Mitigation
Avoiding Disaster
Getting Clever with State
Case Study: Checkout Queue
Looking to the Future and Further Reading
26. The Service Mesh: Wrangler of Your Microservices?
Ready to Get Rid of the Monolith?
Current State of Microservice Networking
Service Mesh to the Rescue
The Benefits of a Sidecar Proxy
Eventually Consistent Service Discovery
Observability and Alarming
Sidecar Performance Implications
Thin Libraries and Context Propagation
Configuration Management (Control Plane Versus Data Plane)
The Service Mesh in Practice
The Origin and Development of Envoy at Lyft
Operating Envoy at Lyft
Operational learnings
Development learnings
Technical learnings
The Future of the Service Mesh
Further Reading
IV. The Human Side of SRE
27. Psychological Safety in SRE
The Primary Indicator of a Successful Team
How to Build Psychological Safety into Your Own Team
Make respect part of your team’s culture
Make space for people to take chances
Make it obvious when your team is doing well
Make your communication clear and your expectations explicit
Make your team feel safe
Why are operations teams more likely to feel unsafe than other engineering teams?
We love interrupts and the torrents of information
On-call and operations
Cognitive overload
Imaginary expectations
Operations teams are bad at estimating their level of psychological safety
Further Reading
28. SRE Cognitive Work
Introduction
What Do SRE People Do?
Why Should We Care About Practitioner Cognition?
Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
Human Performance in Modern Complex Systems: The Main Themes
Observations on SRE Cognitive Work Around Incidents
Every Incident Could Have Been Worse
Sacrifice Decisions Take Place Under Uncertainty
Repairs to Functional Systems
Special Knowledge About Complex Systems
Managing the Costs of Coordination
Classification schemes
Formal role assignments
SREs Are Cognitive Agents Working in a Joint Cognitive System
The Calibration Problem
Mental Models
Incidents Trigger Individual Recalibration
Incidents Are Opportunities for Collective Recalibration
What Are the Implications of All This?
Incidents Will Continue
Incidents Will Impose Costs
Incident Patterns Will Change
Incidents Point to Specific Calibration Problems and Locations
What Should Happen Next?
Build a Corpus of Cases
Focus on Making Automation a Team Player in SRE Work
Address the Calibration Problem
What Can You Do?
Conclusion
References
29. Beyond Burnout
Defining Mental Disorders
Mental Disorders Are Missing from the Diversity Conversation
Sanity Isn’t a Business Requirement
Thoughts and Prayers Aren’t Scalable
Full-Stack Inclusivity
Application
Interviewing
Compensation
Benefits
Onboarding
Working Conditions
Job Duties
Training
Promotion
Leaving
Inclusivity for Anyone Helps Everyone
Mental Disorder Resources
30. Against On-Call: A Polemic
The Rationale for On-Call
First, Do No Harm
Parallels with SRE
Differences with SRE
Underlying Assumptions Driving On-Call for Engineers
On-Call Is Emergency Medicine Instead of Ward Medicine
Counterarguments
The Cost to Humans of Doing On-Call
We don’t need another hero
Actual Solutions
Training
Prioritization
Accommodations
Compensation
Flexible schedules
Recovery
Exclusion backlash
Improving On-the-Job Performance
Cognitive hacks
We Need a Fundamental Change in Approach
Strong-Anti-On-Call
Weak-Anti-On-Call
A Union of the Two
Conclusion
31. Elegy for Complex Systems
The Computer and Human Systems Cannot Be Separated
Decoherence and Cascading Failure
Always in a State of Partial Failure
Novelty Priority Inversion
Nobody Anticipates the Overhead of Coordination
Your healthcare.gov Is Out There
To Get Involved
Further Reading
32. Intersections Between Operations and Social Activism
Before, During, After
Creating the Perfect Plan
Principles of Organizing
Principles 1 and 2 (interfaces and incident command)
Principles 3 and 4 (blameless retrospectives and psychological safety)
Managing Crisis: Responding When Things Break Down
Handling chaos: contrast in responses during the July 8 KKK rally
Preparing for the worst: handling terror at Unite the Right
The corollary to trust is forgiveness
Writing Our Own History: Making Sense of What Went Down
Charlottesville in review: assigning and avoiding blame
Beyond culpability: building capacity instead of assigning blame
The Long Tail: Turning Action into Change
Activism and Change Within a Company
Conclusion
33. Conclusion
Index