B07gq2yy1d Ebook by Blank-Edelman, David N.

Index

Introduction

And So It Begins... Origin Story Voices Forward in All Directions! Acknowledgments

I. SRE Implementation

1. Context Versus Control in SRE

2. Interviewing Site Reliability Engineers

Interviewing 101

Who Is Involved Industry Versus University Biases The Funnel

SRE Funnels

Phone Screens

Conducting a phone screen

The Onsite Interview

Coding and system questions Deep dives and architecture questions Cultural interviews

Take-Home Questions Advice for Hiring Managers

Selling candidates Walking away

Final Thoughts on Interviewing SREs Further Reading

3. So, You Want to Build an SRE Team?

Choose SRE for the Right Reasons Orienting to a Data-Driven Approach Commitment to SRE Making a Decision About SRE

4. Using Incident Metrics to Improve SRE at Scale

The Virtuous Cycle to the Rescue: If You Don’t Measure It… Metrics Review: If a Metric Falls in the Forest… Surrogate Metrics Repair Debt Virtual Repair Debt: Exorcising the Ghost in the Machine Real-Time Dashboards: The Bread and Butter of SRE Learnings: TL;DR Further Reading

5. Working with Third Parties Shouldn’t Suck

Build, Buy, or Adopt?

Establish Importance Identify Stakeholders Make a Decision Acknowledge Reality

Is this a core competency? Integration timeline? Project Operating Expense and Abandonment Expense

Third Parties as First-Class Citizens

When They’re Down, You’re Down

Direct impact Indirect impact

Running the Black Box Like a Service Service-Level Indicators, Service-Level Objectives, and SLAs

SLIs on black boxes

Polling API informs SLIs Real-time data informs SLIs Synthetic monitoring informs SLIs RUM informs SLIs

SLOs

Negotiating SLAs with vendors

Playbook: From Staging to Production

Testing and staging Monitoring

Uses for synthetic monitoring Uses for RUM

Tooling Automation Logging Disaster planning Communication Decommissioning

Closing Thoughts

6. How to Apply SRE Principles Without Dedicated SRE Teams

SREs to the Rescue! (and How They Failed)

A Matter of Scale in Terms of Headcount The Embedded SRE

You Build It, You Run It

The Deployment Platform Closing the Loop: Take Your Own Pager Introducing Production Engineering

Some Implementation Details

Developers’ Productivity and Health Versus the Pager Resolving Cross-Team Reliability Issues by Using Postmortems Uniform Infrastructure and Tooling Versus Autonomy and Innovation Getting Buy-In

Conclusion Further Reading

7. SRE Without SRE: The Spotify Case Study

Tabula Rasa: 2006–2007

Prelude Key Learnings

Beta and Release: 2008–2009

Prelude Bringing Scalability and Reliability to the Forefront Key Learnings

The Curse of Success: 2010

Prelude A New Ownership Model

The dev owner role The ops owner role

Formalizing Core Services Blessed Deployment Time Slots On-Call and Alerting

Not completely pain-free

Spawning Off Internal Office Support Addressing the Remaining Top Concerns

Long lead times Unintentional specialization and misalignment Interruptions Introducing the goalie role

Creating Detectives Key Learnings

Pets and Cattle, and Agile: 2011

Prelude Forming Bad Habits Breaking Those Bad Habits Key Learnings

A System That Didn’t Scale: 2012

Prelude Manual Work Hits a Cliff Key Learnings

Introducing Ops-in-Squads: 2013–2015

Prelude

Lightening the manual load

Building on Trust Driving the Paradigm Shift Key Learnings

Autonomy Versus Consistency: 2015–2017

Prelude Benefits Trade-Offs Key Learnings

The Future: Speed at Scale, Safely

8. Introducing SRE in Large Enterprises

Background Introducing SRE

Defining Current State

Start by defining the roles and responsibilities of traditional functions in the organization to understand the landscape
Prepare the business case: personalize and evaluate the cost of having engineering resources responsible for reliability
Prepare the business case: calculate cost of similar resources doing duplicate work
To establish a roadmap for what products SRE will be responsible for, survey the current infrastructure landscape

Identifying and Educating Stakeholders

Start having conversations with leaders and champions in the organization Defining SRE

Presenting the Business Case Implementing the SRE Team

Setting goals and defining metrics of success Growing the team: insource or outsource? Insourcing experienced talent: rotating engineering team members SRE throughout the development cycle Defining the role of supporting divisions

Lessons Learned Sample Implementation Roadmap

Closing Thoughts Further Reading

9. From SysAdmin to SRE in 8,963 Words

Clarifying Terminology

Service-Level Indicator SLA Service-Level Objective

Establishing SLAs for Internal Components Understanding External Dependencies Nontechnical Solutions Tracking Availability Level Dealing with Corner Cases Conclusion

10. Clearing the Way for SRE in the Enterprise

Toil, the Enemy of SRE Toil in the Enterprise Silos, Queues, and Tickets

Silos Get in the Way Ticket-Driven Request Queues Are Expensive

Take Action Now Start by Leaning on Lean Get Rid of as Many Handoffs as Possible Replace Remaining Handoffs with Self-Service

Self-Service Is More Than a Button Self-Service Helps SREs in Multiple Ways Operations as a Service

Error Budgets, Toil Limits, and Other Tools for Empowering Humans

Error Budgets Toil Limits Leverage Existing Enthusiasm for DevOps Unify Backlogs and Protect Capacity Psychological Safety and Human Factors

Join the Movement

11. SRE Patterns Loved by DevOps People Everywhere

Pattern 1: Birth of Automated Testing at Google Pattern 2: Launch and Handoff Readiness Review at Google Pattern 3: Create a Shared Source Code Repository Conclusion Further Reading and Source Material

12. DevOps and SRE: Voices from the Community

Background Method Results Replies

13. Production Engineering at Facebook

II. Near Edge SRE

14. In the Beginning, There Was Chaos

The Problem with Systems Economic Pillars of Complexity Beginning Chaos Navigating Complexity for Safety Chaos Goes Big Formalization Advanced Principles Frequently Asked Questions Conclusion

15. The Intersection of Reliability and Privacy

The Intersection of Reliability and Privacy The General Landscape of Privacy Engineering Privacy and SRE: Common Approaches

Reducing Toil

Automation Default behavior for shared architectures Frameworks

Efficient and Deliberate Problem Solving

Solve challenges once Find and address root causes

Relationship Management Early Intervention and Education Through Evangelism

Nuances, Differences, and Trade-Offs Conclusion Further Reading

16. Database Reliability Engineering

Guiding Principles of the Database Reliability Engineer

Protect the Data Self-Service for Scale Databases Are Not Special

A Culture of Database Reliability Engineering Recoverability

Considerations for Recovery Anatomy of a Recovery Strategy Building Block 1: Detection

User error Application errors Infrastructure services Operating system and hardware errors

Building Block 2: Diverse Storage

Online, high-performance storage Online, low-performance storage Offline storage Object storage

Building Block 3: A Varied Toolbox

Full physical backups Incremental physical backups Full and incremental logical backups Object stores

Building Block 4: Testing Championing Recovery Reliability

Continuous Delivery: From Development to Production

Education and Collaboration

Architecture Data model Best practices and standards Tools

Collaboration Deployment

Migrations and Versioning Impact Analysis Migration Patterns

Migration testing Rollback testing

Championing CD

Making the Case for DBRE Further Reading

17. Engineering for Data Durability

Replication Is Table Stakes

Backups

Restoration Freshness

Replication

Estimating durability

Real-World Durability

Isolation

Physical isolation Logical isolation Operational isolation

Protection

Testing Safeguards Recovery

Verification

The Power of Zero Verification Coverage

Disk Scrubber Index Scanner Storage Watcher

Watching the Watchers

Automation

Window of Vulnerability Operator Fatigue Reliability

Conclusion

18. Introduction to Machine Learning for SRE

Why Use Machine Learning for SRE? Why and How Should My Company Be Engaging in This?

Some SRE Problems Machine Learning Can Help Solve

The Awakening of Applied AI What Is Machine Learning?

What Do We Mean by Learning? From Chess to Go: How Deep Can We Dive? Why Now? What Changed for Us?

What Are Neural Networks?

Neurons and Neural Networks How and When Should We Apply Neural Networks? What Kinds of Data Can We Use?

Practical Machine Learning

Popular Libraries for Neural Networks Practical Machine Learning Examples

Installing Python, IPython, and Jupyter Notebook Decision trees A neural network from scratch Using TensorFlow and TensorBoard Time series: server requests waiting

Success Stories Further Reading

My GitHub Repository Recommended Books

III. SRE Best Practices and Technologies

19. Do Docs Better: Integrating Documentation into the Engineering Workflow

Defining Quality: What Do Good Docs Look Like?

Functional Requirements for SRE Documentation

Service overviews Playbooks Postmortems Policies SLAs Defining success metrics

Integrating Docs into the Engineering Workflow

The Google Experience: g3doc and EngPlay What We Learned

Where possible, documentation should live in source control, alongside its associated code
Pick the simplest markup language that supports your needs
Integrations are key to adoption

Doing Docs Better: Best Practices

Create Templates for Each Documentation Type Better > Best: Set Realistic Standards for Quality Require Docs as Part of Code Review Ruthlessly Prune Your Docs Recognize and Reward Documentation

Communicating the Value of Documentation Further Reading

20. Active Teaching and Learning

Active Learning

Active Learning Example: Wheel of Misfortune Active Learning Example: Incident Manager (a Card Game) Active Learning Example: SRE Classroom

The Costs of Failing to Learn Learning Habits of Effective SRE Teams

Production Meetings Postmortems

A Call to Action: Ditch the Boring Slides

21. The Art and Science of the Service-Level Objective

Why Set Goals? Availability

Time Quanta Transactions Transactions over Time Quanta

On Evaluating SLOs Histograms Where Percentiles Fall Down (and Histograms Step Up) Parting Thought: Looking at SLOs Upside Down Further Reading

22. SRE as a Success Culture

Where Did SRE Come From? Key Values for SRE

Keeping the Site Up

Isolated failure domains Redundant systems Graduated degradation

Empowering Teams to “Do the Right Thing” Approaching Operations as an Engineering Problem Achieving Business Success Through Promises (Service Levels)

Progression in Service-Level Execution

Critical Enabling Functions of SRE

Monitoring, Metrics, and KPIs Incident Management and Emergency Response Capacity Planning and Demand Forecasting Performance Analysis and Optimization Provisioning, Change Management, and Velocity

Phases of SRE Execution

Phase 1: Firefighting/Reactive Phase 2: Gatekeepers Phase 3: Advocates/Partners Phase 4: Catalytic Complications of Differing Phases

Focus on the Details of Success Further Reading

23. SRE Antipatterns

Antipattern 1: Site Reliability Operations Antipattern 2: Humans Staring at Screens Antipattern 3: Mob Incident Response Antipattern 4: Root Cause = Human Error Antipattern 5: Passing the Pager Antipattern 6: Magic Smoke Jumping! Antipattern 7: Alert Reliability Engineering Antipattern 8: Hiring a Dog-Walker to Tend Your Pets Antipattern 9: Speed-Bump Engineering Antipattern 10: Design Chokepoints Antipattern 11: Too Much Stick, Not Enough Carrot Antipattern 12: Postponing Production Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR) Antipattern 14: Dependency Hell Antipattern 15: Ungainly Governance Antipattern 16: Ill-Considered SLOh-Ohs Antipattern 17: Tossing Your API Over the Firewall Antipattern 18: Fixing the Ops Team So, That’s It, Then?

24. Immutable Infrastructure and SRE

Scalability, Reliability, and Performance Failure Recovery Simpler Operations Faster Startup Times Known State Continuous Integration/Continuous Deployment with Confidence Security Multiregion Operations Release Engineering Building the Base Image Deploying Applications Disadvantages Conclusion

25. Scriptable Load Balancers

Scriptable Load Balancers: The New Kid on the Block

Why Scriptable Load Balancers?

Making the Difficult Easy

Shard-Aware Routing

Routing requests with DNS Routing queries in the application Routing requests in the application Routing requests with a scriptable load balancer

Harnessing Potential Case Study: Intermission

Service-Level Middleware

Middleware to the Rescue APIs of Service-Level Middleware Case Study: WAF/Bot Mitigation

Avoiding Disaster

Getting Clever with State Case Study: Checkout Queue

Looking to the Future and Further Reading

26. The Service Mesh: Wrangler of Your Microservices?

Ready to Get Rid of the Monolith? Current State of Microservice Networking Service Mesh to the Rescue

The Benefits of a Sidecar Proxy Eventually Consistent Service Discovery Observability and Alarming Sidecar Performance Implications Thin Libraries and Context Propagation Configuration Management (Control Plane Versus Data Plane)

The Service Mesh in Practice

The Origin and Development of Envoy at Lyft Operating Envoy at Lyft

Operational learnings Development learnings Technical learnings

The Future of the Service Mesh Further Reading

IV. The Human Side of SRE

27. Psychological Safety in SRE

The Primary Indicator of a Successful Team

How to Build Psychological Safety into Your Own Team

Make respect part of your team’s culture
Make space for people to take chances
Make it obvious when your team is doing well
Make your communication clear and your expectations explicit
Make your team feel safe
Why are operations teams more likely to feel unsafe than other engineering teams?

We love interrupts and the torrents of information
On-call and operations
Cognitive overload
Imaginary expectations
Operations teams are bad at estimating their level of psychological safety