Imperial Library

Index
Introduction
And So It Begins... Origin Story Voices Forward in All Directions! Acknowledgments
I. SRE Implementation
1. Context Versus Control in SRE
2. Interviewing Site Reliability Engineers
Interviewing 101
Who Is Involved Industry Versus University Biases The Funnel
SRE Funnels
Phone Screens
Conducting a phone screen
The Onsite Interview
Coding and system questions Deep dives and architecture questions Cultural interviews
Take-Home Questions Advice for Hiring Managers
Selling candidates Walking away
Final Thoughts on Interviewing SREs Further Reading
3. So, You Want to Build an SRE Team?
Choose SRE for the Right Reasons Orienting to a Data-Driven Approach Commitment to SRE Making a Decision About SRE
4. Using Incident Metrics to Improve SRE at Scale
The Virtuous Cycle to the Rescue: If You Don’t Measure It… Metrics Review: If a Metric Falls in the Forest… Surrogate Metrics Repair Debt Virtual Repair Debt: Exorcising the Ghost in the Machine Real-Time Dashboards: The Bread and Butter of SRE Learnings: TL;DR Further Reading
5. Working with Third Parties Shouldn’t Suck
Build, Buy, or Adopt?
Establish Importance Identify Stakeholders Make a Decision Acknowledge Reality
Is this a core competency? Integration timeline? Project Operating Expense and Abandonment Expense
Third Parties as First-Class Citizens
When They’re Down, You’re Down
Direct impact Indirect impact
Running the Black Box Like a Service Service-Level Indicators, Service-Level Objectives, and SLAs
SLIs on black boxes
Polling API informs SLIs Real-time data informs SLIs Synthetic monitoring informs SLIs RUM informs SLIs
SLOs
Negotiating SLAs with vendors
Playbook: From Staging to Production
Testing and staging Monitoring
Uses for synthetic monitoring Uses for RUM
Tooling Automation Logging Disaster planning Communication Decommissioning
Closing Thoughts
6. How to Apply SRE Principles Without Dedicated SRE Teams
SREs to the Rescue! (and How They Failed)
A Matter of Scale in Terms of Headcount The Embedded SRE
You Build It, You Run It
The Deployment Platform Closing the Loop: Take Your Own Pager Introducing Production Engineering
Some Implementation Details
Developers’ Productivity and Health Versus the Pager Resolving Cross-Team Reliability Issues by Using Postmortems Uniform Infrastructure and Tooling Versus Autonomy and Innovation Getting Buy-In
Conclusion Further Reading
7. SRE Without SRE: The Spotify Case Study
Tabula Rasa: 2006–2007
Prelude Key Learnings
Beta and Release: 2008–2009
Prelude Bringing Scalability and Reliability to the Forefront Key Learnings
The Curse of Success: 2010
Prelude A New Ownership Model
The dev owner role The ops owner role
Formalizing Core Services Blessed Deployment Time Slots On-Call and Alerting
Not completely pain-free
Spawning Off Internal Office Support Addressing the Remaining Top Concerns
Long lead times Unintentional specialization and misalignment Interruptions Introducing the goalie role
Creating Detectives Key Learnings
Pets and Cattle, and Agile: 2011
Prelude Forming Bad Habits Breaking Those Bad Habits Key Learnings
A System That Didn’t Scale: 2012
Prelude Manual Work Hits a Cliff Key Learnings
Introducing Ops-in-Squads: 2013–2015
Prelude
Lightening the manual load
Building on Trust Driving the Paradigm Shift Key Learnings
Autonomy Versus Consistency: 2015–2017
Prelude Benefits Trade-Offs Key Learnings
The Future: Speed at Scale, Safely
8. Introducing SRE in Large Enterprises
Background Introducing SRE
Defining Current State
Start by defining the roles and responsibilities of traditional functions in the organization to understand the landscape
Prepare the business case: personalize and evaluate the cost of having engineering resources responsible for reliability
Prepare the business case: calculate cost of similar resources doing duplicate work
To establish a roadmap for what products SRE will be responsible for, survey the current infrastructure landscape
Identifying and Educating Stakeholders
Start having conversations with leaders and champions in the organization Defining SRE
Presenting the Business Case Implementing the SRE Team
Setting goals and defining metrics of success Growing the team: insource or outsource? Insourcing experienced talent: rotating engineering team members SRE throughout the development cycle Defining the role of supporting divisions
Lessons Learned Sample Implementation Roadmap
Closing Thoughts Further Reading
9. From SysAdmin to SRE in 8,963 Words
Clarifying Terminology
Service-Level Indicator SLA Service-Level Objective
Establishing SLAs for Internal Components Understanding External Dependencies Nontechnical Solutions Tracking Availability Level Dealing with Corner Cases Conclusion
10. Clearing the Way for SRE in the Enterprise
Toil, the Enemy of SRE Toil in the Enterprise Silos, Queues, and Tickets
Silos Get in the Way Ticket-Driven Request Queues Are Expensive
Take Action Now Start by Leaning on Lean Get Rid of as Many Handoffs as Possible Replace Remaining Handoffs with Self-Service
Self-Service Is More Than a Button Self-Service Helps SREs in Multiple Ways Operations as a Service
Error Budgets, Toil Limits, and Other Tools for Empowering Humans
Error Budgets Toil Limits Leverage Existing Enthusiasm for DevOps Unify Backlogs and Protect Capacity Psychological Safety and Human Factors
Join the Movement
11. SRE Patterns Loved by DevOps People Everywhere
Pattern 1: Birth of Automated Testing at Google Pattern 2: Launch and Handoff Readiness Review at Google Pattern 3: Create a Shared Source Code Repository Conclusion Further Reading and Source Material
12. DevOps and SRE: Voices from the Community
Background Method Results Replies
13. Production Engineering at Facebook
II. Near Edge SRE
14. In the Beginning, There Was Chaos
The Problem with Systems Economic Pillars of Complexity Beginning Chaos Navigating Complexity for Safety Chaos Goes Big Formalization Advanced Principles Frequently Asked Questions Conclusion
15. The Intersection of Reliability and Privacy
The Intersection of Reliability and Privacy The General Landscape of Privacy Engineering Privacy and SRE: Common Approaches
Reducing Toil
Automation Default behavior for shared architectures Frameworks
Efficient and Deliberate Problem Solving
Solve challenges once Find and address root causes
Relationship Management Early Intervention and Education Through Evangelism
Nuances, Differences, and Trade-Offs Conclusion Further Reading
16. Database Reliability Engineering
Guiding Principles of the Database Reliability Engineer
Protect the Data Self-Service for Scale Databases Are Not Special
A Culture of Database Reliability Engineering Recoverability
Considerations for Recovery Anatomy of a Recovery Strategy Building Block 1: Detection
User error Application errors Infrastructure services Operating system and hardware errors
Building Block 2: Diverse Storage
Online, high-performance storage Online, low-performance storage Offline storage Object storage
Building Block 3: A Varied Toolbox
Full physical backups Incremental physical backups Full and incremental logical backups Object stores
Building Block 4: Testing Championing Recovery Reliability
Continuous Delivery: From Development to Production
Education and Collaboration
Architecture Data model Best practices and standards Tools
Collaboration Deployment
Migrations and Versioning Impact Analysis Migration Patterns
Migration testing Rollback testing
Championing CD
Making the Case for DBRE Further Reading
17. Engineering for Data Durability
Replication Is Table Stakes
Backups
Restoration Freshness
Replication
Estimating durability
Real-World Durability
Isolation
Physical isolation Logical isolation Operational isolation
Protection
Testing Safeguards Recovery
Verification
The Power of Zero Verification Coverage
Disk Scrubber Index Scanner Storage Watcher
Watching the Watchers
Automation
Window of Vulnerability Operator Fatigue Reliability
Conclusion
18. Introduction to Machine Learning for SRE
Why Use Machine Learning for SRE? Why and How Should My Company Be Engaging in This?
Some SRE Problems Machine Learning Can Help Solve
The Awakening of Applied AI What Is Machine Learning?
What Do We Mean by Learning? From Chess to Go: How Deep Can We Dive? Why Now? What Changed for Us?
What Are Neural Networks?
Neurons and Neural Networks How and When Should We Apply Neural Networks? What Kinds of Data Can We Use?
Practical Machine Learning
Popular Libraries for Neural Networks Practical Machine Learning Examples
Installing Python, IPython, and Jupyter Notebook Decision trees A neural network from scratch Using TensorFlow and TensorBoard Time series: server requests waiting
Success Stories Further Reading
My GitHub Repository Recommended Books
III. SRE Best Practices and Technologies
19. Do Docs Better: Integrating Documentation into the Engineering Workflow
Defining Quality: What Do Good Docs Look Like?
Functional Requirements for SRE Documentation
Service overviews Playbooks Postmortems Policies SLAs Defining success metrics
Integrating Docs into the Engineering Workflow
The Google Experience: g3doc and EngPlay What We Learned
Where possible, documentation should live in source control, alongside its associated code
Pick the simplest markup language that supports your needs
Integrations are key to adoption
Doing Docs Better: Best Practices
Create Templates for Each Documentation Type Better > Best: Set Realistic Standards for Quality Require Docs as Part of Code Review Ruthlessly Prune Your Docs Recognize and Reward Documentation
Communicating the Value of Documentation Further Reading
20. Active Teaching and Learning
Active Learning
Active Learning Example: Wheel of Misfortune Active Learning Example: Incident Manager (a Card Game) Active Learning Example: SRE Classroom
The Costs of Failing to Learn Learning Habits of Effective SRE Teams
Production Meetings Postmortems
A Call to Action: Ditch the Boring Slides
21. The Art and Science of the Service-Level Objective
Why Set Goals? Availability
Time Quanta Transactions Transactions over Time Quanta
On Evaluating SLOs Histograms Where Percentiles Fall Down (and Histograms Step Up) Parting Thought: Looking at SLOs Upside Down Further Reading
22. SRE as a Success Culture
Where Did SRE Come From? Key Values for SRE
Keeping the Site Up
Isolated failure domains Redundant systems Graduated degradation
Empowering Teams to “Do the Right Thing” Approaching Operations as an Engineering Problem Achieving Business Success Through Promises (Service Levels)
Progression in Service-Level Execution
Critical Enabling Functions of SRE
Monitoring, Metrics, and KPIs Incident Management and Emergency Response Capacity Planning and Demand Forecasting Performance Analysis and Optimization Provisioning, Change Management, and Velocity
Phases of SRE Execution
Phase 1: Firefighting/Reactive Phase 2: Gatekeepers Phase 3: Advocates/Partners Phase 4: Catalytic Complications of Differing Phases
Focus on the Details of Success Further Reading
23. SRE Antipatterns
Antipattern 1: Site Reliability Operations Antipattern 2: Humans Staring at Screens Antipattern 3: Mob Incident Response Antipattern 4: Root Cause = Human Error Antipattern 5: Passing the Pager Antipattern 6: Magic Smoke Jumping! Antipattern 7: Alert Reliability Engineering Antipattern 8: Hiring a Dog-Walker to Tend Your Pets Antipattern 9: Speed-Bump Engineering Antipattern 10: Design Chokepoints Antipattern 11: Too Much Stick, Not Enough Carrot Antipattern 12: Postponing Production Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR) Antipattern 14: Dependency Hell Antipattern 15: Ungainly Governance Antipattern 16: Ill-Considered SLOh-Ohs Antipattern 17: Tossing Your API Over the Firewall Antipattern 18: Fixing the Ops Team So, That’s It, Then?
24. Immutable Infrastructure and SRE
Scalability, Reliability, and Performance Failure Recovery Simpler Operations Faster Startup Times Known State Continuous Integration/Continuous Deployment with Confidence Security Multiregion Operations Release Engineering Building the Base Image Deploying Applications Disadvantages Conclusion
25. Scriptable Load Balancers
Scriptable Load Balancers: The New Kid on the Block
Why Scriptable Load Balancers?
Making the Difficult Easy
Shard-Aware Routing
Routing requests with DNS Routing queries in the application Routing requests in the application Routing requests with a scriptable load balancer
Harnessing Potential Case Study: Intermission
Service-Level Middleware
Middleware to the Rescue APIs of Service-Level Middleware Case Study: WAF/Bot Mitigation
Avoiding Disaster
Getting Clever with State Case Study: Checkout Queue
Looking to the Future and Further Reading
26. The Service Mesh: Wrangler of Your Microservices?
Ready to Get Rid of the Monolith? Current State of Microservice Networking Service Mesh to the Rescue
The Benefits of a Sidecar Proxy Eventually Consistent Service Discovery Observability and Alarming Sidecar Performance Implications Thin Libraries and Context Propagation Configuration Management (Control Plane Versus Data Plane)
The Service Mesh in Practice
The Origin and Development of Envoy at Lyft Operating Envoy at Lyft
Operational learnings Development learnings Technical learnings
The Future of the Service Mesh Further Reading
IV. The Human Side of SRE
27. Psychological Safety in SRE
The Primary Indicator of a Successful Team
How to Build Psychological Safety into Your Own Team
Make respect part of your team’s culture
Make space for people to take chances
Make it obvious when your team is doing well
Make your communication clear and your expectations explicit
Make your team feel safe
Why are operations teams more likely to feel unsafe than other engineering teams?
We love interrupts and the torrents of information On-call and operations Cognitive overload Imaginary expectations Operations teams are bad at estimating their level of psychological safety
Further Reading
28. SRE Cognitive Work
Introduction What Do SRE People Do? Why Should We Care About Practitioner Cognition?
Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted Human Performance in Modern Complex Systems: The Main Themes
Observations on SRE Cognitive Work Around Incidents
Every Incident Could Have Been Worse Sacrifice Decisions Take Place Under Uncertainty Repairs to Functional Systems Special Knowledge About Complex Systems Managing the Costs of Coordination
Classification schemes Formal role assignments
SREs Are Cognitive Agents Working in a Joint Cognitive System
The Calibration Problem
Mental Models Incidents Trigger Individual Recalibration Incidents Are Opportunities for Collective Recalibration
What Are the Implications of All This?
Incidents Will Continue Incidents Will Impose Costs Incident Patterns Will Change Incidents Point to Specific Calibration Problems and Locations
What Should Happen Next?
Build a Corpus of Cases Focus on Making Automation a Team Player in SRE Work Address the Calibration Problem
What Can You Do? Conclusion References
29. Beyond Burnout
Defining Mental Disorders Mental Disorders Are Missing from the Diversity Conversation Sanity Isn’t a Business Requirement Thoughts and Prayers Aren’t Scalable Full-Stack Inclusivity
Application Interviewing Compensation Benefits Onboarding Working Conditions Job Duties Training Promotion Leaving
Inclusivity for Anyone Helps Everyone Mental Disorder Resources
30. Against On-Call: A Polemic
The Rationale for On-Call
First, Do No Harm Parallels with SRE Differences with SRE Underlying Assumptions Driving On-Call for Engineers On-Call Is Emergency Medicine Instead of Ward Medicine Counterarguments
The Cost to Humans of Doing On-Call
We don’t need another hero
Actual Solutions
Training Prioritization
Accommodations Compensation Flexible schedules Recovery Exclusion backlash
Improving On-the-Job Performance
Cognitive hacks
We Need a Fundamental Change in Approach
Strong-Anti-On-Call Weak-Anti-On-Call A Union of the Two
Conclusion
31. Elegy for Complex Systems
The Computer and Human Systems Cannot Be Separated Decoherence and Cascading Failure Always in a State of Partial Failure Novelty Priority Inversion Nobody Anticipates the Overhead of Coordination Your healthcare.gov Is Out There
To Get Involved
Further Reading
32. Intersections Between Operations and Social Activism
Before, During, After
Creating the Perfect Plan Principles of Organizing
Principles 1 and 2 (interfaces and incident command) Principles 3 and 4 (blameless retrospectives and psychological safety)
Managing Crisis: Responding When Things Break Down
Handling chaos: contrast in responses during the July 8 KKK rally
Preparing for the worst: handling terror at Unite the Right
The corollary to trust is forgiveness
Writing Our Own History: Making Sense of What Went Down
Charlottesville in review: assigning and avoiding blame
Beyond culpability: building capacity instead of assigning blame
The Long Tail: Turning Action into Change
Activism and Change Within a Company
Conclusion
33. Conclusion
Index
