Index
Foreword
Preface
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
I. Introduction
1. Introduction
The Sysadmin Approach to Service Management
Google’s Approach to Service Management: Site Reliability Engineering
Tenets of SRE
Ensuring a Durable Focus on Engineering
Pursuing Maximum Change Velocity Without Violating a Service’s SLO
Monitoring
Emergency Response
Change Management
Demand Forecasting and Capacity Planning
Provisioning
Efficiency and Performance
The End of the Beginning
2. The Production Environment at Google, from the Viewpoint of an SRE
Hardware
System Software That “Organizes” the Hardware
Managing Machines
Storage
Networking
Other System Software
Lock Service
Monitoring and Alerting
Our Software Infrastructure
Our Development Environment
Shakespeare: A Sample Service
Life of a Request
Job and Data Organization
II. Principles
3. Embracing Risk
Managing Risk
Measuring Service Risk
Risk Tolerance of Services
Identifying the Risk Tolerance of Consumer Services
Target level of availability
Types of failures
Cost
Other service metrics
Identifying the Risk Tolerance of Infrastructure Services
Target level of availability
Types of failures
Cost
Example: Frontend infrastructure
Motivation for Error Budgets
Forming Your Error Budget
Benefits
4. Service Level Objectives
Service Level Terminology
Indicators
Objectives
Agreements
Indicators in Practice
What Do You and Your Users Care About?
Collecting Indicators
Aggregation
Standardize Indicators
Objectives in Practice
Defining Objectives
Choosing Targets
Control Measures
SLOs Set Expectations
Agreements in Practice
5. Eliminating Toil
Toil Defined
Why Less Toil Is Better
What Qualifies as Engineering?
Is Toil Always Bad?
Conclusion
6. Monitoring Distributed Systems
Definitions
Why Monitor?
Setting Reasonable Expectations for Monitoring
Symptoms Versus Causes
Black-Box Versus White-Box
The Four Golden Signals
Worrying About Your Tail (or, Instrumentation and Performance)
Choosing an Appropriate Resolution for Measurements
As Simple as Possible, No Simpler
Tying These Principles Together
Monitoring for the Long Term
Bigtable SRE: A Tale of Over-Alerting
Gmail: Predictable, Scriptable Responses from Humans
The Long Run
Conclusion
7. The Evolution of Automation at Google
The Value of Automation
Consistency
A Platform
Faster Repairs
Faster Action
Time Saving
The Value for Google SRE
The Use Cases for Automation
Google SRE’s Use Cases for Automation
A Hierarchy of Automation Classes
Automate Yourself Out of a Job: Automate ALL the Things!
Soothing the Pain: Applying Automation to Cluster Turnups
Detecting Inconsistencies with Prodtest
Resolving Inconsistencies Idempotently
The Inclination to Specialize
Service-Oriented Cluster-Turnup
Borg: Birth of the Warehouse-Scale Computer
Reliability Is the Fundamental Feature
Recommendations
8. Release Engineering
The Role of a Release Engineer
Philosophy
Self-Service Model
High Velocity
Hermetic Builds
Enforcement of Policies and Procedures
Continuous Build and Deployment
Building
Branching
Testing
Packaging
Rapid Deployment
Configuration Management
Conclusions
It’s Not Just for Googlers
Start Release Engineering at the Beginning
9. Simplicity
System Stability Versus Agility
The Virtue of Boring
I Won’t Give Up My Code!
The “Negative Lines of Code” Metric
Minimal APIs
Modularity
Release Simplicity
A Simple Conclusion
III. Practices
10. Practical Alerting from Time-Series Data
The Rise of Borgmon
Instrumentation of Applications
Collection of Exported Data
Storage in the Time-Series Arena
Labels and Vectors
Rule Evaluation
Alerting
Sharding the Monitoring Topology
Black-Box Monitoring
Maintaining the Configuration
Ten Years On…
11. Being On-Call
Introduction
Life of an On-Call Engineer
Balanced On-Call
Balance in Quantity
Balance in Quality
Compensation
Feeling Safe
Avoiding Inappropriate Operational Load
Operational Overload
A Treacherous Enemy: Operational Underload
Conclusions
12. Effective Troubleshooting
Theory
In Practice
Problem Report
Triage
Examine
Diagnose
Simplify and reduce
Ask “what,” “where,” and “why”
What touched it last
Specific diagnoses
Test and Treat
Negative Results Are Magic
Cure
Case Study
Making Troubleshooting Easier
Conclusion
13. Emergency Response
What to Do When Systems Break
Test-Induced Emergency
Details
Response
Findings
What went well
What we learned
Change-Induced Emergency
Details
Response
Findings
What went well
What we learned
Process-Induced Emergency
Details
Response
Findings
What went well
What we learned
All Problems Have Solutions
Learn from the Past. Don’t Repeat It.
Keep a History of Outages
Ask the Big, Even Improbable, Questions: What If…?
Encourage Proactive Testing
Conclusion
14. Managing Incidents
Unmanaged Incidents
The Anatomy of an Unmanaged Incident
Sharp Focus on the Technical Problem
Poor Communication
Freelancing
Elements of Incident Management Process
Recursive Separation of Responsibilities
A Recognized Command Post
Live Incident State Document
Clear, Live Handoff
A Managed Incident
When to Declare an Incident
In Summary
15. Postmortem Culture: Learning from Failure
Google’s Postmortem Philosophy
Collaborate and Share Knowledge
Introducing a Postmortem Culture
Conclusion and Ongoing Improvements
16. Tracking Outages
Escalator
Outalator
Aggregation
Tagging
Analysis
Reporting and communication
Unexpected Benefits
17. Testing for Reliability
Types of Software Testing
Traditional Tests
Unit tests
Integration tests
System tests
Production Tests
Configuration test
Stress test
Canary test
Creating a Test and Build Environment
Testing at Scale
Testing Scalable Tools
Testing Disaster
The Need for Speed
Pushing to Production
Expect Testing Fail
Integration
Production Probes
Conclusion
18. Software Engineering in SRE
Why Is Software Engineering Within SRE Important?
Auxon Case Study: Project Background and Problem Space
Traditional Capacity Planning
Brittle by nature
Laborious and imprecise
Our Solution: Intent-Based Capacity Planning
Intent-Based Capacity Planning
Precursors to Intent
Dependencies
Performance metrics
Prioritization
Introduction to Auxon
Requirements and Implementation: Successes and Lessons Learned
Approximation
Raising Awareness and Driving Adoption
Set expectations
Identify appropriate customers
Customer service
Designing at the right level
Team Dynamics
Fostering Software Engineering in SRE
Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time
Getting There
Conclusions
19. Load Balancing at the Frontend
Power Isn’t the Answer
Load Balancing Using DNS
Load Balancing at the Virtual IP Address
20. Load Balancing in the Datacenter
The Ideal Case
Identifying Bad Tasks: Flow Control and Lame Ducks
A Simple Approach to Unhealthy Tasks: Flow Control
A Robust Approach to Unhealthy Tasks: Lame Duck State
Limiting the Connections Pool with Subsetting
Picking the Right Subset
A Subset Selection Algorithm: Random Subsetting
A Subset Selection Algorithm: Deterministic Subsetting
Load Balancing Policies
Simple Round Robin
Small subsetting
Varying query costs
Machine diversity
Unpredictable performance factors
Least-Loaded Round Robin
Weighted Round Robin
21. Handling Overload
The Pitfalls of “Queries per Second”
Per-Customer Limits
Client-Side Throttling
Criticality
Utilization Signals
Handling Overload Errors
Deciding to Retry
Load from Connections
Conclusions
22. Addressing Cascading Failures
Causes of Cascading Failures and Designing to Avoid Them
Server Overload
Resource Exhaustion
CPU
Memory
Threads
File descriptors
Dependencies among resources
Service Unavailability
Preventing Server Overload
Queue Management
Load Shedding and Graceful Degradation
Retries
Latency and Deadlines
Picking a deadline
Missing deadlines
Deadline propagation
Bimodal latency
Slow Startup and Cold Caching
Always Go Downward in the Stack
Triggering Conditions for Cascading Failures
Process Death
Process Updates
New Rollouts
Organic Growth
Planned Changes, Drains, or Turndowns
Request profile changes
Resource limits
Testing for Cascading Failures
Test Until Failure and Beyond
Test Popular Clients
Test Noncritical Backends
Immediate Steps to Address Cascading Failures
Increase Resources
Stop Health Check Failures/Deaths
Restart Servers
Drop Traffic
Enter Degraded Modes
Eliminate Batch Load
Eliminate Bad Traffic
Closing Remarks
23. Managing Critical State: Distributed Consensus for Reliability
Motivating the Use of Consensus: Distributed Systems Coordination Failure
Case Study 1: The Split-Brain Problem
Case Study 2: Failover Requires Human Intervention
Case Study 3: Faulty Group-Membership Algorithms
How Distributed Consensus Works
Paxos Overview: An Example Protocol
System Architecture Patterns for Distributed Consensus
Reliable Replicated State Machines
Reliable Replicated Datastores and Configuration Stores
Highly Available Processing Using Leader Election
Distributed Coordination and Locking Services
Reliable Distributed Queuing and Messaging
Distributed Consensus Performance
Multi-Paxos: Detailed Message Flow
Scaling Read-Heavy Workloads
Quorum Leases
Distributed Consensus Performance and Network Latency
Reasoning About Performance: Fast Paxos
Stable Leaders
Batching
Disk Access
Deploying Distributed Consensus-Based Systems
Number of Replicas
Location of Replicas
Capacity and Load Balancing
Quorum composition
Monitoring Distributed Consensus Systems
Conclusion
24. Distributed Periodic Scheduling with Cron
Cron
Introduction
Reliability Perspective
Cron Jobs and Idempotency
Cron at Large Scale
Extended Infrastructure
Extended Requirements
Building Cron at Google
Tracking the State of Cron Jobs
The Use of Paxos
The Roles of the Leader and the Follower
The leader
The follower
Resolving partial failures
Storing the State
Running Large Cron
Summary
25. Data Processing Pipelines
Origin of the Pipeline Design Pattern
Initial Effect of Big Data on the Simple Pipeline Pattern
Challenges with the Periodic Pipeline Pattern
Trouble Caused By Uneven Work Distribution
Drawbacks of Periodic Pipelines in Distributed Environments
Monitoring Problems in Periodic Pipelines
“Thundering Herd” Problems
Moiré Load Pattern
Introduction to Google Workflow
Workflow as Model-View-Controller Pattern
Stages of Execution in Workflow
Workflow Correctness Guarantees
Ensuring Business Continuity
Summary and Concluding Remarks
26. Data Integrity: What You Read Is What You Wrote
Data Integrity’s Strict Requirements
Choosing a Strategy for Superior Data Integrity
Backups Versus Archives
Requirements of the Cloud Environment in Perspective
Google SRE Objectives in Maintaining Data Integrity and Availability
Data Integrity Is the Means; Data Availability Is the Goal
Delivering a Recovery System, Rather Than a Backup System
Types of Failures That Lead to Data Loss
Challenges of Maintaining Data Integrity Deep and Wide
Scaling issues: Fulls, incrementals, and the competing forces of backups and restores
Retention
How Google SRE Faces the Challenges of Data Integrity
The 24 Combinations of Data Integrity Failure Modes
First Layer: Soft Deletion
Second Layer: Backups and Their Related Recovery Methods
Overarching Layer: Replication
1T Versus 1E: Not “Just” a Bigger Backup
Third Layer: Early Detection
Challenges faced by cloud developers
Out-of-band data validation
Knowing That Data Recovery Will Work
Case Studies
Gmail—February, 2011: Restore from GTape
Sunday, February 27, 2011, late in the evening
Google Music—March 2012: Runaway Deletion Detection
Tuesday, March 6th, 2012, mid-afternoon
Discovering the problem
Assessing the damage
Resolving the issue
Parallel bug identification and recovery efforts
First wave of recovery
Second wave of recovery
Addressing the root cause
General Principles of SRE as Applied to Data Integrity
Beginner’s Mind
Trust but Verify
Hope Is Not a Strategy
Defense in Depth
Conclusion
27. Reliable Product Launches at Scale
Launch Coordination Engineering
The Role of the Launch Coordination Engineer
Setting Up a Launch Process
The Launch Checklist
Driving Convergence and Simplification
Launching the Unexpected
Developing a Launch Checklist
Architecture and Dependencies
Example checklist questions
Example action items
Integration
Example action items
Capacity Planning
Example checklist questions
Failure Modes
Example checklist questions
Example action items
Client Behavior
Example checklist question
Example action items
Processes and Automation
Example checklist question
Example action items
Development Process
Example action items
External Dependencies
Example checklist questions
Rollout Planning
Example action items
Selected Techniques for Reliable Launches
Gradual and Staged Rollouts
Feature Flag Frameworks
Dealing with Abusive Client Behavior
Overload Behavior and Load Tests
Development of LCE
Evolution of the LCE Checklist
Problems LCE Didn’t Solve
Scalability changes
Growing operational load
Infrastructure churn
Conclusion
IV. Management
28. Accelerating SREs to On-Call and Beyond
You’ve Hired Your Next SRE(s), Now What?
Initial Learning Experiences: The Case for Structure Over Chaos
Learning Paths That Are Cumulative and Orderly
Targeted Project Work, Not Menial Work
Creating Stellar Reverse Engineers and Improvisational Thinkers
Reverse Engineers: Figuring Out How Things Work
Statistical and Comparative Thinkers: Stewards of the Scientific Method Under Pressure
Improv Artists: When the Unexpected Happens
Tying This Together: Reverse Engineering a Production Service
Five Practices for Aspiring On-Callers
A Hunger for Failure: Reading and Sharing Postmortems
Disaster Role Playing
Break Real Things, Fix Real Things
Documentation as Apprenticeship
Shadow On-Call Early and Often
On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
Closing Thoughts
29. Dealing with Interrupts
Managing Operational Load
Factors in Determining How Interrupts Are Handled
Imperfect Machines
Cognitive Flow State
Cognitive flow state: Creative and engaged
Cognitive flow state: Angry Birds
Do One Thing Well
Distractibility
Polarizing time
Seriously, Tell Me What to Do
General suggestions
On-call
Tickets
Ongoing responsibilities
Be on interrupts, or don’t be
Reducing Interrupts
Actually analyze tickets
Respect yourself, as well as your customers
30. Embedding an SRE to Recover from Operational Overload
Phase 1: Learn the Service and Get Context
Identify the Largest Sources of Stress
Identify Kindling
Phase 2: Sharing Context
Write a Good Postmortem for the Team
Sort Fires According to Type
Phase 3: Driving Change
Start with the Basics
Get Help Clearing Kindling
Explain Your Reasoning
Ask Leading Questions
Conclusion
31. Communication and Collaboration in SRE
Communications: Production Meetings
Agenda
Attendance
Collaboration within SRE
Team Composition
Techniques for Working Effectively
Case Study of Collaboration in SRE: Viceroy
The Coming of the Viceroy
Challenges
Recommendations
Collaboration Outside SRE
Case Study: Migrating DFP to F1
Conclusion
32. The Evolving SRE Engagement Model
SRE Engagement: What, How, and Why
The PRR Model
The SRE Engagement Model
Alternative Support
Documentation
Consultation
Production Readiness Reviews: Simple PRR Model
Engagement
Analysis
Improvements and Refactoring
Training
Onboarding
Continuous Improvement
Evolving the Simple PRR Model: Early Engagement
Candidates for Early Engagement
Benefits of the Early Engagement Model
Design phase
Build and implementation
Launch
Post-launch
Disengaging from a service
Evolving Services Development: Frameworks and SRE Platform
Lessons Learned
External Factors Affecting SRE
Toward a Structural Solution: Frameworks
New Service and Management Benefits
Significantly lower operational overhead
Universal support by design
Faster, lower overhead engagements
A new engagement model based on shared responsibility
Conclusion
V. Conclusions
33. Lessons Learned from Other Industries
Meet Our Industry Veterans
Preparedness and Disaster Testing
Relentless Organizational Focus on Safety
Attention to Detail
Swing Capacity
Simulations and Live Drills
Training and Certification
Focus on Detailed Requirements Gathering and Design
Defense in Depth and Breadth
Postmortem Culture
Automating Away Repetitive Work and Operational Overhead
Structured and Rational Decision Making
Conclusions
34. Conclusion
A. Availability Table
B. A Collection of Best Practices for Production Services
Fail Sanely
Progressive Rollouts
Define SLOs Like a User
Error Budgets
Monitoring
Postmortems
Capacity Planning
Overloads and Failure
SRE Teams
C. Example Incident State Document
D. Example Postmortem
Lessons Learned
What went well
What went wrong
Where we got lucky
Timeline
Supporting information
E. Launch Coordination Checklist
F. Example Production Meeting Minutes
Bibliography
Index