Imperial Library

Index
About This eBook Title Page Copyright Page Contents at a Glance Contents Preface
About This Book Acknowledgments
Part I Design: Building It Part II Operations: Running It Part III Appendices
About the Authors Introduction
Business Objectives Ideal System Architecture Ideal Release Process Ideal Operations
Part I Design: Building It
Chapter 1. Designing in a Distributed World
1.1 Visibility at Scale 1.2 The Importance of Simplicity 1.3 Composition
1.3.1 Load Balancer with Multiple Backend Replicas 1.3.2 Server with Multiple Backends 1.3.3 Server Tree
1.4 Distributed State 1.5 The CAP Principle
1.5.1 Consistency 1.5.2 Availability 1.5.3 Partition Tolerance
1.6 Loosely Coupled Systems 1.7 Speed 1.8 Summary Exercises
Chapter 2. Designing for Operations
2.1 Operational Requirements
2.1.1 Configuration 2.1.2 Startup and Shutdown 2.1.3 Queue Draining 2.1.4 Software Upgrades 2.1.5 Backups and Restores 2.1.6 Redundancy 2.1.7 Replicated Databases 2.1.8 Hot Swaps 2.1.9 Toggles for Individual Features 2.1.10 Graceful Degradation 2.1.11 Access Controls and Rate Limits 2.1.12 Data Import Controls 2.1.13 Monitoring 2.1.14 Auditing 2.1.15 Debug Instrumentation 2.1.16 Exception Collection 2.1.17 Documentation for Operations
2.2 Implementing Design for Operations
2.2.1 Build Features in from the Beginning 2.2.2 Request Features as They Are Identified 2.2.3 Write the Features Yourself 2.2.4 Work with a Third-Party Vendor
2.3 Improving the Model 2.4 Summary Exercises
Chapter 3. Selecting a Service Platform
3.1 Level of Service Abstraction
3.1.1 Infrastructure as a Service 3.1.2 Platform as a Service 3.1.3 Software as a Service
3.2 Type of Machine
3.2.1 Physical Machines 3.2.2 Virtual Machines 3.2.3 Containers
3.3 Level of Resource Sharing
3.3.1 Compliance 3.3.2 Privacy 3.3.3 Cost 3.3.4 Control
3.4 Colocation 3.5 Selection Strategies 3.6 Summary Exercises
Chapter 4. Application Architectures
4.1 Single-Machine Web Server 4.2 Three-Tier Web Service
4.2.1 Load Balancer Types 4.2.2 Load Balancing Methods 4.2.3 Load Balancing with Shared State 4.2.4 User Identity 4.2.5 Scaling
4.3 Four-Tier Web Service
4.3.1 Frontends 4.3.2 Application Servers 4.3.3 Configuration Options
4.4 Reverse Proxy Service 4.5 Cloud-Scale Service
4.5.1 Global Load Balancer 4.5.2 Global Load Balancing Methods 4.5.3 Global Load Balancing with User-Specific Data 4.5.4 Internal Backbone
4.6 Message Bus Architectures
4.6.1 Message Bus Designs 4.6.2 Message Bus Reliability 4.6.3 Example 1: Link-Shortening Site 4.6.4 Example 2: Employee Human Resources Data Updates
4.7 Service-Oriented Architecture
4.7.1 Flexibility 4.7.2 Support 4.7.3 Best Practices
4.8 Summary Exercises
Chapter 5. Design Patterns for Scaling
5.1 General Strategy
5.1.1 Identify Bottlenecks 5.1.2 Reengineer Components 5.1.3 Measure Results 5.1.4 Be Proactive
5.2 Scaling Up 5.3 The AKF Scaling Cube
5.3.1 x: Horizontal Duplication 5.3.2 y: Functional or Service Splits 5.3.3 z: Lookup-Oriented Split 5.3.4 Combinations
5.4 Caching
5.4.1 Cache Effectiveness 5.4.2 Cache Placement 5.4.3 Cache Persistence 5.4.4 Cache Replacement Algorithms 5.4.5 Cache Entry Invalidation 5.4.6 Cache Size
5.5 Data Sharding 5.6 Threading 5.7 Queueing
5.7.1 Benefits 5.7.2 Variations
5.8 Content Delivery Networks 5.9 Summary Exercises
Chapter 6. Design Patterns for Resiliency
6.1 Software Resiliency Beats Hardware Reliability 6.2 Everything Malfunctions Eventually
6.2.1 MTBF in Distributed Systems 6.2.2 The Traditional Approach 6.2.3 The Distributed Computing Approach
6.3 Resiliency through Spare Capacity
6.3.1 How Much Spare Capacity 6.3.2 Load Sharing versus Hot Spares
6.4 Failure Domains 6.5 Software Failures
6.5.1 Software Crashes 6.5.2 Software Hangs 6.5.3 Query of Death
6.6 Physical Failures
6.6.1 Parts and Components 6.6.2 Machines 6.6.3 Load Balancers 6.6.4 Racks 6.6.5 Datacenters
6.7 Overload Failures
6.7.1 Traffic Surges 6.7.2 DoS and DDoS Attacks 6.7.3 Scraping Attacks
6.8 Human Error 6.9 Summary Exercises
Part II Operations: Running It
Chapter 7. Operations in a Distributed World
7.1 Distributed Systems Operations
7.1.1 SRE versus Traditional Enterprise IT 7.1.2 Change versus Stability 7.1.3 Defining SRE 7.1.4 Operations at Scale
7.2 Service Life Cycle
7.2.1 Service Launches 7.2.2 Service Decommissioning
7.3 Organizing Strategy for Operational Teams
7.3.1 Team Member Day Types 7.3.2 Other Strategies
7.4 Virtual Office
7.4.1 Communication Mechanisms 7.4.2 Communication Policies
7.5 Summary Exercises
Chapter 8. DevOps Culture
8.1 What Is DevOps?
8.1.1 The Traditional Approach 8.1.2 The DevOps Approach
8.2 The Three Ways of DevOps
8.2.1 The First Way: Workflow 8.2.2 The Second Way: Improve Feedback 8.2.3 The Third Way: Continual Experimentation and Learning 8.2.4 Small Batches Are Better 8.2.5 Adopting the Strategies
8.3 History of DevOps
8.3.1 Evolution 8.3.2 Site Reliability Engineering
8.4 DevOps Values and Principles
8.4.1 Relationships 8.4.2 Integration 8.4.3 Automation 8.4.4 Continuous Improvement 8.4.5 Common Nontechnical DevOps Practices 8.4.6 Common Technical DevOps Practices 8.4.7 Release Engineering DevOps Practices
8.5 Converting to DevOps
8.5.1 Getting Started 8.5.2 DevOps at the Business Level
8.6 Agile and Continuous Delivery
8.6.1 What Is Agile? 8.6.2 What Is Continuous Delivery?
8.7 Summary Exercises
Chapter 9. Service Delivery: The Build Phase
9.1 Service Delivery Strategies
9.1.1 Pattern: Modern DevOps Methodology 9.1.2 Anti-pattern: Waterfall Methodology
9.2 The Virtuous Cycle of Quality 9.3 Build-Phase Steps
9.3.1 Develop 9.3.2 Commit 9.3.3 Build 9.3.4 Package 9.3.5 Register
9.4 Build Console 9.5 Continuous Integration 9.6 Packages as Handoff Interface 9.7 Summary Exercises
Chapter 10. Service Delivery: The Deployment Phase
10.1 Deployment-Phase Steps
10.1.1 Promotion 10.1.2 Installation 10.1.3 Configuration
10.2 Testing and Approval
10.2.1 Testing 10.2.2 Approval
10.3 Operations Console 10.4 Infrastructure Automation Strategies
10.4.1 Preparing Physical Machines 10.4.2 Preparing Virtual Machines 10.4.3 Installing OS and Services
10.5 Continuous Delivery 10.6 Infrastructure as Code 10.7 Other Platform Services 10.8 Summary Exercises
Chapter 11. Upgrading Live Services
11.1 Taking the Service Down for Upgrading 11.2 Rolling Upgrades 11.3 Canary 11.4 Phased Roll-outs 11.5 Proportional Shedding 11.6 Blue-Green Deployment 11.7 Toggling Features 11.8 Live Schema Changes 11.9 Live Code Changes 11.10 Continuous Deployment 11.11 Dealing with Failed Code Pushes 11.12 Release Atomicity 11.13 Summary Exercises
Chapter 12. Automation
12.1 Approaches to Automation
12.1.1 The Left-Over Principle 12.1.2 The Compensatory Principle 12.1.3 The Complementarity Principle 12.1.4 Automation for System Administration 12.1.5 Lessons Learned
12.2 Tool Building versus Automation
12.2.1 Example: Auto Manufacturing 12.2.2 Example: Machine Configuration 12.2.3 Example: Account Creation 12.2.4 Tools Are Good, But Automation Is Better
12.3 Goals of Automation 12.4 Creating Automation
12.4.1 Making Time to Automate 12.4.2 Reducing Toil 12.4.3 Determining What to Automate First
12.5 How to Automate 12.6 Language Tools
12.6.1 Shell Scripting Languages 12.6.2 Scripting Languages 12.6.3 Compiled Languages 12.6.4 Configuration Management Languages
12.7 Software Engineering Tools and Techniques
12.7.1 Issue Tracking Systems 12.7.2 Version Control Systems 12.7.3 Software Packaging 12.7.4 Style Guides 12.7.5 Test-Driven Development 12.7.6 Code Reviews 12.7.7 Writing Just Enough Code
12.8 Multitenant Systems 12.9 Summary Exercises
Chapter 13. Design Documents
13.1 Design Documents Overview
13.1.1 Documenting Changes and Rationale 13.1.2 Documentation as a Repository of Past Decisions
13.2 Design Document Anatomy 13.3 Template 13.4 Document Archive 13.5 Review Workflows
13.5.1 Reviewers and Approvers 13.5.2 Achieving Sign-off
13.6 Adopting Design Documents 13.7 Summary Exercises
Chapter 14. Oncall
14.1 Designing Oncall
14.1.1 Start with the SLA 14.1.2 Oncall Roster 14.1.3 Onduty 14.1.4 Oncall Schedule Design 14.1.5 The Oncall Calendar 14.1.6 Oncall Frequency 14.1.7 Types of Notifications 14.1.8 After-Hours Maintenance Coordination
14.2 Being Oncall
14.2.1 Pre-shift Responsibilities 14.2.2 Regular Oncall Responsibilities 14.2.3 Alert Responsibilities 14.2.4 Observe, Orient, Decide, Act (OODA) 14.2.5 Oncall Playbook 14.2.6 Third-Party Escalation 14.2.7 End-of-Shift Responsibilities
14.3 Between Oncall Shifts
14.3.1 Long-Term Fixes 14.3.2 Postmortems
14.4 Periodic Review of Alerts 14.5 Being Paged Too Much 14.6 Summary Exercises
Chapter 15. Disaster Preparedness
15.1 Mindset
15.1.1 Antifragile Systems 15.1.2 Reducing Risk
15.2 Individual Training: Wheel of Misfortune 15.3 Team Training: Fire Drills
15.3.1 Service Testing 15.3.2 Random Testing
15.4 Training for Organizations: Game Day/DiRT
15.4.1 Getting Started 15.4.2 Increasing Scope 15.4.3 Implementation and Logistics 15.4.4 Experiencing a DiRT Test
15.5 Incident Command System
15.5.1 How It Works: Public Safety Arena 15.5.2 How It Works: IT Operations Arena 15.5.3 Incident Action Plan 15.5.4 Best Practices 15.5.5 ICS Example
15.6 Summary Exercises
Chapter 16. Monitoring Fundamentals
16.1 Overview
16.1.1 Uses of Monitoring 16.1.2 Service Management
16.2 Consumers of Monitoring Information 16.3 What to Monitor 16.4 Retention 16.5 Meta-monitoring 16.6 Logs
16.6.1 Approach 16.6.2 Timestamps
16.7 Summary Exercises
Chapter 17. Monitoring Architecture and Practice
17.1 Sensing and Measurement
17.1.1 Blackbox versus Whitebox Monitoring 17.1.2 Direct versus Synthesized Measurements 17.1.3 Rate versus Capability Monitoring 17.1.4 Gauges versus Counters
17.2 Collection
17.2.1 Push versus Pull 17.2.2 Protocol Selection 17.2.3 Server Component versus Agent versus Poller 17.2.4 Central versus Regional Collectors
17.3 Analysis and Computation 17.4 Alerting and Escalation Manager
17.4.1 Alerting, Escalation, and Acknowledgments 17.4.2 Silence versus Inhibit
17.5 Visualization
17.5.1 Percentiles 17.5.2 Stack Ranking 17.5.3 Histograms
17.6 Storage 17.7 Configuration 17.8 Summary Exercises
Chapter 18. Capacity Planning
18.1 Standard Capacity Planning
18.1.1 Current Usage 18.1.2 Normal Growth 18.1.3 Planned Growth 18.1.4 Headroom 18.1.5 Resiliency 18.1.6 Timetable
18.2 Advanced Capacity Planning
18.2.1 Identifying Your Primary Resources 18.2.2 Knowing Your Capacity Limits 18.2.3 Identifying Your Core Drivers 18.2.4 Measuring Engagement 18.2.5 Analyzing the Data 18.2.6 Monitoring the Key Indicators 18.2.7 Delegating Capacity Planning
18.3 Resource Regression 18.4 Launching New Services 18.5 Reduce Provisioning Time 18.6 Summary Exercises
Chapter 19. Creating KPIs
19.1 What Is a KPI? 19.2 Creating KPIs
19.2.1 Step 1: Envision the Ideal 19.2.2 Step 2: Quantify Distance to the Ideal 19.2.3 Step 3: Imagine How Behavior Will Change 19.2.4 Step 4: Revise and Select 19.2.5 Step 5: Deploy the KPI
19.3 Example KPI: Machine Allocation
19.3.1 The First Pass 19.3.2 The Second Pass 19.3.3 Evaluating the KPI
19.4 Case Study: Error Budget
19.4.1 Conflicting Goals 19.4.2 A Unified Goal 19.4.3 Everyone Benefits
19.5 Summary Exercises
Chapter 20. Operational Excellence
20.1 What Does Operational Excellence Look Like? 20.2 How to Measure Greatness 20.3 Assessment Methodology
20.3.1 Operational Responsibilities 20.3.2 Assessment Levels 20.3.3 Assessment Questions and Look-For’s
20.4 Service Assessments
20.4.1 Identifying What to Assess 20.4.2 Assessing Each Service 20.4.3 Comparing Results across Services 20.4.4 Acting on the Results 20.4.5 Assessment and Project Planning Frequencies
20.5 Organizational Assessments 20.6 Levels of Improvement 20.7 Getting Started 20.8 Summary Exercises
Epilogue
Part III Appendices
Appendix A. Assessments
A.1 Regular Tasks (RT)
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.2 Emergency Response (ER)
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.3 Monitoring and Metrics (MM)
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.4 Capacity Planning (CP)
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.5 Change Management (CM)
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.6 New Product Introduction and Removal (NPI/NPR)
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.7 Service Deployment and Decommissioning (SDD)
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.8 Performance and Efficiency (PE)
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.9 Service Delivery: The Build Phase
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.10 Service Delivery: The Deployment Phase
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.11 Toil Reduction
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
A.12 Disaster Preparedness
Sample Assessment Questions Level 1: Initial Level 2: Repeatable Level 3: Defined Level 4: Managed Level 5: Optimizing
Appendix B. The Origins and Future of Distributed Computing and Clouds
B.1 The Pre-Web Era (1985–1994)
Availability Requirements Technology Scaling High Availability Costs
B.2 The First Web Era: The Bubble (1995–2000)
Availability Requirements Technology Scaling High Availability N + 1 Configurations N + 2 Configurations Costs
B.3 The Dot-Bomb Era (2000–2003)
Availability Requirements Technology High Availability Scaling Data Scaling Applicability Costs
B.4 The Second Web Era (2003–2010)
Availability Requirements Technology High Availability Scaling Costs
B.5 The Cloud Computing Era (2010–present)
Availability Requirements Costs Scaling and High Availability Technology
B.6 Conclusion Exercises
Appendix C. Scaling Terminology and Concepts
C.1 Constant, Linear, and Exponential Scaling C.2 Big O Notation C.3 Limitations of Big O Notation
Appendix D. Templates and Examples
D.1 Design Document Template D.2 Design Document Example D.3 Sample Postmortem Template
Appendix E. Recommended Reading
DevOps: ITIL: Theory: Classic Google Papers: Classic Facebook Papers: Scalability: UNIX Internals: UNIX Systems Programming: Network Protocols:
Bibliography Index
