The Practice of Cloud System Administration · Designing and Operating Large Distributed Systems, Volume 2 by Limoncelli, Thomas A.

Index

About This eBook
Title Page
Copyright Page
Contents at a Glance
Contents
Preface

About This Book
Acknowledgments

Part I Design: Building It
Part II Operations: Running It
Part III Appendices

About the Authors
Introduction

Business Objectives
Ideal System Architecture
Ideal Release Process
Ideal Operations

Part I Design: Building It

Chapter 1. Designing in a Distributed World

1.1 Visibility at Scale
1.2 The Importance of Simplicity
1.3 Composition

1.3.1 Load Balancer with Multiple Backend Replicas
1.3.2 Server with Multiple Backends
1.3.3 Server Tree

1.4 Distributed State
1.5 The CAP Principle

1.5.1 Consistency
1.5.2 Availability
1.5.3 Partition Tolerance

1.6 Loosely Coupled Systems
1.7 Speed
1.8 Summary
Exercises

Chapter 2. Designing for Operations

2.1 Operational Requirements

2.1.1 Configuration
2.1.2 Startup and Shutdown
2.1.3 Queue Draining
2.1.4 Software Upgrades
2.1.5 Backups and Restores
2.1.6 Redundancy
2.1.7 Replicated Databases
2.1.8 Hot Swaps
2.1.9 Toggles for Individual Features
2.1.10 Graceful Degradation
2.1.11 Access Controls and Rate Limits
2.1.12 Data Import Controls
2.1.13 Monitoring
2.1.14 Auditing
2.1.15 Debug Instrumentation
2.1.16 Exception Collection
2.1.17 Documentation for Operations

2.2 Implementing Design for Operations

2.2.1 Build Features in from the Beginning
2.2.2 Request Features as They Are Identified
2.2.3 Write the Features Yourself
2.2.4 Work with a Third-Party Vendor

2.3 Improving the Model
2.4 Summary
Exercises

Chapter 3. Selecting a Service Platform

3.1 Level of Service Abstraction

3.1.1 Infrastructure as a Service
3.1.2 Platform as a Service
3.1.3 Software as a Service

3.2 Type of Machine

3.2.1 Physical Machines
3.2.2 Virtual Machines
3.2.3 Containers

3.3 Level of Resource Sharing

3.3.1 Compliance
3.3.2 Privacy
3.3.3 Cost
3.3.4 Control

3.4 Colocation
3.5 Selection Strategies
3.6 Summary
Exercises

Chapter 4. Application Architectures

4.1 Single-Machine Web Server
4.2 Three-Tier Web Service

4.2.1 Load Balancer Types
4.2.2 Load Balancing Methods
4.2.3 Load Balancing with Shared State
4.2.4 User Identity
4.2.5 Scaling

4.3 Four-Tier Web Service

4.3.1 Frontends
4.3.2 Application Servers
4.3.3 Configuration Options

4.4 Reverse Proxy Service
4.5 Cloud-Scale Service

4.5.1 Global Load Balancer
4.5.2 Global Load Balancing Methods
4.5.3 Global Load Balancing with User-Specific Data
4.5.4 Internal Backbone

4.6 Message Bus Architectures

4.6.1 Message Bus Designs
4.6.2 Message Bus Reliability
4.6.3 Example 1: Link-Shortening Site
4.6.4 Example 2: Employee Human Resources Data Updates

4.7 Service-Oriented Architecture

4.7.1 Flexibility
4.7.2 Support
4.7.3 Best Practices

4.8 Summary
Exercises

Chapter 5. Design Patterns for Scaling

5.1 General Strategy

5.1.1 Identify Bottlenecks
5.1.2 Reengineer Components
5.1.3 Measure Results
5.1.4 Be Proactive

5.2 Scaling Up
5.3 The AKF Scaling Cube

5.3.1 x: Horizontal Duplication
5.3.2 y: Functional or Service Splits
5.3.3 z: Lookup-Oriented Split
5.3.4 Combinations

5.4 Caching

5.4.1 Cache Effectiveness
5.4.2 Cache Placement
5.4.3 Cache Persistence
5.4.4 Cache Replacement Algorithms
5.4.5 Cache Entry Invalidation
5.4.6 Cache Size

5.5 Data Sharding
5.6 Threading
5.7 Queueing

5.7.1 Benefits
5.7.2 Variations

5.8 Content Delivery Networks
5.9 Summary
Exercises

Chapter 6. Design Patterns for Resiliency

6.1 Software Resiliency Beats Hardware Reliability
6.2 Everything Malfunctions Eventually

6.2.1 MTBF in Distributed Systems
6.2.2 The Traditional Approach
6.2.3 The Distributed Computing Approach

6.3 Resiliency through Spare Capacity

6.3.1 How Much Spare Capacity
6.3.2 Load Sharing versus Hot Spares

6.4 Failure Domains
6.5 Software Failures

6.5.1 Software Crashes
6.5.2 Software Hangs
6.5.3 Query of Death

6.6 Physical Failures

6.6.1 Parts and Components
6.6.2 Machines
6.6.3 Load Balancers
6.6.4 Racks
6.6.5 Datacenters

6.7 Overload Failures

6.7.1 Traffic Surges
6.7.2 DoS and DDoS Attacks
6.7.3 Scraping Attacks

6.8 Human Error
6.9 Summary
Exercises

Part II Operations: Running It

Chapter 7. Operations in a Distributed World

7.1 Distributed Systems Operations

7.1.1 SRE versus Traditional Enterprise IT
7.1.2 Change versus Stability
7.1.3 Defining SRE
7.1.4 Operations at Scale

7.2 Service Life Cycle

7.2.1 Service Launches
7.2.2 Service Decommissioning

7.3 Organizing Strategy for Operational Teams

7.3.1 Team Member Day Types
7.3.2 Other Strategies

7.4 Virtual Office

7.4.1 Communication Mechanisms
7.4.2 Communication Policies

7.5 Summary
Exercises

Chapter 8. DevOps Culture

8.1 What Is DevOps?

8.1.1 The Traditional Approach
8.1.2 The DevOps Approach

8.2 The Three Ways of DevOps

8.2.1 The First Way: Workflow
8.2.2 The Second Way: Improve Feedback
8.2.3 The Third Way: Continual Experimentation and Learning
8.2.4 Small Batches Are Better
8.2.5 Adopting the Strategies

8.3 History of DevOps

8.3.1 Evolution
8.3.2 Site Reliability Engineering

8.4 DevOps Values and Principles

8.4.1 Relationships
8.4.2 Integration
8.4.3 Automation
8.4.4 Continuous Improvement
8.4.5 Common Nontechnical DevOps Practices
8.4.6 Common Technical DevOps Practices
8.4.7 Release Engineering DevOps Practices

8.5 Converting to DevOps

8.5.1 Getting Started
8.5.2 DevOps at the Business Level

8.6 Agile and Continuous Delivery

8.6.1 What Is Agile?
8.6.2 What Is Continuous Delivery?

8.7 Summary
Exercises

Chapter 9. Service Delivery: The Build Phase

9.1 Service Delivery Strategies

9.1.1 Pattern: Modern DevOps Methodology
9.1.2 Anti-pattern: Waterfall Methodology

9.2 The Virtuous Cycle of Quality
9.3 Build-Phase Steps

9.3.1 Develop
9.3.2 Commit
9.3.3 Build
9.3.4 Package
9.3.5 Register

9.4 Build Console
9.5 Continuous Integration
9.6 Packages as Handoff Interface
9.7 Summary
Exercises

Chapter 10. Service Delivery: The Deployment Phase

10.1 Deployment-Phase Steps

10.1.1 Promotion
10.1.2 Installation
10.1.3 Configuration

10.2 Testing and Approval

10.2.1 Testing
10.2.2 Approval

10.3 Operations Console
10.4 Infrastructure Automation Strategies

10.4.1 Preparing Physical Machines
10.4.2 Preparing Virtual Machines
10.4.3 Installing OS and Services

10.5 Continuous Delivery
10.6 Infrastructure as Code
10.7 Other Platform Services
10.8 Summary
Exercises

Chapter 11. Upgrading Live Services

11.1 Taking the Service Down for Upgrading
11.2 Rolling Upgrades
11.3 Canary
11.4 Phased Roll-outs
11.5 Proportional Shedding
11.6 Blue-Green Deployment
11.7 Toggling Features
11.8 Live Schema Changes
11.9 Live Code Changes
11.10 Continuous Deployment
11.11 Dealing with Failed Code Pushes
11.12 Release Atomicity
11.13 Summary
Exercises

Chapter 12. Automation

12.1 Approaches to Automation

12.1.1 The Left-Over Principle
12.1.2 The Compensatory Principle
12.1.3 The Complementarity Principle
12.1.4 Automation for System Administration
12.1.5 Lessons Learned

12.2 Tool Building versus Automation

12.2.1 Example: Auto Manufacturing
12.2.2 Example: Machine Configuration
12.2.3 Example: Account Creation
12.2.4 Tools Are Good, But Automation Is Better

12.3 Goals of Automation
12.4 Creating Automation

12.4.1 Making Time to Automate
12.4.2 Reducing Toil
12.4.3 Determining What to Automate First

12.5 How to Automate
12.6 Language Tools

12.6.1 Shell Scripting Languages
12.6.2 Scripting Languages
12.6.3 Compiled Languages
12.6.4 Configuration Management Languages

12.7 Software Engineering Tools and Techniques

12.7.1 Issue Tracking Systems
12.7.2 Version Control Systems
12.7.3 Software Packaging
12.7.4 Style Guides
12.7.5 Test-Driven Development
12.7.6 Code Reviews
12.7.7 Writing Just Enough Code

12.8 Multitenant Systems
12.9 Summary
Exercises

Chapter 13. Design Documents

13.1 Design Documents Overview

13.1.1 Documenting Changes and Rationale
13.1.2 Documentation as a Repository of Past Decisions

13.2 Design Document Anatomy
13.3 Template
13.4 Document Archive
13.5 Review Workflows

13.5.1 Reviewers and Approvers
13.5.2 Achieving Sign-off

13.6 Adopting Design Documents
13.7 Summary
Exercises

Chapter 14. Oncall

14.1 Designing Oncall

14.1.1 Start with the SLA
14.1.2 Oncall Roster
14.1.3 Onduty
14.1.4 Oncall Schedule Design
14.1.5 The Oncall Calendar
14.1.6 Oncall Frequency
14.1.7 Types of Notifications
14.1.8 After-Hours Maintenance Coordination

14.2 Being Oncall

14.2.1 Pre-shift Responsibilities
14.2.2 Regular Oncall Responsibilities
14.2.3 Alert Responsibilities
14.2.4 Observe, Orient, Decide, Act (OODA)
14.2.5 Oncall Playbook
14.2.6 Third-Party Escalation
14.2.7 End-of-Shift Responsibilities

14.3 Between Oncall Shifts

14.3.1 Long-Term Fixes
14.3.2 Postmortems

14.4 Periodic Review of Alerts
14.5 Being Paged Too Much
14.6 Summary
Exercises

Chapter 15. Disaster Preparedness

15.1 Mindset

15.1.1 Antifragile Systems
15.1.2 Reducing Risk

15.2 Individual Training: Wheel of Misfortune
15.3 Team Training: Fire Drills

15.3.1 Service Testing
15.3.2 Random Testing

15.4 Training for Organizations: Game Day/DiRT

15.4.1 Getting Started
15.4.2 Increasing Scope
15.4.3 Implementation and Logistics
15.4.4 Experiencing a DiRT Test

15.5 Incident Command System

15.5.1 How It Works: Public Safety Arena
15.5.2 How It Works: IT Operations Arena
15.5.3 Incident Action Plan
15.5.4 Best Practices
15.5.5 ICS Example

15.6 Summary
Exercises

Chapter 16. Monitoring Fundamentals

16.1 Overview

16.1.1 Uses of Monitoring
16.1.2 Service Management

16.2 Consumers of Monitoring Information
16.3 What to Monitor
16.4 Retention
16.5 Meta-monitoring
16.6 Logs

16.6.1 Approach
16.6.2 Timestamps

16.7 Summary
Exercises

Chapter 17. Monitoring Architecture and Practice

17.1 Sensing and Measurement

17.1.1 Blackbox versus Whitebox Monitoring
17.1.2 Direct versus Synthesized Measurements
17.1.3 Rate versus Capability Monitoring
17.1.4 Gauges versus Counters

17.2 Collection

17.2.1 Push versus Pull
17.2.2 Protocol Selection
17.2.3 Server Component versus Agent versus Poller
17.2.4 Central versus Regional Collectors

17.3 Analysis and Computation
17.4 Alerting and Escalation Manager

17.4.1 Alerting, Escalation, and Acknowledgments
17.4.2 Silence versus Inhibit

17.5 Visualization

17.5.1 Percentiles
17.5.2 Stack Ranking
17.5.3 Histograms

17.6 Storage
17.7 Configuration
17.8 Summary
Exercises

Chapter 18. Capacity Planning

18.1 Standard Capacity Planning

18.1.1 Current Usage
18.1.2 Normal Growth
18.1.3 Planned Growth
18.1.4 Headroom
18.1.5 Resiliency
18.1.6 Timetable

18.2 Advanced Capacity Planning

18.2.1 Identifying Your Primary Resources
18.2.2 Knowing Your Capacity Limits
18.2.3 Identifying Your Core Drivers
18.2.4 Measuring Engagement
18.2.5 Analyzing the Data
18.2.6 Monitoring the Key Indicators
18.2.7 Delegating Capacity Planning

18.3 Resource Regression
18.4 Launching New Services
18.5 Reduce Provisioning Time
18.6 Summary
Exercises

Chapter 19. Creating KPIs

19.1 What Is a KPI?
19.2 Creating KPIs

19.2.1 Step 1: Envision the Ideal
19.2.2 Step 2: Quantify Distance to the Ideal
19.2.3 Step 3: Imagine How Behavior Will Change
19.2.4 Step 4: Revise and Select
19.2.5 Step 5: Deploy the KPI

19.3 Example KPI: Machine Allocation

19.3.1 The First Pass
19.3.2 The Second Pass
19.3.3 Evaluating the KPI

19.4 Case Study: Error Budget

19.4.1 Conflicting Goals
19.4.2 A Unified Goal
19.4.3 Everyone Benefits

19.5 Summary
Exercises

Chapter 20. Operational Excellence

20.1 What Does Operational Excellence Look Like?
20.2 How to Measure Greatness
20.3 Assessment Methodology

20.3.1 Operational Responsibilities
20.3.2 Assessment Levels
20.3.3 Assessment Questions and Look-For’s

20.4 Service Assessments

20.4.1 Identifying What to Assess
20.4.2 Assessing Each Service
20.4.3 Comparing Results across Services
20.4.4 Acting on the Results
20.4.5 Assessment and Project Planning Frequencies

20.5 Organizational Assessments
20.6 Levels of Improvement
20.7 Getting Started
20.8 Summary
Exercises

Epilogue

Part III Appendices

Appendix A. Assessments

A.1 Regular Tasks (RT)