Index
Web Operations: Keeping the Data on Time
SPECIAL OFFER: Upgrade this ebook with O’Reilly
Foreword
Preface
How This Book Is Organized
Who This Book Is For
Conventions Used in This Book
Using Code Examples
How to Contact Us
Safari® Books Online
Acknowledgments
1. Web Operations: The Career
Why Does Web Operations Have It Tough?
A Strong Background in Computing
Practiced Decisiveness
A Calm Disposition
From Apprentice to Master
Knowledge
Tools
Experience
The organizational challenge of inexperience
The concept of "senior operations"
Discipline
Conclusion
2. How Picnik Uses Cloud Computing: Lessons Learned
Where the Cloud Fits (and Why!)
Storage
Hybrid Computing with EC2
Where the Cloud Doesn't Fit (for Picnik)
Conclusion
3. Infrastructure and Application Metrics
Time Resolution and Retention Concerns
Locality of Metrics Collection and Storage
Layers of Metrics
High-Level Business or Feature-Specific Metrics
System- and Service-Level Metrics
Providing Context for Anomaly Detection and Alerts
Log Lines Are Metrics, Too
Correlation with Change Management and Incident Timelines
Making Metrics Available to Your Alerting Mechanisms
Using Metrics to Guide Load-Feedback Mechanisms
A Metrics Collection System, Illustrated: Ganglia
Background
A Quick Introduction to Ganglia
The need to keep collection and aggregation costs low
The need to automatically discover new nodes and metrics
The need to match network transport with your metrics collection task
The need to implicitly prioritize cluster metrics
The need to aggregate and organize metrics once they're collected
The need to provide convenient interfaces for creating new metrics and pulling out existing metrics for correlation against other data
Conclusion
4. Continuous Deployment
Small Batches Mean Faster Feedback
Small Batches Mean Problems Are Instantly Localized
Small Batches Reduce Risk
Small Batches Reduce Overhead
The Quality Defenders' Lament
Why Does It Work?
Getting Started
Step 1: Continuous Integration Server
Step 2: Source Control Commit Check
Step 3: Simple Deployment Script
Step 4: Real-Time Alerting
Step 5: Root-Cause Analysis (Five Whys)
Continuous Deployment Is for Mission-Critical Applications
Another Release? Do I Have To?
The QA Dilemma
Conclusion
5. Infrastructure As Code
Service-Oriented Architecture
Configuration Management
Configuration management is policy driven
System automation is configuration management policy made into code
Configuration management in system administration
System Integration
Step 1: Break the infrastructure down into reusable, network-accessible services
The bootstrapping service
The configuration service
Step 2: Integrate the services together
Conclusion
6. Monitoring
Story: "The Start of a Journey"
Step 1: Understand What You Are Monitoring
Step 2: Understand Normal Behavior
Step 3: Be Prepared and Learn
Conclusion
7. How Complex Systems Fail
How Complex Systems Fail
(Being a Short Treatise on the Nature of Failure; How Failure Is Evaluated; How Failure Is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)
Complex systems are intrinsically hazardous systems
Complex systems are heavily and successfully defended against failure
Catastrophe requires multiple failures – single-point failures are not enough
Complex systems contain changing mixtures of failures latent within them
Complex systems run in degraded mode
Catastrophe is always just around the corner
Post-accident attribution to a "root cause" is fundamentally wrong
Hindsight biases post-accident assessments of human performance
Human operators have dual roles: as producers and as defenders against failure
All practitioner actions are gambles
Actions at the sharp end resolve all ambiguity
Human practitioners are the adaptable element of complex systems
Human expertise in complex systems is constantly changing
Change introduces new forms of failure
Views of "cause" limit the effectiveness of defenses against future events
Safety is a characteristic of systems and not of their components
People continuously create safety
Failure-free operations require experience with failure
As It Pertains Specifically to Web Operations
It will be difficult to tell that the system has failed
It will be difficult to tell what has failed
Meaningful response will be delayed
Communications will be strained and tempers will flare
Maintenance will be a major source of new failures
Recovery from backup is itself difficult and potentially dangerous
Create test procedures that front-line people can use to verify system status
Manage operations on a daily basis
Control maintenance
Assess performance at regular intervals
Be a (unique) customer
Further Reading
8. Community Management and Web Operations
9. Dealing with Unexpected Traffic Spikes
How It All Started
Alarms Abound
Putting Out the Fire
Surviving the Weekend
Preparing for the Future
CDN to the Rescue
Proxy Servers
Corralling the Stampede
Streamlining the Codebase
How Do We Know It Works?
The Real Test
Lessons Learned
Improvements Since Then
10. Dev and Ops Collaboration and Cooperation
Deployment
Shared, Open Infrastructure
Trust
On-call Developers
Live Debugging Tools
Feature Flags
Avoiding Blame
Conclusion
11. How Your Visitors Feel: User-Facing Metrics
Why Collect User-Facing Metrics?
Successful Start-ups Learn and Adapt
Performance Matters
Recent Research Quantifies the Relationship
What Makes a Site Slow?
Service Discovery
Sending the Request
Thinking About the Response
Delivering the Response
Asynchronous Traffic and Refresh
Rendering Time
Measuring Delay
Synthetic Monitoring
When to use synthetic monitoring
Limitations of synthetic monitoring
Configuring synthetic monitoring
Real User Monitoring
When to use RUM
Limitations of RUM
Configuring RUM
Building an SLA
Apdex
Visitor Outcomes: Analytics
How Marketing Defines Success
The Four Kinds of Sites
A (Very) Basic Model of Analytics
Correlating Performance and Analytics by Time
Correlating Performance and Analytics by Visits
Other Metrics Marketing Cares About
Web Interaction Analytics
Voice of the Customer
How User Experience Affects Web Ops
Many More Stakeholders
Monitoring As Part of the Life Cycle, Not Just QA
The Future of Web Monitoring
Moving from Parts to Users
Service-Centric Architectures
Clouds and Monitoring
APIs and RSS Feeds
Delivering an API to others
Consuming an API from someone else
Rich Internet Applications
HTML5: Server-Sent Events and WebSockets
Online Communities and the Long Funnel
Tying Together Mail and Conversion Loops
The Capacity/Cost/Revenue Equation
Conclusion
12. Relational Database Strategy and Tactics for the Web
Requirements for Web Databases
Always On
Mostly Transactional Workload
Simple Data, Simple Queries
Availability Trumps Consistency
Rapid Development
Online Deployment
Built by Developers
How Typical Web Databases Grow
Single Server
Master and Replication Slaves
Functional Partitioning
Sharding, or Horizontal Partitioning
Caching Layer
The Yearning for a Cluster
The CAP Theorem and ACID Versus BASE
State of MySQL Clustering
DRBD and Heartbeat
Master-Master Replication Manager (MMM)
Heartbeat with replication
Proxy-based solutions
InfiniDB, Galera, Tungsten, and ScaleDB
Summary
Database Strategy
Architecture Requirements
Easy wins
Safe-Bet Architectures
Risky Architectures
Sharding
Writing to more than one master
Multilevel replication
Ring replication (beyond two nodes)
Reliance on DNS
The so-called Entity-Attribute-Value (EAV) design pattern
Database Tactics
Taking Backups on a Slave
Online Schema Changes
Monitoring, Graphing, and Instrumentation
Analyzing Performance
Archiving and Purging Data
Conclusion
13. How to Make Failure Beautiful: The Art and Science of Postmortems
The Worst Postmortem
What Is a Postmortem?
When to Conduct a Postmortem
Who to Invite to a Postmortem
Running a Postmortem
Postmortem Follow-Up
Conclusion
14. Storage
Data Asset Inventory
Data Protection
Capacity Planning
Storage Sizing
Operations
Conclusion
15. Nonrelational Databases
NoSQL Database Overview
Pure Key/Value
Data Structure
Graph
Document Oriented
Highly Distributed
Some Systems in Detail
Cassandra
HBase
Riak
CouchDB
MongoDB
Redis
Conclusion
16. Agile Infrastructure
Agile Infrastructure
But Agile Is Not the Only Thing That Has Evolved
Some People Are Born to Web Operations, Some People Have Web Operations Thrust upon Them...
Working Software Is the Primary Measure of Progress
The Application Is the Infrastructure, the Infrastructure Is the Application
So, What's the Problem?
Talk Does Not Cook Rice
The infrastructure is an application
Version control: The foundation of sanity
Configuration management and automated deployments
Monitoring
Dev-test-prod life cycle, continuous integration, and disaster recovery
Radiate information
Reflective process improvement
Incremental changes and refactoring
The simplest thing that could work
Separation of concerns
Technical debt
Continuous deployment
Pairing
Managing flow
Communities of Interest and Practice
Trading Zones and Apologies
What to Do?
Conclusion
17. Things That Go Bump in the Night (and How to Sleep Through Them)
Definitions
How Many 9s?
Impact Duration Versus Incident Duration
Datacenter Footprint
Gradual Failures
Trust Nobody
Failover Testing
Monitoring and History of Patterns
Getting a Good Night's Sleep
A. Contributors
Index
About the Authors
Colophon