Index
Web Operations: Keeping the Data on Time
SPECIAL OFFER: Upgrade this ebook with O’Reilly
Foreword
Preface
How This Book Is Organized Who This Book Is For Conventions Used in This Book Using Code Examples How to Contact Us Safari® Books Online Acknowledgments
1. Web Operations: The Career
Why Does Web Operations Have It Tough?
A Strong Background in Computing Practiced Decisiveness A Calm Disposition
From Apprentice to Master
Knowledge Tools Experience
The organizational challenge of inexperience The concept of "senior operations"
Discipline
Conclusion
2. How Picnik Uses Cloud Computing: Lessons Learned
Where the Cloud Fits (and Why!)
Storage Hybrid Computing with EC2
Where the Cloud Doesn't Fit (for Picnik) Conclusion
3. Infrastructure and Application Metrics
Time Resolution and Retention Concerns Locality of Metrics Collection and Storage Layers of Metrics
High-Level Business or Feature-Specific Metrics System- and Service-Level Metrics
Providing Context for Anomaly Detection and Alerts Log Lines Are Metrics, Too Correlation with Change Management and Incident Timelines Making Metrics Available to Your Alerting Mechanisms Using Metrics to Guide Load-Feedback Mechanisms A Metrics Collection System, Illustrated: Ganglia
Background A Quick Introduction to Ganglia
The need to keep collection and aggregation costs low The need to automatically discover new nodes and metrics The need to match network transport with your metrics collection task The need to implicitly prioritize cluster metrics The need to aggregate and organize metrics once they're collected The need to provide convenient interfaces for creating new metrics and pulling out existing metrics for correlation against other data
Conclusion
4. Continuous Deployment
Small Batches Mean Faster Feedback Small Batches Mean Problems Are Instantly Localized Small Batches Reduce Risk Small Batches Reduce Overhead The Quality Defenders' Lament
Why Does It Work?
Getting Started
Step 1: Continuous Integration Server Step 2: Source Control Commit Check Step 3: Simple Deployment Script Step 4: Real-Time Alerting Step 5: Root-Cause Analysis (Five Whys)
Continuous Deployment Is for Mission-Critical Applications
Another Release? Do I Have To? The QA Dilemma
Conclusion
5. Infrastructure As Code
Service-Oriented Architecture
Configuration Management
Configuration management is policy driven System automation is configuration management policy made into code Configuration management in system administration
System Integration
Step 1: Break the infrastructure down into reusable, network-accessible services
The bootstrapping service. The configuration service.
Step 2: Integrate the services together
Conclusion
6. Monitoring
Story: "The Start of a Journey" Step 1: Understand What You Are Monitoring Step 2: Understand Normal Behavior Step 3: Be Prepared and Learn Conclusion
7. How Complex Systems Fail
How Complex Systems Fail
(Being a Short Treatise on the Nature of Failure; How Failure Is Evaluated; How Failure Is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)
Complex systems are intrinsically hazardous systems Complex systems are heavily and successfully defended against failure Catastrophe requires multiple failures–single-point failures are not enough Complex systems contain changing mixtures of failures latent within them Complex systems run in degraded mode Catastrophe is always just around the corner Post-accident attribution to a "root cause" is fundamentally wrong Hindsight biases post-accident assessments of human performance Human operators have dual roles: as producers and as defenders against failure All practitioner actions are gambles Actions at the sharp end resolve all ambiguity Human practitioners are the adaptable element of complex systems Human expertise in complex systems is constantly changing Change introduces new forms of failure Views of "cause" limit the effectiveness of defenses against future events Safety is a characteristic of systems and not of their components People continuously create safety Failure-free operations require experience with failure
As It Pertains Specifically to Web Operations
It will be difficult to tell that the system has failed It will be difficult to tell what has failed Meaningful response will be delayed Communications will be strained and tempers will flare Maintenance will be a major source of new failures Recovery from backup is itself difficult and potentially dangerous Create test procedures that front-line people can use to verify system status Manage operations on a daily basis Control maintenance Assess performance at regular intervals Be a (unique) customer
Further Reading
8. Community Management and Web Operations
9. Dealing with Unexpected Traffic Spikes
How It All Started Alarms Abound Putting Out the Fire Surviving the Weekend Preparing for the Future CDN to the Rescue Proxy Servers Corralling the Stampede Streamlining the Codebase How Do We Know It Works? The Real Test Lessons Learned Improvements Since Then
10. Dev and Ops Collaboration and Cooperation
Deployment Shared, Open Infrastructure Trust On-call Developers
Live Debugging Tools Feature Flags
Avoiding Blame Conclusion
11. How Your Visitors Feel: User-Facing Metrics
Why Collect User-Facing Metrics?
Successful Start-ups Learn and Adapt Performance Matters Recent Research Quantifies the Relationship
What Makes a Site Slow?
Service Discovery Sending the Request Thinking About the Response Delivering the Response Asynchronous Traffic and Refresh Rendering Time
Measuring Delay
Synthetic Monitoring
When to use synthetic monitoring Limitations of synthetic monitoring Configuring synthetic monitoring
Real User Monitoring
When to use RUM Limitations of RUM Configuring RUM
Building an SLA
Apdex
Visitor Outcomes: Analytics
How Marketing Defines Success The Four Kinds of Sites A (Very) Basic Model of Analytics Correlating Performance and Analytics by Time Correlating Performance and Analytics by Visits
Other Metrics Marketing Cares About
Web Interaction Analytics Voice of the Customer
How User Experience Affects Web Ops
Many More Stakeholders Monitoring As Part of the Life Cycle, Not Just QA
The Future of Web Monitoring
Moving from Parts to Users Service-Centric Architectures Clouds and Monitoring APIs and RSS Feeds
Delivering an API to others Consuming an API from someone else
Rich Internet Applications HTML5: Server-Sent Events and WebSockets Online Communities and the Long Funnel Tying Together Mail and Conversion Loops The Capacity/Cost/Revenue Equation
Conclusion
12. Relational Database Strategy and Tactics for the Web
Requirements for Web Databases
Always On Mostly Transactional Workload Simple Data, Simple Queries Availability Trumps Consistency Rapid Development Online Deployment Built by Developers
How Typical Web Databases Grow
Single Server Master and Replication Slaves Functional Partitioning Sharding, or Horizontal Partitioning Caching Layer
The Yearning for a Cluster
The CAP Theorem and ACID Versus BASE State of MySQL Clustering
DRBD and Heartbeat Master-Master Replication Manager (MMM) Heartbeat with replication Proxy-based solutions InfiniDB, Galera, Tungsten, and ScaleDB Summary
Database Strategy
Architecture Requirements
Easy wins
Safe-Bet Architectures Risky Architectures
Sharding Writing to more than one master Multilevel replication Ring replication (beyond two nodes) Reliance on DNS The so-called Entity-Attribute-Value (EAV) design pattern
Database Tactics
Taking Backups on a Slave Online Schema Changes Monitoring, Graphing, and Instrumentation Analyzing Performance Archiving and Purging Data
Conclusion
13. How to Make Failure Beautiful: The Art and Science of Postmortems
The Worst Postmortem What Is a Postmortem? When to Conduct a Postmortem Who to Invite to a Postmortem Running a Postmortem Postmortem Follow-Up Conclusion
14. Storage
Data Asset Inventory Data Protection Capacity Planning Storage Sizing Operations Conclusion
15. Nonrelational Databases
NoSQL Database Overview
Pure Key/Value Data Structure Graph Document Oriented Highly Distributed
Some Systems in Detail
Cassandra HBase Riak CouchDB MongoDB Redis
Conclusion
16. Agile Infrastructure
Agile Infrastructure
But Agile Is Not the Only Thing That Has Evolved Some People Are Born to Web Operations, Some People Have Web Operations Thrust upon Them... Working Software Is the Primary Measure of Progress The Application Is the Infrastructure, the Infrastructure Is the Application
So, What's the Problem?
Talk Does Not Cook Rice
The infrastructure is an application Version control: The foundation of sanity Configuration management and automated deployments Monitoring Dev-test-prod life cycle, continuous integration, and disaster recovery Radiate information Reflective process improvement Incremental changes and refactoring The simplest thing that could work Separation of concerns Technical debt Continuous deployment Pairing Managing flow
Communities of Interest and Practice Trading Zones and Apologies
What to Do?
Conclusion
17. Things That Go Bump in the Night (and How to Sleep Through Them)
Definitions How Many 9s? Impact Duration Versus Incident Duration Datacenter Footprint Gradual Failures Trust Nobody Failover Testing Monitoring and History of Patterns Getting a Good Night's Sleep
A. Contributors
Index
About the Authors
Colophon
SPECIAL OFFER: Upgrade this ebook with O’Reilly