Verifying fault tolerance with Gameday exercises

This chapter contains recipes that should help you create more reliable, resilient microservice architectures. Each recipe documents a pattern or technique for anticipating and dealing with some kind of failure scenario. Our aim when building resilient systems is to tolerate failure with as little impact on our users as possible. Anticipating and designing for failure is essential when building distributed systems, but without verifying that our systems handle failure in the ways we expect, we aren't doing much more than hoping, and hope is definitely not a strategy!

When building systems, unit and functional tests are necessary parts of our confidence-building toolkit. However, these tools alone are not enough. Unit and functional tests work by isolating dependencies. Good unit tests, for instance, don't rely on network conditions, and functional tests don't run under production-level traffic; instead, they focus on whether software components work together properly under ideal conditions. To gain more confidence in the fault tolerance of a system, it's necessary to observe it responding to failure in production.

Gameday exercises are another useful tool for building confidence in the resiliency of a system. These exercises involve forcing certain failure scenarios in production to verify that our assumptions about fault tolerance match reality. John Allspaw describes this practice in detail in his paper "Fault Injection in Production". If we accept that failure is impossible to avoid completely, it becomes sensible to force failure as a planned exercise and observe how our system responds to it. It’s better to have a system fail for the first time while an entire team is watching and ready to take action than at 3 a.m., when a system alert wakes up an on-call engineer.

Planning a Gameday exercise is valuable in its own right. Engineers should get together and brainstorm the various failure scenarios their service is likely to experience. Work should then be scheduled to try to reduce or eliminate the impact of those scenarios (for example, in the event of database failure, revert to a cache). Each Gameday exercise should have a planning document that describes the system being tested and the failure scenarios to be exercised. For each scenario, the document should list the steps that will be taken to simulate the failure, the expectations for how the system should respond, and the expected impact on users (if any). As the Gameday exercise proceeds, the team should work through each of the scenarios, documenting observations. It’s important to verify that the metrics we expect to see emitted are being emitted, that the alerts we expect to fire do indeed fire, and that the failure is handled in the way we expect. As observations are made, document any differences between expectations and reality. These differences should become planned work to bridge the gap between our ideal world and the real world.
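
Although a Gameday exercise is a process rather than a program, the planning document described above can also be captured as structured data, which makes the gap between expectations and reality easy to list afterward. The following is only a minimal Python sketch under that assumption. The names `Scenario`, `GamedayPlan`, `record`, and `gaps`, along with the `orders-service` example and its observation, are hypothetical and used purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Scenario:
    """One failure scenario from the planning document (hypothetical structure)."""
    name: str
    simulation_steps: List[str]   # how the failure will be forced
    expected_response: str        # how we expect the system to react
    expected_user_impact: str     # "none" if users should not notice
    observation: Optional[str] = None    # filled in during the exercise
    as_expected: Optional[bool] = None   # the team's judgment; None until observed

    def record(self, observation: str, as_expected: bool) -> None:
        """Record what the team actually saw for this scenario."""
        self.observation = observation
        self.as_expected = as_expected


@dataclass
class GamedayPlan:
    system: str
    scenarios: List[Scenario] = field(default_factory=list)

    def gaps(self) -> List[Scenario]:
        """Scenarios where reality differed from expectations (candidate planned work)."""
        return [s for s in self.scenarios if s.as_expected is False]


plan = GamedayPlan(
    system="orders-service",  # hypothetical service name
    scenarios=[
        Scenario(
            name="database failure",
            simulation_steps=["block traffic from the service to the database"],
            expected_response="reads are served from the cache",
            expected_user_impact="none",
        ),
    ],
)

# During the exercise, the team records what actually happened (made-up observation).
plan.scenarios[0].record("requests failed instead of using the cache", as_expected=False)

for scenario in plan.gaps():
    print(f"Follow-up needed: {scenario.name}")
```

Keeping `as_expected` as an explicit judgment recorded by the team, rather than comparing strings automatically, reflects that deciding whether a failure was handled properly usually takes human interpretation of metrics and alerts.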

Instead of walking through code, this recipe will demonstrate a process and template that can be used to run Gameday exercises. The following is not the only way to conduct Gameday exercises, but it should serve as a good starting point for your organization.