Chapter 3. Planning and Running a Manual Game Day

How many times have you heard comments like the following after a production incident?

“We were not prepared for that!”

“Dashboards were lighting up, we didn’t know where to look...”

“Alarms were going off, and we didn’t know why...”

The absolute worst time to try and learn about the weaknesses in your sociotechnical system is during an incident. There’s panic; there’s stress; there may even be anger. It’s hardly the time to be the person that suggests, “Shall we just step back a moment and ask ourselves how this all happened?” “It’s a little too late, it’s happening!” would be the reply, if you’re lucky and people are feeling inordinately polite.

Chaos engineering has the single goal of helping you collect evidence of system weaknesses before those weaknesses become incidents. While no guarantees can be made that you’ll ever find every one of the multitude of compound potential weaknesses in a system, or even just all the catastrophic ones, it is good engineering sense to be proactive about exploring your system’s weaknesses ahead of time.

So far you have hypotheses of how your system should respond in the event of turbulent conditions. The next step is to grab some tools and start breaking things in production, right? Wrong!

The cheapest way1 to get started with chaos engineering requires no tools. It requires your effort, your time, your team’s time, and ideally the time of anyone who has a stake in your system’s reliability. Effort and time are all you need to plan and run a Game Day.

Planning Your Game Day

A Game Day can take any form you want as long as it ends up providing detailed and accurate evidence of system weaknesses. As you gain experience planning and running Game Days, you are likely to come up with all sorts of ideas about how you can improve the quality of your findings. For your first few Game Days, the following steps will help you get started successfully:

  1. Pick a hypothesis to explore.

  2. Pick a style.

  3. Decide who participates in your Game Day and who observes.

  4. Decide where your Game Day is going to happen.

  5. Decide when your Game Day is going to start, and how long it will last.

  6. Describe your Game Day experiment.

  7. Get approval!

As you work through this chapter, you will build and then learn how to execute a Game Day plan—but before you can get started, you’ll need to pick a valuable hypothesis to explore during your Game Day.

Decide Who Participates and Who Observes

Deciding who will attend is one of the most important steps in planning your Game Day, but possibly not for the reason you’d expect. You might think that the more important decision is who is going to participate in your Game Day, but that’s usually straightforward: you invite your team and anyone else who would likely be involved if the turbulent conditions of your hypothesis were to happen “for real.” Add all of those names to your “Invite List”—see Figure 3-1.

An image of the Game Day Participants and Observers List.
Figure 3-1. Game Day participants and observers list

Then it’s time to consider the second—and just as important—group for your Game Day: the people who are going to observe it.

This list should include everyone who might have any interest in the findings. Think broadly when compiling your observers list, from the CEO on down! You are trying to increase the awareness of the findings for your Game Day.

Decide Where

There are two things to decide when it comes to the question of “where” of your Game Day:

  • Where will everyone be?

  • Where will the turbulent conditions occur?

Ideally, the answer to the first question is “where they would be if the actual conditions occurred for real.” The second question is harder to answer. You could conduct the Game Day experiment by applying the turbulent conditions in production, but…you don’t want this Game Day to accidentally turn into a real incident!2

Often Game Day experiments limit their Blast Radius (the potential real-world impact of the experiment) to a safer environment than production, such as a staging environment. Nothing beats production for giving you the best possible evidence of real weaknesses, but if you run a Game Day and you take down production, that just might be the last bit of chaos engineering you’ll be asked to do!

Describe Your Game Day Experiment

Now it’s time to construct your Game Day experiment. The experiment will include:

A steady-state hypothesis

A set of measurements that indicate that the system is working in an expected way from a business perspective, and within a given set of tolerances

A method

The set of activities you’re going to use to inject the turbulent conditions into the target system

Rollbacks

A set of remediating actions through which you will attempt to repair what you have done knowingly in your experiment’s method

So far you have a backlog of hypotheses, but the emphasis of a steady-state hypothesis is a little different from the hypothesis card shown earlier in Figure 2-7.

“Steady-state” means that you can measure that the system is working in an expected way. In this case, that normal behavior is that “the system will meet its SLOs.” You need to decide what measurement, with what tolerance, would indicate that your system was in fact responding in this timely fashion. For example, you could decide to probe the system on a particular URL and ensure that the system responds with a tolerance that included an expected HTTP status code and within a given time that was set by a Service Level Objective.3

There might be multiple measurements that collectively indicate that your system is behaving, within tolerance, in a normal way. These measurements should be listed, along with their tolerances, as your Game Day’s steady-state hypothesis—see Figure 3-2.

An image of the steady-state-hypothesis.
Figure 3-2. Chaos experiment steady-state hypothesis

Next, you need to capture what you’re going to do to your system to cause the turbulent conditions, and when you are going to perform those actions within the duration of your Game Day. This collection of actions is called the experiment’s method, and it should read as a list of actions that cause all the failures and other turbulent conditions you need to create for your Game Day—see Figure 3-3.

An image of the experiment method.
Figure 3-3. Chaos experiment method

Finally, you capture a list of remediating actions—or rollbacks as they are usually called—that you can perform to put things back the way they were, to the best of your ability, because you know you caused problems in those areas when you executed the actions in your experiment’s method (see Figure 3-4).

An image of the experiment complete with rollback.
Figure 3-4. Chaos experiment complete with rollbacks

Get Approval!

Last but never least, make sure that you create a list of people who will need to be notified and approve of running the Game Day. This could be everyone from the CEO to up- or downstream systems, and on through to third-party users of the system. Add to your Game Day plan a “notifications and approvals” table (see Figure 3-5) and make sure you’ve chased everyone down before the Game Day is announced.

An image of the Game Day Notifications and Approvals.
Figure 3-5. Table of notifications and approvals

Running the Game Day

Your job when running your Game Day is to adopt the role of “Game Day facilitator.” This means that you aren’t a participant. Your job is to keep a detailed log of everything that happens so that this record can be used to elicit weaknesses after the event. Feel free to use every piece of technology or trick you can think of to record as many observations as possible during the Game Day; you never know what might lead to an interesting finding. For example, a participant staring off into space for 20 minutes might appear to be of no interest, but if you make a note of it and then ask them about it afterwards, you might find out that they were thinking, “I don’t even know where to start to look,” which is valuable information.

Consider a “Safety Monitor”

Sometimes you might be facilitating a Game Day experiment against a system that you have little personal knowledge of. This tends to happen when you are asked to help another team begin to adopt chaos engineering. If this is the case, be aware that you might need a “Safety Monitor.”4

A Safety Monitor is likely the most experienced expert on the system that is being used for the Game Day. You have to accept the compromise that they will not be a participant but will be working closely with you, the facilitator, instead. Their job is to alert you if the Game Day is going in a dangerous direction.

For instance, imagine that you planned for the Game Day to be exercised in a safe staging environment, but the team accidentally starts to diagnose and manipulate production. The Safety Monitor, with their expert knowledge of the system, will likely detect this sort of dangerous deviation and can advise you to halt the Game Day before any damage occurs beyond the expected Blast Radius of the experiment.

1 Cheapest in terms of tooling, not in terms of time.

2 See “Consider a “Safety Monitor”” for how this can happen even when you don’t plan to use the production environment on Game Day.

3 Find more on how to set SLOs in Site Reliability Engineering, by Niall Richard Murphy et al. (O’Reilly).

4 You might need one regardless; in fact, always consider the possibility.

5 Yes, the name has been changed to protect the innocent.