How many times have you heard comments like the following after a production incident?
“We were not prepared for that!”
“Dashboards were lighting up, we didn’t know where to look...”
“Alarms were going off, and we didn’t know why...”
The absolute worst time to try and learn about the weaknesses in your sociotechnical system is during an incident. There’s panic; there’s stress; there may even be anger. It’s hardly the time to be the person that suggests, “Shall we just step back a moment and ask ourselves how this all happened?” “It’s a little too late, it’s happening!” would be the reply, if you’re lucky and people are feeling inordinately polite.
Chaos engineering has the single goal of helping you collect evidence of system weaknesses before those weaknesses become incidents. While no guarantees can be made that you’ll ever find every one of the multitude of compound potential weaknesses in a system, or even just all the catastrophic ones, it is good engineering sense to be proactive about exploring your system’s weaknesses ahead of time.
So far you have hypotheses of how your system should respond in the event of turbulent conditions. The next step is to grab some tools and start breaking things in production, right? Wrong!
The cheapest way1 to get started with chaos engineering requires no tools. It requires your effort, your time, your team’s time, and ideally the time of anyone who has a stake in your system’s reliability. Effort and time are all you need to plan and run a Game Day.
A Game Day is a practice event, and although it can take a whole day, it usually requires only a few hours. The goal of a Game Day is to practice how you, your team, and your supporting systems deal with real-world turbulent conditions. You collectively get to explore:
How well your alerting systems perform
How your team members react to an incident
Whether you have any indication that your system is healthy or not
How your technical system responds to the turbulent conditions
A Game Day is not just a random event in which things break; it is a controlled, safe, and observed experiment through which you can collect evidence of how your sociotechnical system responds to the turbulent conditions that underpin your chaos engineering hypothesis. At its most basic it doesn’t require any tools at all—just a plan, a notepad and pen, and an awareness of what is going on and just how much you can learn from it!
A Game Day turns a hypothesis from your backlog into an experiment you can execute with your team to rapidly surface weaknesses in your entire sociotechnical system. You deliberately practice for unexpected conditions and capture evidence of any weaknesses in how you, your team, and your supporting systems deal with those conditions, choosing what to learn from and improve over time.
A Game Day can take any form you want as long as it ends up providing detailed and accurate evidence of system weaknesses. As you gain experience planning and running Game Days, you are likely to come up with all sorts of ideas about how you can improve the quality of your findings. For your first few Game Days, the following steps will help you get started successfully:
Pick a hypothesis to explore.
Pick a style.
Decide who participates in your Game Day and who observes.
Decide where your Game Day is going to happen.
Decide when your Game Day is going to start, and how long it will last.
Describe your Game Day experiment.
Get approval!
As you work through this chapter, you will build and then learn how to execute a Game Day plan—but before you can get started, you’ll need to pick a valuable hypothesis to explore during your Game Day.
Trying to decide what to explore in a Game Day can be a pain if all you have to go on is what people tell you worries them. Everyone will have different pet concerns, and all of those concerns are valid. As a Game Day is not inexpensive in terms of time and effort, you will feel the pressure to ensure that the time and attention are spent on something as valuable as possible.
If you created a Hypothesis Backlog earlier (see “Creating Your Hypothesis Backlog”), then you’re in a much better place. The hypotheses in your ever-changing and expanding backlog make it as easy as possible to collectively decide what to invest a Game Day in, as each one describes:
How you hope the system will respond
The turbulent conditions that could be involved (failures, traffic spikes, etc.)
The collective feel for how likely these turbulent conditions might be
The collective feel for how big an impact these conditions might have on the system (although the hypothesis should be that the system will survive in some fashion)
The valuable system qualities that you hope to build trust and confidence in by exploring the hypothesis
Together with your team and stakeholders, you can use these hypothesis descriptions to reach agreement on one that is valuable enough to explore in a Game Day. Then it’s time to consider the type of Game Day you are looking to perform.
Game Days come in an ever-increasing set of styles and formats. They tend to vary in terms of how much prior information is shared with the participants, and thus in what the scope might be for surfacing evidence of weaknesses in the system. A couple of popular Game Day styles are:
Where none of the participants are aware of the conditions they are walking into
Where the participants are told before the Game Day about the type of incident they are walking into
Picking a Game Day style is not just a question of taste. It’s important to consider what evidence you are likely to find if applying one style versus another. For example, an adversarial Dungeons & Dragons–style Game Day in which no one is aware of the actual problem beforehand will explore:
How the team detects turbulent conditions
How the team diagnoses turbulent conditions
How the team responds to turbulent conditions
An Informed in Advance–style Game Day will likely be limited to showing how the participants respond to turbulent conditions.
Deciding who will attend is one of the most important steps in planning your Game Day, but possibly not for the reason you’d expect. You might think that the more important decision is who is going to participate in your Game Day, but that’s usually straightforward: you invite your team and anyone else who would likely be involved if the turbulent conditions of your hypothesis were to happen “for real.” Add all of those names to your “Invite List”—see Figure 3-1.
Then it’s time to consider the second—and just as important—group for your Game Day: the people who are going to observe it.
This list should include everyone who might have any interest in the findings. Think broadly when compiling your observers list, from the CEO on down! You are trying to increase the awareness of the findings for your Game Day.
There are two things to decide when it comes to the question of “where” of your Game Day:
Where will everyone be?
Where will the turbulent conditions occur?
Ideally, the answer to the first question is “where they would be if the actual conditions occurred for real.” The second question is harder to answer. You could conduct the Game Day experiment by applying the turbulent conditions in production, but…you don’t want this Game Day to accidentally turn into a real incident!2
Often Game Day experiments limit their Blast Radius (the potential real-world impact of the experiment) to a safer environment than production, such as a staging environment. Nothing beats production for giving you the best possible evidence of real weaknesses, but if you run a Game Day and you take down production, that just might be the last bit of chaos engineering you’ll be asked to do!
A Game Day experiment can take up an entire day, but don’t be afraid to make it a lot less than a whole day either. Even just three hours is a long time to be in a crisis situation, and don’t think that there won’t be huge amounts of stress involved just because your participants know the issue is in a “safe” staging environment.
You should plan your Game Day for when you anticipate that the participants and observers are going to be most open to, and even eager to learn from, its findings—for example, before a team retrospective.
If your team does retrospectives regularly, then it’s often a great idea to execute your Game Days just before them. This very neatly helps to avoid the problem of having dull retrospectives—nothing beats dullness like walking into a retrospective armed with an emotionally fraught set of evidence of system weaknesses just experienced during a Game Day!
Now it’s time to construct your Game Day experiment. The experiment will include:
A set of measurements that indicate that the system is working in an expected way from a business perspective, and within a given set of tolerances
The set of activities you’re going to use to inject the turbulent conditions into the target system
A set of remediating actions through which you will attempt to repair what you have done knowingly in your experiment’s method
So far you have a backlog of hypotheses, but the emphasis of a steady-state hypothesis is a little different from the hypothesis card shown earlier in Figure 2-7.
“Steady-state” means that you can measure that the system is working in an expected way. In this case, that normal behavior is that “the system will meet its SLOs.” You need to decide what measurement, with what tolerance, would indicate that your system was in fact responding in this timely fashion. For example, you could decide to probe the system on a particular URL and ensure that the system responds with a tolerance that included an expected HTTP status code and within a given time that was set by a Service Level Objective.3
There might be multiple measurements that collectively indicate that your system is behaving, within tolerance, in a normal way. These measurements should be listed, along with their tolerances, as your Game Day’s steady-state hypothesis—see Figure 3-2.
Next, you need to capture what you’re going to do to your system to cause the turbulent conditions, and when you are going to perform those actions within the duration of your Game Day. This collection of actions is called the experiment’s method, and it should read as a list of actions that cause all the failures and other turbulent conditions you need to create for your Game Day—see Figure 3-3.
Finally, you capture a list of remediating actions—or rollbacks as they are usually called—that you can perform to put things back the way they were, to the best of your ability, because you know you caused problems in those areas when you executed the actions in your experiment’s method (see Figure 3-4).
Last but never least, make sure that you create a list of people who will need to be notified and approve of running the Game Day. This could be everyone from the CEO to up- or downstream systems, and on through to third-party users of the system. Add to your Game Day plan a “notifications and approvals” table (see Figure 3-5) and make sure you’ve chased everyone down before the Game Day is announced.
Your job when running your Game Day is to adopt the role of “Game Day facilitator.” This means that you aren’t a participant. Your job is to keep a detailed log of everything that happens so that this record can be used to elicit weaknesses after the event. Feel free to use every piece of technology or trick you can think of to record as many observations as possible during the Game Day; you never know what might lead to an interesting finding. For example, a participant staring off into space for 20 minutes might appear to be of no interest, but if you make a note of it and then ask them about it afterwards, you might find out that they were thinking, “I don’t even know where to start to look,” which is valuable information.
Sometimes you might be facilitating a Game Day experiment against a system that you have little personal knowledge of. This tends to happen when you are asked to help another team begin to adopt chaos engineering. If this is the case, be aware that you might need a “Safety Monitor.”4
A Safety Monitor is likely the most experienced expert on the system that is being used for the Game Day. You have to accept the compromise that they will not be a participant but will be working closely with you, the facilitator, instead. Their job is to alert you if the Game Day is going in a dangerous direction.
For instance, imagine that you planned for the Game Day to be exercised in a safe staging environment, but the team accidentally starts to diagnose and manipulate production. The Safety Monitor, with their expert knowledge of the system, will likely detect this sort of dangerous deviation and can advise you to halt the Game Day before any damage occurs beyond the expected Blast Radius of the experiment.
In this chapter you learned how to build and run your first chaos experiment as a manual Game Day. Game Days are very useful for exploring weaknesses collaboratively. If the Game Day is properly planned, the findings are hard to ignore, and you can achieve much with this relatively small investment in time and effort. Game Days are also perfect for exploring the human side of the system, which is where a huge amount of system weaknesses are often found.
However, Game Days are cheap in tooling but not in time and effort, so there are limitations when it comes to how often you can plan and run them. Each Game Day will help to build more trust and confidence in how your sociotechnical team will react to turbulent conditions, but if you can execute a Game Day only once a week, month, or quarter, then that’s the limit of how often you can learn from weaknesses. In the next part of the book, you’re going to learn how you can overcome these limitations of Game Days by creating automated chaos experiments.
1 Cheapest in terms of tooling, not in terms of time.
2 See “Consider a “Safety Monitor”” for how this can happen even when you don’t plan to use the production environment on Game Day.
3 Find more on how to set SLOs in Site Reliability Engineering, by Niall Richard Murphy et al. (O’Reilly).
4 You might need one regardless; in fact, always consider the possibility.
5 Yes, the name has been changed to protect the innocent.