Chapter 9. Chaos and Operations

If chaos engineering were just about surfacing evidence of system weaknesses through Game Days and automated chaos experiments, then life would be less complicated. Less complicated, but also much less safe!

In the case of Game Days, much safety can be achieved by executing the Game Day against a sandbox environment and ensuring that everyone—participants, observers, and external parties—is aware the Game Day is happening.1

The challenge is harder with automated chaos experiments. Automated experiments could potentially be executed by anyone, at any time, and possibly against any system.2 There are two main categories of operational concern when it comes to your automated chaos experiments (Figure 9-1):

Control

You or other members of your team may want to seize control of a running experiment. For example, you may want to shut the experiment down immediately, or you may simply want to be asked whether a particularly dangerous step should be executed or skipped.

Observation

You want your experiments to be debuggable as they run in production. You should be able to see which experiments are currently running and which step each has just executed, and then trace that back to how other parts of your system are behaving in parallel.

Figure 9-1. The control and observation operational concerns of a running automated chaos experiment

There are many implementations and system integrations necessary to support these two concerns, including dashboards, centralized logging systems, management and monitoring consoles, and distributed tracing systems; you can even surface information into Slack!3 The Chaos Toolkit can meet all these needs by providing both control and observation in one operational API.

Experiment “Controls”

A Chaos Toolkit control listens to the execution of an experiment and, if it decides to do so, has the power to change or skip the execution of an activity, such as a probe or an action; it is even powerful enough to abort the whole experiment execution!
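As a taste, here is a sketch of a control that aborts the run when a risky activity is about to execute (controls are implemented as callback functions, explained next). The stand-in exception class and the "dangerous-" naming convention are assumptions for illustration; a real control raises chaoslib's InterruptExecution:

```python
# Sketch of a control callback that aborts the run before a risky
# activity executes. In a real control you would raise
# chaoslib.exceptions.InterruptExecution; a stand-in class is
# defined here so the sketch is self-contained.
from typing import Any, Dict


class InterruptExecution(Exception):
    """Stand-in for chaoslib.exceptions.InterruptExecution."""


# Stand-in for the chaoslib Activity type.
Activity = Dict[str, Any]


def before_activity_control(context: Activity, **kwargs):
    # The "dangerous-" prefix is an illustrative convention,
    # not something the Chaos Toolkit defines.
    if context.get("name", "").startswith("dangerous-"):
        raise InterruptExecution("refusing to run a dangerous activity")
```

Raising the exception stops the experiment run entirely, rather than merely skipping the offending activity.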

A control intercepts the flow of an experiment by implementing the corresponding callback functions, doing any necessary work at those points as it sees fit. When a control is enabled in the Chaos Toolkit, the toolkit invokes whichever callback functions the control makes available as an experiment executes (see Figure 9-2).

Figure 9-2. A control’s functions, if implemented, are invoked throughout the execution of an experiment

Each callback function is passed any context that is available at that point in the experiment’s execution. For example, the following shows the context that is passed to the after_hypothesis_control function:

def after_hypothesis_control(context: Hypothesis,
                             state: Dict[str, Any],
                             configuration: Configuration = None,
                             secrets: Secrets = None, **kwargs):

In this case, the steady-state hypothesis itself is passed as the context to the after_hypothesis_control function. The control’s callback function can then decide whether to proceed, to do work such as sending a message to some other system, or even to amend the hypothesis if it is useful to do so. This is why a control is so powerful: it can observe and control the execution of an experiment as it is running.
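As a minimal sketch, a control that simply reports the outcome of each hypothesis analysis might look like the following. Plain dicts stand in for the chaoslib types so the sketch runs standalone, and the "steady_state_met" key in the state dictionary is an assumption about how the toolkit shapes the hypothesis result:

```python
# A minimal sketch of a control that reports on each hypothesis
# analysis. Plain dicts stand in for the chaoslib types so the
# sketch is self-contained; the "steady_state_met" key is assumed
# to carry the outcome of the analysis.
from typing import Any, Dict

Hypothesis = Dict[str, Any]
Configuration = Dict[str, Any]
Secrets = Dict[str, Any]


def after_hypothesis_control(context: Hypothesis,
                             state: Dict[str, Any],
                             configuration: Configuration = None,
                             secrets: Secrets = None, **kwargs):
    # context is the steady-state hypothesis declaration itself;
    # state is the result of analyzing it.
    title = context.get("title", "untitled hypothesis")
    if state.get("steady_state_met"):
        print(f"'{title}' held")
    else:
        print(f"'{title}' deviated!")
```

A control like this could just as easily push the same message to a chat channel or a logging system instead of printing it.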

A Chaos Toolkit control is implemented in Python and provides the following full set of callback functions:

from typing import Any, Dict, List

from chaoslib.types import (Activity, Configuration, Experiment,
                            Hypothesis, Journal, Run, Secrets)


def configure_control(config: Configuration, secrets: Secrets):
    # Triggered before an experiment's execution.
    # Useful for initialization code for the control.
    ...

def cleanup_control():
    # Triggered at the end of an experiment's run.
    # Useful for cleanup code for the control.
    ...

def before_experiment_control(context: Experiment,
                              configuration: Configuration = None,
                              secrets: Secrets = None, **kwargs):
    # Triggered before an experiment's execution.
    ...

def after_experiment_control(context: Experiment, state: Journal,
                             configuration: Configuration = None,
                             secrets: Secrets = None, **kwargs):
    # Triggered after an experiment's execution.
    ...

def before_hypothesis_control(context: Hypothesis,
                              configuration: Configuration = None,
                              secrets: Secrets = None, **kwargs):
    # Triggered before a hypothesis is analyzed.
    ...

def after_hypothesis_control(context: Hypothesis,
                             state: Dict[str, Any],
                             configuration: Configuration = None,
                             secrets: Secrets = None, **kwargs):
    # Triggered after a hypothesis is analyzed.
    ...

def before_method_control(context: Experiment,
                          configuration: Configuration = None,
                          secrets: Secrets = None, **kwargs):
    # Triggered before an experiment's method is executed.
    ...

def after_method_control(context: Experiment, state: List[Run],
                         configuration: Configuration = None,
                         secrets: Secrets = None, **kwargs):
    # Triggered after an experiment's method is executed.
    ...

def before_rollback_control(context: Experiment,
                            configuration: Configuration = None,
                            secrets: Secrets = None, **kwargs):
    # Triggered before an experiment's rollbacks block
    # is executed.
    ...

def after_rollback_control(context: Experiment, state: List[Run],
                           configuration: Configuration = None,
                           secrets: Secrets = None, **kwargs):
    # Triggered after an experiment's rollbacks block
    # is executed.
    ...

def before_activity_control(context: Activity,
                            configuration: Configuration = None,
                            secrets: Secrets = None, **kwargs):
    # Triggered before any of an experiment's activities
    # (probes, actions) is executed.
    ...

def after_activity_control(context: Activity, state: Run,
                           configuration: Configuration = None,
                           secrets: Secrets = None, **kwargs):
    # Triggered after any of an experiment's activities
    # (probes, actions) is executed.
    ...

Using these callback functions, the Chaos Toolkit can trigger a host of operational integrations that passively listen to the execution of an experiment and broadcast that information to anyone interested, or that even intervene in the experiment's execution.
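Controls are opt-in: a control implemented as a Python module like the ones above is typically enabled by declaring it in the experiment's top-level controls block. A sketch, assuming the JSON experiment format (the control name and the module path mypkg.mycontrol are hypothetical):

```json
{
    "controls": [
        {
            "name": "my-control",
            "provider": {
                "type": "python",
                "module": "mypkg.mycontrol"
            }
        }
    ]
}
```

With this declaration in place, the Chaos Toolkit imports the named module and invokes whichever of the callback functions it implements at the appropriate points in the run.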

1 See Chapter 3.

2 That is, any system that the user can reach from their own computer.

3 Check out the free ebook Chaos Engineering Observability I wrote for O’Reilly in collaboration with Humio for an explanation of how information from running experiments can be surfaced in Slack using the Chaos Toolkit.

4 “Ideally” because, as explained in the preceding note, controls in the Chaos Toolkit are considered optional.