Rollback

Sometimes, things don't go as expected. A last-minute fix in an application release inadvertently introduces a new bug, or the new version significantly decreases the throughput of the component, and so on. In such cases, we need to have a plan B, which in most cases means the ability to roll back the update to the previous good version.

As with the update, the rollback has to happen in such a way that it does not cause any outage of the application; it needs to cause zero downtime. In that sense, a rollback can be looked at as a reverse update: we are installing a new version, yet this new version is actually the previous one.

As with the update behavior, we can declare, either in our stack files or in the docker service create command, how the system should behave if it needs to execute a rollback. Here is the stack file that we used before, but this time with some rollback-relevant attributes added:

version: "3.5"
services:
  web:
    image: nginx:1.12-alpine
    ports:
      - 80:80
    deploy:
      replicas: 10
      update_config:
        parallelism: 2
        delay: 10s
        failure_action: rollback
        monitor: 10s
    healthcheck:
      test: ["CMD", "wget", "-qO", "-", "http://localhost"]
      interval: 2s
      timeout: 2s
      retries: 3
      start_period: 2s
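
The same rollback behavior can also be declared on the command line instead of in a stack file. The following is only a rough sketch of an equivalent docker service create call; the service name web and the flag values simply mirror the stack file above:

$ docker service create --name web \
    --replicas 10 \
    --publish 80:80 \
    --update-parallelism 2 \
    --update-delay 10s \
    --update-failure-action rollback \
    --update-monitor 10s \
    --health-cmd "wget -qO - http://localhost" \
    --health-interval 2s \
    --health-timeout 2s \
    --health-retries 3 \
    --health-start-period 2s \
    nginx:1.12-alpine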

In this stack file, which is available in our lab as stack-rollback.yaml, we have defined the details of the rolling update, the health checks, and the behavior during rollback. The health check is defined so that, after an initial wait time of 2 seconds, the orchestrator starts to poll the service on http://localhost every 2 seconds, and it retries 3 times before it considers a task unhealthy. If we do the math, it thus takes at least 8 seconds until a task is stopped if it is unhealthy due to a bug.

Under update_config, we now have a new entry, monitor. This entry defines how long newly deployed tasks should be monitored for health, and it serves as the decision point for whether or not to continue with the next batch of the rolling update. Here, in this sample, we have given it 10 seconds. This is slightly more than the 8 seconds we calculated it takes to discover that a defective service has been deployed, so this is good.
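
To see this monitoring window in action, we can deploy the stack and watch how the orchestrator replaces the tasks batch by batch. A quick sketch, assuming we call the stack web (the stack name is our choice):

$ docker stack deploy -c stack-rollback.yaml web
$ docker stack ps web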

We also have a new entry, failure_action, which defines what the orchestrator will do if it encounters a failure during the rolling update, such as the service being unhealthy. By default, the action is to just stop the whole update process and leave the system in an intermediate state. The system is not down, since it is a rolling update and at least some healthy instances of the service are still operational, but an operations engineer had better take a look and fix the problem.
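
If we want to find out how an update ended, whether it completed, was paused, or was rolled back, we can query the update status of the service. A small sketch, assuming the stack was deployed under the name web so that the fully qualified service name is web_web:

$ docker service inspect --format '{{json .UpdateStatus}}' web_web

The output should contain a state such as paused or rollback_completed, together with a short message explaining what happened.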

In our case, we have defined the action to be rollback. Thus, in case of failure, SwarmKit will automatically revert all tasks that have been updated back to their previous version.
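
Rollbacks do not have to be triggered automatically, either. If a release only turns out to be bad after the update has completed, we can revert the service to its previous specification by hand. A sketch, again assuming the service is called web_web:

$ docker service update --rollback web_web

SwarmKit then replaces the tasks with their previous definition in a rolling fashion, so the service remains available during the rollback as well.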