To make informed decisions, for example during a rolling update of a swarm service, whether the just-installed batch of new service instances is running fine or whether a rollback is needed, SwarmKit needs a way to know about the overall health of the system. On its own, SwarmKit (and Docker) can collect quite a bit of information, but there is a limit. Imagine a container containing an application. Seen from the outside, the container can look absolutely healthy and chug along just fine. But that doesn't necessarily mean that the application running inside the container is also doing well. The application could, for example, be stuck in an infinite loop or be in a corrupt state, yet still be running. And as long as the application runs, the container runs, and from the outside everything looks perfect.
Thus, SwarmKit provides a seam where we can give it some help. We, the authors of the application services running inside the containers in the swarm, know best whether or not our service is in a healthy state. SwarmKit gives us the opportunity to define a command that is executed against our application service to test its health. What exactly this command does is not important to the swarm; the command just needs to return OK, return NOT OK, or time out. The latter two situations, namely NOT OK or a timeout, tell SwarmKit that the task it is investigating is potentially unhealthy. I am writing potentially on purpose here, and later we will see why:
FROM alpine:3.6
...
HEALTHCHECK --interval=30s \
    --timeout=10s \
    --retries=3 \
    --start-period=60s \
    CMD curl -f http://localhost:3000/health || exit 1
...
In the preceding snippet from a Dockerfile, we see the keyword HEALTHCHECK. It has a few options or parameters and an actual command CMD. Let's first discuss the options:
- --interval defines the wait time between health checks. Thus, in our case the orchestrator executes a check every 30 seconds.
- The --timeout parameter defines how long Docker waits for the health check to respond before timing out with an error. In our sample, this is 10 seconds. If a health check fails, SwarmKit retries a couple of times before it gives up, declares the corresponding task unhealthy, and opens the door for Docker to kill this task and replace it with a new instance.
- The number of retries is defined with the parameter --retries. In the preceding code, we want to have three retries.
- Next, we have the start period. Some containers need time to start up (not that this is a recommended pattern, but sometimes it is inevitable). During this start-up time, the service instance might not yet be able to respond to health checks. With the --start-period parameter, we define how long SwarmKit should wait before executing the very first health check, thus giving the application time to initialize. In our case, the first check happens after 60 seconds. How long this start period needs to be depends entirely on the application and its start-up behavior. The recommendation is to start with a relatively low value; if you see a lot of false positives and tasks that are restarted many times, you might want to increase it.
- Finally, on the last line, we define the actual probing command with the CMD keyword. In our case, we define a request to the /health endpoint of localhost at port 3000 as the probing command. This call can have three possible outcomes:
- The command succeeds
- The command fails
- The command times out
The latter two are treated the same way by SwarmKit: they are an indication to the orchestrator that the corresponding task might be unhealthy. I did say might with intent, since SwarmKit does not immediately assume the worst-case scenario but assumes that this might just be a temporary fluke of the task and that it will recover from it. This is the reason why we have the --retries parameter: there, we can define how many times SwarmKit should retry before it assumes that the task is indeed unhealthy, kills it, and reschedules another instance of this task on another free node to reconcile the desired state of the service.
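While a task is running, we can look at what the Docker engine has recorded about its health so far. The following is a minimal sketch; <containerID> is a placeholder for the ID of the container backing the task, and the commands are run on the node hosting that container:
# full health record: current status, failing streak, and a log of recent probes
$ docker container inspect --format '{{json .State.Health}}' <containerID>
# just the current status: starting, healthy, or unhealthy
$ docker container inspect --format '{{.State.Health.Status}}' <containerID>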
Why can we use localhost in our probing command? This is a very good question, and the reason is that SwarmKit, when probing a container running in the swarm, executes the probing command inside the container (that is, it does something like docker container exec <containerID> <probing command>). Thus, the command executes in the same network namespace as the application running inside the container. In the following diagram, we see the life cycle of a service task from its beginning:
First, SwarmKit waits with probing until the start period is over. Then, we have a first health check. Shortly thereafter, the task fails when probed. It fails two consecutive times but then it recovers. Thus, health check number 4 is again successful and SwarmKit leaves the task running.
Here, we see a task that is permanently failing:
If the task does not recover after the defined number of retries (three in our case), SwarmKit first sends a SIGTERM to the container of the task; if the container has not stopped within 10 seconds, SwarmKit sends a SIGKILL signal.
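As a side note, this SIGTERM-then-SIGKILL escalation is the same pattern that a plain docker container stop uses, so we can observe it outside a swarm as well. A minimal sketch, with sleeper as an arbitrary container name and alpine as an arbitrary image:
# start a throwaway container whose main process is a simple sleep
$ docker container run -d --name sleeper alpine sleep 1000
# SIGTERM is sent first; since the sleep process does not react to it, SIGKILL
# follows once the 10-second grace period has expired
$ docker container stop --time 10 sleeper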
We have just learned how we can define a health check for a service in the Dockerfile of its image. But this is not the only way. We can also define the health check in a stack file that we use to deploy our application into a Docker Swarm. Here is a short snippet of what such a stack file would look like:
version: "3.5"
services:
web:
image: example/web:1.0
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
...
In this snippet, we see how the health check-related information is defined in the stack file. First and foremost, it is important to realize that we have to define a health check for every service individually. There is no health check on an application or global level.
Similar to what we defined previously in the Dockerfile, the command that SwarmKit uses to execute the health check is curl -f http://localhost:3000/health. We also have definitions for interval, timeout, retries, and start_period. These four key-value pairs have the same meaning as the corresponding parameters we used in the Dockerfile. If there are health check-related settings defined in the image, then the ones defined in the stack file override the ones from the Dockerfile.
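If we are ever unsure which settings end up being effective, we can compare the health check stored in the image with the one of the running container. Here is a minimal sketch, assuming the example/web:1.0 image from the snippet above and using <containerID> as a placeholder:
# health check as baked into the image by its Dockerfile (null if there is none)
$ docker image inspect --format '{{json .Config.Healthcheck}}' example/web:1.0
# health check in effect for the running container, including stack file overrides
$ docker container inspect --format '{{json .Config.Healthcheck}}' <containerID>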
Now, let's try to use a service that has a health check defined. In our lab folder, we have a file called stack-health.yaml with the following content:
version: "3.5"
services:
web:
image: nginx:alpine
healthcheck:
test: ["CMD", "wget", "-qO", "-", "http://localhost"]
interval: 5s
timeout: 2s
retries: 3
start_period: 15s
Let's deploy this stack now:
$ docker stack deploy -c stack-health.yaml myapp
We can find out where the single task was deployed by using docker stack ps myapp. On that particular node, we can list all containers to find the one belonging to our stack. In my example, the task had been deployed to node-3:
The interesting thing in this screenshot is the STATUS column. Docker, or more precisely SwarmKit, has recognized that the service has a health check function defined and is using it to determine the health of each task of the service.
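We can also confirm this from the command line on that node. A minimal sketch, assuming the default <stack>_<service> container naming that the swarm uses:
# list only the containers of our web service; during the start period the
# STATUS column shows (health: starting), afterwards (healthy) or (unhealthy)
$ docker container ls --filter "name=myapp_web"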