We have learned that a swarm service is a description, or manifest, of the desired state in which we want an application or application service to run. Now, let's see how Docker Swarm reconciles this desired state if we do something that causes the actual state of the service to differ from it. The easiest way to do this is to forcibly kill one of the tasks or containers of the service.
Let's do this with the container that has been scheduled on node-1:
$ docker container rm -f sample-stack_whoami.2.n21e7ktyvo4b2sufalk0aibzy
If we do that and then run docker service ps right afterwards, we can see how the swarm has reacted.
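The exact task IDs and node assignments will differ on your setup, but the service name can be read off the task name we just deleted, so the query looks like this:
$ docker service ps sample-stack_whoami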
We see that task 2 failed with exit code 137 and that the swarm immediately reconciled the desired state by rescheduling the failed task on a node with free resources. In this case, the scheduler selected the same node that the failed task had been running on, but this is not always the case. So, without us intervening, the swarm completely fixed the problem, and since the service is running with multiple replicas, at no time was the service down.
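If we want to double-check that the desired state has been fully re-established, we can also look at the replica count of the service; a quick sketch, assuming the same service name as above:
$ docker service ls --filter name=sample-stack_whoami
The REPLICAS column of the output should once again show the desired number of replicas all running.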
Let's try another failure scenario. This time, we're going to shut down an entire node and see how the swarm reacts. Let's take node-2 for this, as it has two tasks (tasks 3 and 4) running on it. To do so, we open a new terminal window and use Docker Machine to stop node-2:
$ docker-machine stop node-2
Back on node-1, we can now again run docker service ps to see what happened:
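As before, we pass the service name to the command; optionally, the --filter flag hides the entries for tasks that are no longer running, which keeps the list readable (a sketch, assuming the same service name):
$ docker service ps sample-stack_whoami --filter desired-state=running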
In the preceding output, we can see that task 3 was immediately rescheduled on node-1, whilst task 4 was rescheduled on node-3. Even this more radical failure is handled gracefully by Docker Swarm.
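We can also confirm the node failure itself from the manager (node-1 in our setup), where node-2 should now be reported with a status of Down:
$ docker node ls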
It is important to note, though, that if node-2 ever comes back online in the swarm, the tasks that were previously running on it will not automatically be transferred back to it; the node is, however, ready to accept new workloads.
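We can verify this behavior ourselves by restarting the node with Docker Machine and listing the tasks again; and if we do want the tasks spread across all nodes once more, forcing an update of the service restarts its tasks, which typically lets the scheduler distribute them over all available nodes again. A sketch, assuming the same service name as above:
$ docker-machine start node-2
$ docker service ps sample-stack_whoami
$ docker service update --force sample-stack_whoami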