Once we manage to create flows that allow us to deploy new releases without downtime and, at the same time, reconfigure all dependent services, we can move forward and solve the problem of self-adaptation applied to services. The goal is to create a system that would scale (and de-scale) services depending on metrics. That way, our services can operate efficiently no matter the changes imposed from outside. For example, we could increase the number of replicas if response times of a predefined percentile are too high.
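The scaling rule described above can be sketched in a few lines. This is only an illustration, not the setup used in the book: the function names, the thresholds (`upper`, `lower`), and the replica limits are assumptions chosen for the example.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def desired_replicas(current, response_times, pct=99,
                     upper=0.5, lower=0.1,
                     min_replicas=1, max_replicas=10):
    """Suggest a replica count from observed response times (in seconds).

    All thresholds and limits are hypothetical defaults for illustration.
    """
    observed = percentile(response_times, pct)
    if observed > upper:
        # Too slow: add one replica, but never exceed the maximum.
        return min(current + 1, max_replicas)
    if observed < lower:
        # Faster than needed: remove one replica, but keep the minimum.
        return max(current - 1, min_replicas)
    return current
```

Changing the count by one replica at a time, and keeping it between a minimum and a maximum, prevents a single noisy measurement from causing a drastic change.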
Prometheus (https://prometheus.io/) periodically scrapes metrics both from generic exporters and from our services. We accomplished the latter by instrumenting them. Exporters are useful for global metrics like those generated by containers (for example, cAdvisor (https://github.com/google/cadvisor)) or nodes (for example, Node exporter (https://github.com/prometheus/node_exporter)). Instrumentation, on the other hand, is useful when we want more detailed metrics specific to our service (for example, the response time of a specific function).
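To make the idea of instrumentation more concrete, here is a minimal sketch that times a single function and renders the result as a histogram in the Prometheus text exposition format. It relies only on the standard library so that the mechanics stay visible. The `Histogram` class, the metric name, and the bucket boundaries are assumptions made for this example; a real service would normally use one of the official Prometheus client libraries instead.

```python
import time
from functools import wraps


class Histogram:
    """A tiny, illustrative histogram with cumulative buckets (hypothetical helper)."""

    def __init__(self, name, buckets=(0.1, 0.5, 1.0)):
        self.name = name
        self.buckets = sorted(buckets)
        self.counts = [0] * len(self.buckets)
        self.total = 0
        self.sum = 0.0

    def observe(self, seconds):
        """Record one observation in every bucket whose upper bound covers it."""
        self.total += 1
        self.sum += seconds
        for i, bound in enumerate(self.buckets):
            if seconds <= bound:
                self.counts[i] += 1

    def time(self, func):
        """Decorator that records how long each call to func takes."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                self.observe(time.perf_counter() - start)
        return wrapper

    def render(self):
        """Render the histogram in the Prometheus text exposition format."""
        lines = [f"# TYPE {self.name} histogram"]
        for bound, count in zip(self.buckets, self.counts):
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {count}')
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {self.total}')
        lines.append(f"{self.name}_sum {self.sum}")
        lines.append(f"{self.name}_count {self.total}")
        return "\n".join(lines)
```

Decorating a function with `@histogram.time` (where `histogram` is an instance of the class above) records the duration of every call. The text returned by `render()` is what a metrics endpoint would serve for Prometheus to scrape.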
We configured Prometheus (through Docker Flow Monitor (DFM) (http://monitor.dockerflow.com/)) not only to scrape metrics from exporters and instrumented services but also to evaluate alerts that are fired to Alertmanager (https://github.com/prometheus/alertmanager). Alertmanager, in turn, filters fired alerts and sends notifications to other parts of the system (internal or external).
When possible, alert notifications should be sent to one or more services that will "correct" the state of the cluster automatically. For example, an alert notification that was fired because response times of a service are too long should result in scaling of that service. Such an action is relatively easy to script. It is a repeatable operation that can be easily executed by a machine, so performing it manually is a waste of human time. We used Jenkins as a tool that allows us to perform tasks like scaling (up or down).
Alert notifications should be sent to humans only if they are a result of an unpredictable situation. Alerts based on conditions that never happened before are good candidates for human intervention. We're good at solving unexpected issues; machines are good at repeatable tasks. Still, even in those never-seen-before cases, we (humans) should not only solve the problem, but also create a script that will repeat the same steps the next time the same issue occurs. Once an alert has resulted in a notification to a human, it should be converted into a notification to a machine that will employ the same steps we did previously. In other words, solve the problem yourself when it happens the first time, and let the machines repeat the solution if it happens again. Throughout the book, we used Slack as a notification engine to humans, and Jenkins as a machine receptor of those notifications.
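The split between machine and human receivers can be sketched as a small routing function. The example assumes a payload shaped like an Alertmanager webhook notification (a list of `alerts`, each with `labels` that include `alertname`). The alert names, the `service` label, the job name, and the `REMEDIATIONS` registry are hypothetical; they are not the configuration used in the book.

```python
# Hypothetical registry of alerts we already know how to fix automatically.
# Each entry names an automated job and how much to change the replica count.
REMEDIATIONS = {
    "resp_time_high": {"job": "service-scale", "scale": 1},
    "resp_time_low": {"job": "service-scale", "scale": -1},
}


def route(alert):
    """Decide whether a single alert goes to a machine or to a human."""
    labels = alert.get("labels", {})
    name = labels.get("alertname")
    remedy = REMEDIATIONS.get(name)
    if remedy is None:
        # Never seen before: a human should investigate (and script the fix).
        return {"receiver": "human", "alertname": name}
    # Known problem: hand it to the automated job with the needed parameters.
    return {
        "receiver": "machine",
        "job": remedy["job"],
        "params": {"service": labels.get("service"), "scale": remedy["scale"]},
    }


def route_all(payload):
    """Route every alert contained in a webhook-style payload."""
    return [route(alert) for alert in payload.get("alerts", [])]
```

Once a human has solved a new problem and scripted the solution, adding an entry to the registry is all it takes to send the next occurrence of that alert to a machine.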