Alerting the System

Our alerting is already set up. Alertmanager is configured to send notifications to Slack. While that was a good step forward, it is still far from alerting that serves as the base of a self-adapting and self-healing system. What we have done so far can be considered a fall-back strategy. If the system cannot detect changed conditions and, when needed, adapt or heal itself, notifying humans through Slack is a good solution. In some cases, Slack notifications will be temporary, and we will replace them with requests sent to the system so that it can correct itself. In other situations, the system will not be able to fix itself, so notifications will have to be sent to doctors (us, humans, engineers).

We already built the initial solution for an alerting system. Alertmanager can fulfill some of our needs, but it is not the only tool that can. There is another one that we used throughout the book, even though we never mentioned it in this context. I'm sure that you can guess which one it is. If you can't, I'll leave you in suspense for a while longer.

Before we start building the system that will receive the alerts, we should discuss the types of actions a system might need to perform.