Using Prometheus as an end-to-end solution for alerting

By instrumenting our applications and deploying the necessary infrastructure for scraping metrics, we now have the means for evaluating the SLIs for each of our services. Once we define a suitable set of SLOs for each of the SLIs, the next item on our checklist is to deploy an alerting system so that we are automatically notified whenever one of our SLOs is no longer being met.

A typical alert specification looks like this:

When the value of metric X exceeds threshold Y for Z time units, then execute actions a1, a2, ..., an

What is the first thought that springs to mind when you hear a fire alarm going off? Most people will probably answer something along the lines of "there might be a fire nearby." People are naturally conditioned to assume that alerts are always temporally correlated with an issue that must be addressed immediately.

When it comes to monitoring the health of production systems, having alerts in place that require the immediate intervention of a human operator once they trigger is standard operating procedure. However, this is not the only type of alert that an SRE might encounter when working on such a system. Oftentimes, being able to proactively detect and address issues before they get out of hand and threaten the stability of production systems is the only thing that stands between a peaceful night's sleep and that dreaded 2 AM page.

Here is an example of a proactive alert: an SRE sets up an alert that fires once the disk usage on a database node exceeds 80% of the available storage capacity. Note that when the alert does fire, the database is still working without any issue. However, in this case, the SRE is given ample time to plan and execute the steps required (for example, scheduling a maintenance window to resize the disk assigned to the DB) to rectify the issue with the minimum possible disruption to the database service.

Contrast the preceding case with a different scenario where the SRE is paged because the database has run out of space and, as a result, several services with a downstream dependency on the database are now offline. This is a particularly stressful situation for an SRE to be in, as the system is already experiencing downtime.
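The difference between the two scenarios can be sketched in a few lines of Go. The snippet below is a hedged illustration rather than a recommended implementation: the Severity type and the classifyDiskUsage function are hypothetical names made up for this example. The 80% threshold comes from the proactive example above, and treating a completely full disk as the point where someone gets paged is an assumption used to mirror the second scenario.

```go
package main

// Severity is a hypothetical classification of a disk usage reading.
type Severity int

const (
	OK       Severity = iota // nothing to do
	Warning                  // proactive alert: plan a fix, no outage yet
	Critical                 // reactive alert: page the on-call SRE
)

// classifyDiskUsage maps a disk usage reading to a severity level.
// Assumptions: usage above 80% of capacity triggers the proactive alert,
// a full disk triggers a page, and an unknown (zero) capacity is treated
// as critical because the reading cannot be trusted.
func classifyDiskUsage(usedBytes, totalBytes uint64) Severity {
	if totalBytes == 0 {
		return Critical
	}
	ratio := float64(usedBytes) / float64(totalBytes)
	switch {
	case ratio >= 1.0:
		return Critical
	case ratio > 0.80:
		return Warning
	default:
		return OK
	}
}
```

Under these assumptions, a node that has used 850 GB of a 1,000 GB disk is classified as Warning, which leaves time to schedule a resize, whereas a node with no space left is classified as Critical.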