Health monitoring

Health monitoring checks the state of a system and its components at a regular interval, typically every few seconds, to confirm that they behave as expected. A comprehensive health monitoring system lets us keep tabs on overall system health, including CPU and memory utilization. The checks can take the form of simple pings or of more extensive health monitoring endpoints, which emit the health status of a service along with useful metadata at that point in time.
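As a rough illustration of what such an endpoint might return, the sketch below builds a health report from a set of component checks. The function name `build_health_report`, the field names, and the `"up"`/`"down"` labels are assumptions made for this example, not part of any standard; a real service would expose the resulting dictionary over HTTP and include whatever metadata its operators need.

```python
import json
import time
from typing import Callable, Dict

# Recorded once at import time so the report can include uptime.
START_TIME = time.monotonic()


def build_health_report(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each component check and summarize the results.

    `checks` maps a component name to a zero-argument callable that
    returns True when the component is healthy (hypothetical contract).
    """
    components = {}
    for name, check in checks.items():
        try:
            components[name] = "up" if check() else "down"
        except Exception:
            # A check that raises is treated as a failed component.
            components[name] = "down"

    healthy = all(state == "up" for state in components.values())
    return {
        "status": "healthy" if healthy else "unhealthy",
        "timestamp": time.time(),  # when the report was produced
        "uptime_seconds": round(time.monotonic() - START_TIME, 3),
        "components": components,
    }


if __name__ == "__main__":
    # Stand-in checks; real ones would ping a database, a cache, and so on.
    report = build_health_report({"database": lambda: True, "cache": lambda: False})
    print(json.dumps(report, indent=2))
```

A monitoring agent could poll this report every few seconds and treat anything other than a "healthy" status as a signal to investigate.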

For health monitoring, we can use the rate of request failures and successes. We can also use techniques such as synthetic user monitoring, which we cover a little later in this chapter.

The metrics for health monitoring are evaluated against threshold values for success or failure rates. If a metric crosses its configured threshold, an alert is triggered. The breach may also trigger an automated action to maintain the health of the system, such as restarting the service that is in a failure state or provisioning additional server resources.
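The threshold check described above can be sketched as follows. This is a minimal example under stated assumptions: the class name `FailureRateMonitor`, the sliding window of recent requests, and the `on_alert` callback are all hypothetical choices for illustration. In practice the callback is where an alert would be sent or a restart requested.

```python
from collections import deque
from typing import Callable, Deque


class FailureRateMonitor:
    """Track recent request outcomes and alert when failures exceed a threshold."""

    def __init__(self, threshold: float, window: int,
                 on_alert: Callable[[float], None]) -> None:
        self.threshold = threshold  # e.g. 0.5 means "more than 50% failures"
        # Only the most recent `window` outcomes are kept.
        self.results: Deque[bool] = deque(maxlen=window)
        self.on_alert = on_alert    # hypothetical hook: page someone, restart, ...

    def record(self, success: bool) -> float:
        """Record one request outcome and return the current failure rate."""
        self.results.append(success)
        rate = self.results.count(False) / len(self.results)
        if rate > self.threshold:
            # Fires on every call while the rate stays above the threshold;
            # a production system would usually debounce this.
            self.on_alert(rate)
        return rate
```

For instance, `FailureRateMonitor(threshold=0.5, window=100, on_alert=restart_service)` would call a user-supplied `restart_service` function whenever more than half of the last 100 recorded requests failed.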