Monitoring node health

In this recipe, we will create a DaemonSet in the Kubernetes cluster to monitor node health. The node problem detector collects node problems from various daemons and reports them to the API server as NodeConditions and Events.

From the /src/chapter8 folder, first inspect the content of the debug/node-problem-detector.yaml file, and then create the DaemonSet that runs the node problem detector:

$ cat debug/node-problem-detector.yaml
$ kubectl apply -f debug/node-problem-detector.yaml
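Before moving on, it can help to confirm that the DaemonSet has a ready pod on every node it was scheduled to. The following is a minimal sketch, not part of the recipe's repository. The name node-problem-detector and the kube-system namespace are assumptions; adjust them to match the metadata in your copy of debug/node-problem-detector.yaml:

```shell
# Assumed defaults: change these to the metadata.name and metadata.namespace
# used in debug/node-problem-detector.yaml.
NPD_NAME="${NPD_NAME:-node-problem-detector}"
NPD_NAMESPACE="${NPD_NAMESPACE:-kube-system}"

# Print "<desired> <ready>" pod counts for the DaemonSet.
npd_counts() {
  kubectl -n "$NPD_NAMESPACE" get daemonset "$NPD_NAME" \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}'
}

# Read "<desired> <ready>" on stdin; succeed only if both counts are
# equal and greater than zero.
all_ready() {
  awk 'NR == 1 { ok = ($1 > 0 && $1 == $2) } END { exit !ok }'
}
```

After sourcing these functions, running npd_counts | all_ready && echo "detector ready on all scheduled nodes" prints the message only when every scheduled pod is ready.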

Get a list of the nodes in the cluster. This command will return both worker and master nodes:

$ kubectl get nodes
NAME                          STATUS ROLES  AGE   VERSION
ip-172-20-32-169.ec2.internal Ready  node   6d23h v1.14.6
ip-172-20-37-106.ec2.internal Ready  node   6d23h v1.14.6
ip-172-20-48-49.ec2.internal  Ready  master 6d23h v1.14.6
ip-172-20-58-155.ec2.internal Ready  node   6d23h v1.14.6

Describe a node by replacing the node name in the following command with one of your own node names. In the output, examine the Conditions section for error messages. Here's an example of the output:

$ kubectl describe node ip-172-20-32-169.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
 Type Status LastHeartbeatTime LastTransitionTime Reason Message
 ---- ------ ----------------- ------------------ ------ -------
 NetworkUnavailable False Sat, 12 Oct 2019 00:06:46 +0000 Sat, 12 Oct 2019 00:06:46 +0000 RouteCreated RouteController created a route
 MemoryPressure False Fri, 18 Oct 2019 23:43:37 +0000 Sat, 12 Oct 2019 00:06:37 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
 DiskPressure False Fri, 18 Oct 2019 23:43:37 +0000 Sat, 12 Oct 2019 00:06:37 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
 PIDPressure False Fri, 18 Oct 2019 23:43:37 +0000 Sat, 12 Oct 2019 00:06:37 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
 Ready True Fri, 18 Oct 2019 23:43:37 +0000 Sat, 12 Oct 2019 00:06:37 +0000 KubeletReady kubelet is posting ready status
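Reading the Conditions table node by node gets tedious on larger clusters. As a rough sketch, the helper functions below (hypothetical names, not part of the recipe) print every condition for every node and keep only the rows that look unhealthy. The filter assumes that Ready should be True and that every other condition type should be False, which holds for the conditions shown above but may not hold for custom ones:

```shell
# Print "<node> <type> <status>" for every condition on every node.
list_conditions() {
  for node in $(kubectl get nodes -o name); do
    kubectl get "$node" \
      -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' |
      sed "s|^|$node |"
  done
}

# Keep only the rows that look unhealthy. Assumption: Ready should be True
# and every other condition type should be False.
flag_unhealthy() {
  awk '($2 == "Ready" && $3 != "True") || ($2 != "Ready" && $3 != "False")'
}
```

Running list_conditions | flag_unhealthy prints nothing when all nodes report healthy conditions.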

Additionally, you can check an individual condition, such as KernelDeadlock, MemoryPressure, or DiskPressure, by replacing the search term at the end of the command with the condition name. Here is an example for KernelDeadlock:

$ kubectl get node ip-172-20-32-169.ec2.internal -o yaml | grep -B5 KernelDeadlock
  - lastHeartbeatTime: "2019-10-18T23:58:53Z"
    lastTransitionTime: "2019-10-18T23:49:46Z"
    message: kernel has no deadlock
    reason: KernelHasNoDeadlock
    status: "False"
    type: KernelDeadlock

The Node Problem Detector can detect unresponsive runtime daemons; hardware issues such as bad CPU, memory, or disk; kernel issues such as kernel deadlocks and corrupted filesystems; and infrastructure daemon issues such as NTP service outages.
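Since the detector also reports problems as Events, it can be useful to list the events attached to Node objects. The following is a minimal sketch using a field selector. The function name node_events is hypothetical, and the events it returns are not limited to those from the node problem detector:

```shell
# List events attached to Node objects; pass a node name to narrow the search.
node_events() {
  selector="involvedObject.kind=Node"
  if [ -n "${1:-}" ]; then
    selector="$selector,involvedObject.name=$1"
  fi
  kubectl get events --all-namespaces --field-selector "$selector"
}
```

Calling node_events with no argument lists events for all nodes, while node_events ip-172-20-32-169.ec2.internal restricts the list to a single node.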