Monitoring node health

In this recipe, we will create a DaemonSet in the Kubernetes cluster to monitor node health. The Node Problem Detector collects node problems from various daemons and reports them to the API server as NodeCondition and Event objects:

  1. From the /src/chapter8 folder, first inspect the content of the node-problem-detector.yaml file, then create the DaemonSet that runs the Node Problem Detector:
$ cat debug/node-problem-detector.yaml
$ kubectl apply -f debug/node-problem-detector.yaml
  2. Get a list of the nodes in the cluster. This command returns both worker and master nodes:
$ kubectl get nodes
NAME                            STATUS   ROLES    AGE     VERSION
ip-172-20-32-169.ec2.internal   Ready    node     6d23h   v1.14.6
ip-172-20-37-106.ec2.internal   Ready    node     6d23h   v1.14.6
ip-172-20-48-49.ec2.internal    Ready    master   6d23h   v1.14.6
ip-172-20-58-155.ec2.internal   Ready    node     6d23h   v1.14.6
  3. Describe a node's status by running the following command with one of your own node names in place of the example name. In the output, examine the Conditions section for error messages. Here's an example of the output:
$ kubectl describe node ip-172-20-32-169.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type                Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
----                ------  -----------------                ------------------               ------                      -------
NetworkUnavailable  False   Sat, 12 Oct 2019 00:06:46 +0000  Sat, 12 Oct 2019 00:06:46 +0000  RouteCreated                RouteController created a route
MemoryPressure      False   Fri, 18 Oct 2019 23:43:37 +0000  Sat, 12 Oct 2019 00:06:37 +0000  KubeletHasSufficientMemory  kubelet has sufficient memory available
DiskPressure        False   Fri, 18 Oct 2019 23:43:37 +0000  Sat, 12 Oct 2019 00:06:37 +0000  KubeletHasNoDiskPressure    kubelet has no disk pressure
PIDPressure         False   Fri, 18 Oct 2019 23:43:37 +0000  Sat, 12 Oct 2019 00:06:37 +0000  KubeletHasSufficientPID     kubelet has sufficient PID available
Ready               True    Fri, 18 Oct 2019 23:43:37 +0000  Sat, 12 Oct 2019 00:06:37 +0000  KubeletReady                kubelet is posting ready status

  4. Additionally, you can check for the KernelDeadlock, MemoryPressure, and DiskPressure conditions by replacing the grep pattern at the end of the following command with the condition name. Here is an example for KernelDeadlock:
$ kubectl get node ip-172-20-32-169.ec2.internal -o yaml | grep -B5 KernelDeadlock
- lastHeartbeatTime: "2019-10-18T23:58:53Z"
  lastTransitionTime: "2019-10-18T23:49:46Z"
  message: kernel has no deadlock
  reason: KernelHasNoDeadlock
  status: "False"
  type: KernelDeadlock
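
The condition checks in steps 3 and 4 can also be scripted instead of grepping YAML by hand. The following is a minimal sketch, not part of the recipe's repository: the function names condition_status, node_conditions, and abnormal_conditions are hypothetical helpers, and the sketch assumes that kubectl is already configured for the cluster and that awk is available. It relies on kubectl's JSONPath output to read the condition fields directly.

```shell
# Print the status ("True", "False", or "Unknown") of one condition on one node.
# $1 = node name, $2 = condition type, for example KernelDeadlock.
condition_status() {
  kubectl get node "$1" \
    -o jsonpath="{.status.conditions[?(@.type==\"$2\")].status}"
}

# Print one line per node in the form "<name>:<Type>=<Status>;<Type>=<Status>;".
node_conditions() {
  kubectl get nodes -o jsonpath='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{"\n"}{end}'
}

# Read the node_conditions format on stdin and print "<name> <Type>=<Status>"
# for every condition that is not in its healthy state. This sketch treats
# Ready=True as healthy and expects every other condition type to be False.
abnormal_conditions() {
  awk -F: '{
    n = split($2, conds, ";")
    for (i = 1; i <= n; i++) {
      if (conds[i] == "") continue
      split(conds[i], kv, "=")
      healthy = (kv[1] == "Ready") ? "True" : "False"
      if (kv[2] != healthy) print $1, conds[i]
    }
  }'
}

# Usage:
#   condition_status ip-172-20-32-169.ec2.internal KernelDeadlock
#   node_conditions | abnormal_conditions
```

On a healthy cluster, the second usage line should print nothing. Any line it does print names a node and the condition that needs attention.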

The Node Problem Detector can detect unresponsive runtime daemons; hardware issues such as bad CPU, memory, or disk; kernel issues, including kernel deadlock conditions; corrupted filesystems; and infrastructure daemon issues such as NTP service outages.
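
Because the detector reports problems as Events as well as NodeConditions, it can help to list the Events attached to a node. The following is a minimal sketch under the same assumption of a configured kubectl. The function name node_events is a hypothetical helper, and the Events that actually appear depend on how the detector is configured in your cluster.

```shell
# List the Events whose involved object is the given Node.
# $1 = node name. All namespaces are searched so the lookup does not
# depend on where node Events are stored in a particular cluster.
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1"
}

# Usage:
#   node_events ip-172-20-32-169.ec2.internal
```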