In TKG clusters (TKGm or TKGS), unexpected worker node rollouts may occur when the kubelet process becomes unresponsive on a worker node.
Because Cluster API is built to self-heal cluster components, a worker node whose kubelet is unresponsive for 5 minutes is recreated, and the old node reporting an Unknown status is deleted.
Because the failed node is deleted during recreation, identifying the root cause of the failure is challenging.
This article presents common causes that can lead to a worker node reporting an Unknown status from kubelet.
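The recreation is driven by Cluster API MachineHealthCheck objects in the management cluster. The unhealthy conditions and timeouts configured for a given cluster can be reviewed there; the commands below are a minimal check run against the management cluster context, with the MachineHealthCheck name and namespace as placeholders:

    # List MachineHealthCheck objects and their current status
    kubectl get machinehealthcheck -A

    # Review the unhealthyConditions and timeouts applied to the worker nodes
    kubectl describe machinehealthcheck <machinehealthcheck-name> -n <namespace>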
Cause
Common causes of this issue observed across TKG variants include:
Security software locking files or overloading the network on the worker nodes. Common examples of such security software include:
Prisma Twistlock-Defender
AquaSec Enforcer
Infrastructure failures leading to VMs being powered off or rebooted.
Storage failures on the underlying infrastructure leading to VM filesystem access failures or corruption.
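When a worker node is reporting an Unknown status, its conditions and recent events can help narrow down which of these causes applies. A minimal check, assuming kubectl access to the affected workload cluster and using a placeholder node name:

    # Confirm which nodes are NotReady/Unknown
    kubectl get nodes -o wide

    # Node conditions typically show "Kubelet stopped posting node status" for this failure
    kubectl describe node <node-name>

    # Recent node-related events (power off, restarts, pressure conditions)
    kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp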
Resolution
Steps to assist with investigation:
Enable log streaming from the cluster to a remote syslog collector. This captures application and system logging that can help triangulate the source of the failure.
Fluent Bit and Dynatrace are useful tools for log streaming.
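As an illustration only, a minimal Fluent Bit output stanza that forwards all captured logs to a remote syslog collector could look like the following; the host, port, mode, and message key are placeholder values that must match your collector and pipeline:

    [OUTPUT]
        # Forward all tags to the remote syslog collector
        Name                 syslog
        Match                *
        Host                 <syslog-collector-address>
        Port                 514
        Mode                 udp
        Syslog_Format        rfc5424
        Syslog_Message_Key   log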
Identify workloads in the cluster that are heavy consumers of network, storage I/O, memory, or CPU. Isolate these workloads to a single node by labeling the node and adding a matching nodeSelector to the workloads.
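A sketch of this step, assuming metrics-server is available for kubectl top and using hypothetical label, node, and deployment names (network and storage I/O usually require node-level or monitoring tooling to quantify):

    # Identify heavy CPU and memory consumers
    kubectl top nodes
    kubectl top pods -A --sort-by=cpu
    kubectl top pods -A --sort-by=memory

    # Label a single worker node and pin the suspect workload to it with a nodeSelector
    kubectl label node <node-name> workload-isolation=enabled
    kubectl patch deployment <deployment-name> -n <namespace> --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"workload-isolation":"enabled"}}}}}'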
If security software is in use in the cluster, remove it from the isolated node by applying a node taint (so the security agent is not scheduled there) and adding a matching toleration to the deployment manifest of the workload under investigation.
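A sketch of the taint and toleration approach, using a hypothetical taint key; note that some security agents tolerate all taints (verify with the vendor), that a NoSchedule taint does not evict agent pods already running on the node, and that a JSON merge patch replaces any existing tolerations on the workload:

    # Keep the security agent off the isolated node
    kubectl taint node <node-name> security-exempt=true:NoSchedule

    # Allow the workload under investigation to keep running on the tainted node
    kubectl patch deployment <deployment-name> -n <namespace> --type merge \
      -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"security-exempt","operator":"Equal","value":"true","effect":"NoSchedule"}]}}}}'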
Depending on the frequency of the failure, consider pausing the cluster to prevent the failed node from being automatically recreated. This step should only be performed with a VMware/Broadcom support engineer engaged, as it poses significant risk not only to the cluster but also to the hosting infrastructure.
Pausing allows the user to SSH into the failed node or to gather logs from it after the failure.
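With support engaged, one way to pause reconciliation is to set the paused flag on the Cluster object in the management cluster, which also stops MachineHealthCheck remediation so the failed node is retained for log collection; the cluster name and namespace are placeholders, and the flag must be reverted once collection is complete:

    # Pause Cluster API reconciliation for the workload cluster (management cluster context)
    kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":true}}'

    # Resume reconciliation after logs have been collected
    kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":false}}'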