In TKG clusters (TKGm or TKGS), unexpected worker node rollouts may occur when the kubelet process becomes unresponsive on a worker node.
Because Cluster API is built to self-heal cluster components, a worker node whose kubelet is unresponsive for 5 minutes is recreated, and the old node reporting an Unknown status is deleted.
Because the failed node is deleted during recreation, identifying the root cause of the failure is challenging.
This article presents common causes that can lead to a worker node reporting an Unknown status from kubelet.
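The recreation is driven by Cluster API MachineHealthCheck objects in the management cluster. The unhealthy conditions and timeouts configured for a given cluster can be reviewed there; the commands below are a minimal check run against the management cluster context, with the MachineHealthCheck name and namespace as placeholders:

    # List MachineHealthCheck objects and their current status
    kubectl get machinehealthcheck -A

    # Review the unhealthyConditions and timeouts applied to the worker nodes
    kubectl describe machinehealthcheck <machinehealthcheck-name> -n <namespace>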
Cause
Common causes of this issue observed across TKG variants include:
Security software locking files or overloading the network on the worker nodes. Common examples of such security software include:
Prisma Twistlock-Defender
AquaSec Enforcer
Infrastructure failures leading to VMs being powered off or rebooted.
Storage failures on the underlying infrastructure leading to VM filesystem access failures or corruption.
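When a worker node is reporting an Unknown status, its conditions and recent events can help narrow down which of these causes applies. A minimal check, assuming kubectl access to the affected workload cluster and using a placeholder node name:

    # Confirm which nodes are NotReady/Unknown
    kubectl get nodes -o wide

    # Node conditions typically show "Kubelet stopped posting node status" for this failure
    kubectl describe node <node-name>

    # Recent node-related events (power off, restarts, pressure conditions)
    kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp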
Resolution
Steps to assist with investigation:
Enable log streaming from the cluster to a remote syslog collector. This captures application and system logging that can help triangulate the source of the failure.
Fluent Bit and Dynatrace are useful tools for log streaming.
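As an illustration only, a minimal Fluent Bit output stanza that forwards all captured logs to a remote syslog collector could look like the following; the host, port, mode, and message key are placeholder values that must match your collector and pipeline:

    [OUTPUT]
        # Forward all tags to the remote syslog collector
        Name                 syslog
        Match                *
        Host                 <syslog-collector-address>
        Port                 514
        Mode                 udp
        Syslog_Format        rfc5424
        Syslog_Message_Key   log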
Identify workloads in the cluster that are heavy consumers of network, storage I/O, memory, or CPU. Isolate these workloads to a single node by labeling the node and adding a matching nodeSelector to the workloads.
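A sketch of this step, assuming metrics-server is available for kubectl top and using hypothetical label, node, and deployment names (network and storage I/O usually require node-level or monitoring tooling to quantify):

    # Identify heavy CPU and memory consumers
    kubectl top nodes
    kubectl top pods -A --sort-by=cpu
    kubectl top pods -A --sort-by=memory

    # Label a single worker node and pin the suspect workload to it with a nodeSelector
    kubectl label node <node-name> workload-isolation=enabled
    kubectl patch deployment <deployment-name> -n <namespace> --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"workload-isolation":"enabled"}}}}}'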
If security software is in use in the cluster, remove it from the isolated node by applying a node taint (so the security agent is not scheduled there) and adding a matching toleration to the deployment manifest of the workload under investigation.
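A sketch of the taint and toleration approach, using a hypothetical taint key; note that some security agents tolerate all taints (verify with the vendor), that a NoSchedule taint does not evict agent pods already running on the node, and that a JSON merge patch replaces any existing tolerations on the workload:

    # Keep the security agent off the isolated node
    kubectl taint node <node-name> security-exempt=true:NoSchedule

    # Allow the workload under investigation to keep running on the tainted node
    kubectl patch deployment <deployment-name> -n <namespace> --type merge \
      -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"security-exempt","operator":"Equal","value":"true","effect":"NoSchedule"}]}}}}'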
Depending on the frequency of the failure, consider pausing the cluster to prevent the failed node from being automatically recreated. This step should only be performed with a VMware/Broadcom support engineer engaged, as it poses significant risk not only to the cluster but also to the hosting infrastructure.
Pausing allows the user to SSH into the failed node or to gather logs from it after the failure.
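With support engaged, one way to pause reconciliation is to set the paused flag on the Cluster object in the management cluster, which also stops MachineHealthCheck remediation so the failed node is retained for log collection; the cluster name and namespace are placeholders, and the flag must be reverted once collection is complete:

    # Pause Cluster API reconciliation for the workload cluster (management cluster context)
    kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":true}}'

    # Resume reconciliation after logs have been collected
    kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":false}}'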