Worker nodes are being destroyed and rebuilt on a regular cadence by TKG


Article ID: 374776


Products

VMware Tanzu Kubernetes Grid Integrated (TKGi)
VMware Tanzu Kubernetes Grid Management
VMware Tanzu Kubernetes Grid Service (TKGs)

Issue/Introduction

  • In TKG, TKGm, or TKGS clusters, unexpected worker node rollouts may be seen when the kubelet process becomes unresponsive on the local worker node.
  • Because Cluster API is built to self-recover cluster components, a worker node whose kubelet is unresponsive for 5 minutes is recreated, and the old node reporting Unknown status is deleted.
  • Because the failed node is deleted and recreated, identifying the root cause of the failure is difficult.
  • This KB presents common causes that can lead to a worker node reporting Unknown status from kubelet.

Cause

Common causes of this issue that have been seen across TKG variants are:

 

  1. Security software locking files or overloading the network on the worker nodes. Common security software includes:
    • Prisma Twistlock-Defender
    • AquaSec Enforcer
  2. Infrastructure failure leading to VMs being powered off or rebooted.
  3. Storage failures on the underlying infrastructure leading to VM filesystem access failures or corruption.

Resolution

Steps to assist with investigation:

 

  1. Enable log streaming from the cluster to a remote syslog collector. This captures application logging that can be used to triangulate the source of failures.
    • Fluent Bit or Dynatrace are useful tools for log streaming (see the Fluent Bit sketch after this list).
  2. Identify workloads in the cluster that are heavy consumers of network, storage I/O, or memory/CPU. Isolate these workloads to a single node using labels on the node (see the node-labeling sketch after this list).
  3. If security software is in use in the cluster, remove it from the single isolated node by using Node Taints and updating the application's deployment manifest with Tolerations, so the isolated workload can still schedule on the tainted node while untolerated pods cannot.
    • See the Kubernetes Taints and Tolerations documentation for details on configuration variables.
    • Example:

      Add a taint to a node:

      kubectl taint nodes <NODE_NAME> <APP_NAME>=true:NoSchedule


      Then add a toleration to the workload deployment manifest:


      tolerations:
        - key: "<APP_NAME>"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
       

  4. Depending on the frequency of the failure, consider pausing the cluster to prevent the node from being automatically recreated (see the sketch after this list). This step should only be performed with a VMware/Broadcom support engineer engaged, as it poses significant risk to both the cluster and the hosting infrastructure.
    • Pausing allows the user to SSH into the node or to gather logging from the node after the failure.
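
Example for step 1 (log streaming):

The following is a minimal sketch of a Fluent Bit configuration that forwards container logs to a remote syslog collector, wrapped in a Kubernetes ConfigMap. The ConfigMap name, namespace, match pattern, and collector address are placeholders; where this output stanza actually belongs (ConfigMap, Tanzu package data values, or a Dynatrace configuration) depends on how log streaming is deployed in the cluster.

apiVersion: v1
kind: ConfigMap
metadata:
  # Hypothetical name/namespace; match the Fluent Bit deployment in use.
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [OUTPUT]
        # Forward Kubernetes container logs to the remote syslog collector.
        Name                syslog
        Match               kube.*
        # Placeholder collector address and port.
        Host                syslog-collector.example.internal
        Port                514
        Mode                udp
        Syslog_Format       rfc5424
        Syslog_Message_Key  log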
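
Example for step 2 (workload isolation with node labels):

A minimal sketch of isolating a heavy workload to one node. The label key/value (workload-isolation=true) and node name are hypothetical placeholders.

Label the node chosen for isolation:

kubectl label nodes <NODE_NAME> workload-isolation=true

Then add a nodeSelector to the workload's deployment manifest (pod spec) so it schedules only on that node:

nodeSelector:
  workload-isolation: "true"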
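
Example for step 4 (pausing the cluster):

Only with a VMware/Broadcom support engineer engaged: for Cluster API managed clusters (TKG/TKGm), reconciliation can typically be paused by setting spec.paused on the Cluster object in the management cluster. The cluster name and namespace below are placeholders, and the exact object and field can differ between TKG variants, so confirm the correct procedure with support before applying it.

Pause reconciliation of the workload cluster:

kubectl patch cluster <CLUSTER_NAME> -n <CLUSTER_NAMESPACE> --type merge -p '{"spec":{"paused":true}}'

Resume reconciliation once the investigation is complete:

kubectl patch cluster <CLUSTER_NAME> -n <CLUSTER_NAMESPACE> --type merge -p '{"spec":{"paused":false}}'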