Tanzu Kubernetes Grid cluster Worker Nodes stuck in "Ready,SchedulingDisabled" State

Article ID: 387248


Updated On:

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

An ESXi host in the cluster hosting the TKG worker nodes has failed.

  • TKG worker nodes remain in the "Ready,SchedulingDisabled" state and do not return to Ready, even after a replacement ESXi host is added to the cluster.

Cause

  • The TKG nodes lost access to storage, or the available compute resources became insufficient for the needs of the TKG cluster.
    • If insufficient compute resources are available, existing TKG nodes may not function properly and replacement nodes cannot be deployed.

  • An ESXi host in the TKG cluster does not have access to the datastores (shared storage) that the TKG nodes rely on.
    • If the required storage is not accessible, the TKG node VMs cannot attach the storage they need, just like any other VM. The node status and events can help confirm this, as shown in the example after this list.
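
A minimal way to check the node-side symptoms (assuming kubectl access to the affected workload cluster; the node name below is a placeholder):

  # List nodes and their state; affected workers show "Ready,SchedulingDisabled"
  kubectl get nodes -o wide

  # Inspect conditions and recent events (for example, volume attach failures or resource pressure) on one worker
  kubectl describe node <worker-node-name>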

Resolution

  • If any datastore that the TKG nodes rely on is not listed as accessible for every ESXi host in the TKG cluster, mount it on the missing hosts and confirm they have access (see the datastore check after this procedure).
  • For each affected namespace, do the following:

    1. Please note that the following steps may cause a temporary outage of the applications served by the affected worker nodes. If those applications are still running, perform these steps during a maintenance window.
    2. Edit the cluster to scale the worker nodes down to zero. Take note of the current "replicas" value before changing it (Manually Scale a Cluster Using Kubectl). A command sketch is provided after this procedure.

      • Watch the status of the worker nodes (kubectl get nodes -w) and wait at least 5 minutes. The unhealthy worker nodes should be deleted automatically.

      • If any node is still present after waiting, it may be unable to delete itself; delete the stuck node(s) manually (see the example after this procedure).

    3. Edit the cluster to scale back up to the desired replica count (enter the same number noted in step 2, or the currently desired value), using the same method as the scale-down.
      • While watching the status of the nodes, the new worker nodes should appear and enter the Ready state.
      • Note: Allow several minutes to pass before making any further changes or performing further troubleshooting.
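
Datastore accessibility check (referenced in the first Resolution bullet): accessibility can be confirmed per host in the vSphere Client, or from an ESXi shell with the sketch below. Run it on each host in the cluster and look for the datastores backing the TKG node VMs.

  # List the filesystems mounted on this ESXi host; the required datastores should
  # appear with Mounted = true
  esxcli storage filesystem list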
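
Scale-down and scale-up sketch for steps 2 and 3: the worker replica count is changed by editing the cluster object. The exact resource kind and field depend on the TKG flavor and API version; the example below assumes a Tanzu Kubernetes Grid Service cluster defined as a TanzuKubernetesCluster, with placeholder cluster and namespace names.

  # Open the cluster definition for editing (from the context that manages the cluster)
  kubectl edit tanzukubernetescluster <cluster-name> -n <namespace>

  # In the editor, note the current worker replica count (for example under
  # spec.topology.nodePools[].replicas or, on older API versions, spec.topology.workers.count),
  # set it to 0 for step 2, and save. Repeat the edit for step 3 and restore the original value.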
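
Watching and manually deleting stuck nodes (step 2 bullets): a minimal sketch with a placeholder node name. Deleting the Kubernetes node object is usually sufficient; if a backing machine object also remains, it can be inspected from the management/Supervisor context.

  # Watch the worker nodes while the scale-down proceeds
  kubectl get nodes -w

  # If a node remains after the wait, delete it manually
  kubectl delete node <stuck-worker-node-name>

  # Optional: check for leftover machine objects backing the nodes
  # (run against the management/Supervisor context)
  kubectl get machines -n <namespace>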