vSphere Supervisor Workload Cluster Upgrade Stuck due to Node Stuck Deleting caused by PodDisruptionBudget (PDB)

Article ID: 345904


Updated On:

Products

VMware vSphere Kubernetes Service Tanzu Kubernetes Runtime

Issue/Introduction

In a vSphere Supervisor environment, a workload cluster upgrade or rolling redeployment is stuck because one or more nodes are stuck in the Deleting state.

This KB article assumes that the system infrastructure is healthy, but that there are pod disruption budgets (PDBs) within the affected workload cluster preventing the system from gracefully draining and then deleting the stuck nodes.

For more information on Pod Disruption Budgets, please see official Kubernetes documentation: https://kubernetes.io/docs/tasks/run-application/configure-pdb/

IMPORTANT: Do not manually delete a node that is stuck in Deleting state.

  • Proper troubleshooting should focus on helping the system gracefully drain non-system pods as part of the deletion process.
  • Manually deleting a node can cause additional problems, such as volumes failing to detach and reattach, or other issues that prevent the upgrade from progressing.
  • If the upgrade process has not yet reached the worker nodes, deleting a worker node will not progress the upgrade.
    • Upgrade logic begins with the control plane nodes. Worker nodes are not upgraded until all control plane nodes have upgraded successfully.
    • Manual deletion is more likely to cause the worker node to be recreated on the same older version, or to be created on the newer version but fail to start due to an image mismatch.

 

While connected to the Supervisor cluster context, the following symptoms are observed:

  • One or more worker nodes in the affected workload cluster are stuck in Deleting state:
    kubectl get machine -n <workload cluster namespace>
    
    NAMESPACE                    NAME               CLUSTER                 NODENAME           PROVIDERID             PHASE
    <workload cluster namespace> <worker node name> <workload cluster name> <worker node name> vsphere://<providerID> Deleting
  • Cluster API (CAPI) logs in the Supervisor cluster show errors similar to the below:
    • Locate the CAPI pods:
      kubectl get pods -A | grep cap
    • Check the CAPI logs for error messages similar to the following:
      kubectl logs deployment/capi-controller-manager -n <capi namespace> -c manager | grep -i "evict"
      
      kubectl logs deployment/capw-controller-manager -n <capi namespace> -c manager | grep -i "evict"
      
      
      machine_controller.go:751] evicting pod <namespace>/<pod within workload cluster>
      machine_controller.go:751] error when evicting pods/"<pod within workload cluster>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    • The above error messages indicate that the system is actively trying to drain the above pod off the deleting node in the workload cluster, but it cannot due to the associated pod disruption budget (PDB).

 

While connected to the affected Workload Cluster's context, the following symptoms are present:

  • One or more worker nodes show Ready,SchedulingDisabled state:
    kubectl get nodes
    
    NAME            STATUS
    <worker node>   Ready,SchedulingDisabled
    • A node shows SchedulingDisabled when it has been cordoned. A cordoned node does not accept any new pods being scheduled onto it. The system automatically cordons a node that it is draining before deletion.

    • IMPORTANT: Although the purpose of draining is to move pods off of the node onto another healthy node in the cluster, system pods such as the container network interface (CNI) pods (Antrea or Calico) and the vsphere-csi pods must remain on the deleting node so the system can finish its draining and deletion process. The vsphere-csi system pod is required to detach volumes from the draining node.

  • There are one or more pod disruption budgets (pdb) with 0 Allowed Disruptions in the vSphere workload cluster:
    kubectl get pdb -A
    NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
    <pdb-name>   #               #                 0
    • The Allowed Disruptions value of a PDB is calculated from the minAvailable or maxUnavailable value configured in the PDB spec and the current number of healthy replicas of the corresponding pod.

 

  • The pod associated with the above PDB is still present on the node in SchedulingDisabled state, even though the system is expected to drain this pod off the deleting node:
    kubectl get pods -A -o wide | grep <deleting node in SchedulingDisabled state>
    
    NAMESPACE         NAME                         READY
    <pod namespace>   <pod associated with PDB>    #/#
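  • To confirm which PDB is blocking the drain, the PDB's label selector can be compared against the labels of the pod still running on the draining node. A minimal sketch, using the same placeholder names as above:
    # Show the label selector configured on the PDB
    kubectl get pdb <pdb-name> -n <pod namespace> -o jsonpath='{.spec.selector}{"\n"}'
    
    # Show the labels on the pod that is still present on the draining node
    kubectl get pod <pod associated with PDB> -n <pod namespace> --show-labels
    • If the PDB's selector matches the pod's labels, that PDB is the one preventing the pod from being evicted.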

Environment

vSphere Supervisor
 
This issue can occur on a workload cluster regardless of whether it is managed by Tanzu Mission Control (TMC).

Cause

The error "Cannot evict pod as it would violate the pod's disruption budget." indicates that there is a PodDisruptionBudget (PDB) applied to a workload/pod in the workload cluster which is blocking the node from draining and gracefully deleting.

When a Machine object is marked for replacement, the controllers automatically cordon and then drain the corresponding Kubernetes node, evicting the workloads/pods onto other available nodes/Machines.

If a PodDisruptionBudget (PDB) associated with one of those workloads/pods has zero Allowed Disruptions, it will block the node's draining and leave the node in Ready,SchedulingDisabled status, because the PDB is configured to prevent more than a certain number of replicas of that pod from being down at any time.

PDBs are more often associated with worker nodes in a workload cluster than with control plane nodes, because applications that require PDBs are expected to run only on the worker nodes. As a result, a cluster upgrade often becomes stuck due to PDBs after all control plane nodes have successfully upgraded to the desired version and the first new worker node has been created on the desired version. Upgrade logic begins with the control plane nodes; worker nodes are not upgraded until all control plane nodes have upgraded successfully.

 

Min Available PDB Example:

  • The PDB is configured so that a minimum of 1 replica of its corresponding pod is always Running in the cluster.
  • The deployment managing the corresponding pod is configured with 1 replica in total (1/1 ready).

In this scenario, only 1 replica of the pod is expected in the cluster. However, the PDB will never allow that replica to be brought down, because doing so would violate the requirement of always having 1 replica of this pod running. This causes a rolling redeployment or cluster upgrade to become stuck, as the PDB prevents this single replica from being drained/moved to another node in the cluster.

kubectl get pdb -n <pod namespace>

NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
<pdb name>   1               N/A               0


kubectl get deployment -n <pod namespace>
NAME                READY   UP-TO-DATE   AVAILABLE
<deployment name>   1/1     1            1
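
For reference, a minAvailable PDB in this scenario might look similar to the following YAML. This is a minimal sketch with placeholder names and labels; depending on the cluster's Kubernetes version, the apiVersion may be policy/v1 or policy/v1beta1.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <pdb name>
  namespace: <pod namespace>
spec:
  # Require at least 1 replica of the matching pod to remain Running at all times.
  # With a 1-replica deployment, this results in 0 Allowed Disruptions.
  minAvailable: 1
  selector:
    matchLabels:
      app: <label of the corresponding pod>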

 

Max Unavailable PDB Example:

  • The PDB is configured so that at most 1 replica of its corresponding pod can be unavailable or down at any time.
  • The deployment managing the corresponding pod is 2/3 ready, indicating that 1 replica of this pod is currently unavailable or unhealthy.

In this scenario, although the PDB allows the corresponding pod to have one missing replica, it will not allow more than one replica to be down at the same time.

Until the missing replica comes back up on another healthy node in the cluster, this PDB will prevent any other pod replicas from being drained from any deleting/draining nodes.

During this time, a rolling redeployment or cluster upgrade can become stuck until enough pod replicas are Running in the cluster again.

kubectl get pdb -n <pod namespace>

NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
<pdb name>   N/A             1                 0


kubectl get deployment -n <pod namespace>
NAME                READY   UP-TO-DATE   AVAILABLE
<deployment name>   2/3     2            2
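
Similarly, a maxUnavailable PDB for this scenario might look like the following sketch (placeholder names and labels; the apiVersion may vary with the Kubernetes version):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <pdb name>
  namespace: <pod namespace>
spec:
  # Allow at most 1 replica of the matching pod to be unavailable at any time.
  # Because 1 replica is already down (2/3 ready), Allowed Disruptions is 0.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: <label of the corresponding pod>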

Note: The above PDB examples assume deployments are being used for the corresponding pod replicas. There are many Kubernetes objects that can be used to manage and deploy replicas for a pod/application, such as: deployments, daemonsets, replicasets, statefulsets, and jobs.

 

Important: Do not manually delete nodes stuck in Deleting state.

It is best to determine the cause of deletion to unblock the Kubernetes drain to gracefully clean up the node.

Manually deleting a node object differs from deleting a Machine object in that it's not graceful. It won't cordon the node (cordon prevents new pods from scheduling on the node) and it will not gracefully evict the workloads to other available nodes. This can cause issues with the volumes attached to pods that were running on this deleted node.

Although the purpose of draining is to move pods off of the node onto another healthy node in the cluster, system pods such as the container network interface (CNI) pods (Antrea or Calico) and the vsphere-csi pods must remain on the deleting node so the system can finish its draining and deletion process. The vsphere-csi system pod is required to detach volumes from the draining node.

Resolution

The pod disruption budget (PDB) will need to be temporarily removed from the cluster until cluster node roll-out/replacement completes for all nodes in the cluster.

It is recommended to first take a backup of the pod disruption budget.

These steps will need to be repeated for all pod disruption budgets with Allowed Disruptions of 0 in the cluster.

IMPORTANT: Although the purpose of draining is to move pods off of the node onto another healthy node in the cluster, system pods such as the container network interface (CNI) pods (Antrea or Calico) and the vsphere-csi pods must remain on the deleting node so the system can finish its draining and deletion process. The vsphere-csi system pod is required to detach volumes from the draining node.

If nodes are still failing to drain after temporarily removing the PDBs, do not delete VMware system pods to force the drain.

  • Reach out to VMware by Broadcom Technical Support referencing this KB article for assistance.
  • In addition to pod draining, volumes must be detached from a draining node to finish its deletion process gracefully.
    • The following command can be run to check for any volumes still attached to the node:
      kubectl get volumeattachments -A -o wide | grep <draining node name>
    • Warning: Deleting a node that still has volumes attached risks those volumes failing to detach from that node and attach to another node, which prevents the pods that require those volumes from starting. Troubleshooting should focus on why the volumes are not detaching from the node.

There is a known issue where gatekeeper pods deployed and managed by Tanzu Mission Control (TMC) may have PDBs configured that prevent the gatekeeper pods from being drained.
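
A quick way to check for such PDBs, assuming the gatekeeper resources include "gatekeeper" in their names:

kubectl get pdb -A | grep -i gatekeeper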

 

Workaround Steps

Temporarily remove the PDB from the cluster until cluster node roll-out/replacement completes for all nodes in the cluster.

Note: Before making any changes to the existing PDBs, please consult with the application/workload owner.

  1. Connect to the affected workload cluster's context.

  2. Take a backup of the pod disruption budget:
    kubectl get pdb <pdb-name> -n <namespace> -o yaml > <pdb-name>-backup.yaml
  3. Confirm that the backup contains the expected PDB YAML:
    cat <pdb-name>-backup.yaml | less
  4. IMPORTANT: If you are connected via SSH to a workload cluster control plane node, copy the backup to another machine, such as the Supervisor cluster or a jumpbox.
    • If the rolling redeployment or upgrade has yet to reach the current control plane node, this can result in the backup being lost when the current node is replaced.

  5. Perform a kubectl delete on the pod disruption budget:
    kubectl delete pdb <pdb-name> -n <namespace>
  6. After the cluster node roll-out/replacement completes for all nodes in the cluster (one way to confirm this is noted after these steps), restore the PDB using the backup:
    kubectl apply -f <pdb-name>-backup.yaml
  7. Confirm that the pdb was restored successfully:
    kubectl get pdb <pdb-name> -n <namespace>
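
Note: One way to confirm that the roll-out/replacement has completed before restoring the PDB is to check the Machine objects from the Supervisor cluster context; all Machine objects for the workload cluster are expected to report the Running phase, with none remaining in Deleting:
    kubectl get machine -n <workload cluster namespace>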

Additional Information

 

To avoid future issues, the total number of pod replicas can be increased, or the Pod Disruption Budget (PDB) can be edited to be more tolerant:

Note: Before making any changes to the existing PDBs, please consult with the application/workload owner.

  1. Connect to the affected workload cluster's context.

  2. Edit the pod disruption budget to either decrease the minAvailable value or increase the maxUnavailable value, making the PDB more tolerant (a non-interactive alternative using kubectl patch is sketched after these steps):
    kubectl edit pdb <pdb-name> -n <namespace>
    
    # Decrease the .spec.minAvailable value or increase the .spec.maxUnavailable value
  3. Confirm that the edited pod disruption budget change was implemented successfully:
    kubectl get pdb <pdb-name> -n <namespace>
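
Note: As a non-interactive alternative to the kubectl edit above, the same change can be applied with kubectl patch. This is a minimal sketch assuming a minAvailable-based PDB; the value shown is only an example and should be agreed upon with the application/workload owner:
    kubectl patch pdb <pdb-name> -n <namespace> --type=merge -p '{"spec":{"minAvailable":1}}'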