In a vSphere Supervisor environment, a workload cluster upgrade or rolling redeployment change is stuck because one or more nodes are stuck in the Deleting state.
This KB article assumes that the system infrastructure is healthy, but that pod disruption budgets (PDB) within the affected workload cluster are preventing the system from gracefully draining and then deleting the stuck nodes.
For more information on Pod Disruption Budgets, please see official Kubernetes documentation: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get machine -n <workload cluster namespace>
NAMESPACE                      NAME                 CLUSTER                   NODENAME             PROVIDERID               PHASE
<workload cluster namespace>   <worker node name>   <workload cluster name>   <worker node name>   vsphere://<providerID>   Deleting
kubectl get pods -A | grep cap
kubectl logs deployment/capi-controller-manager -n <capi namespace> -c manager | grep -i "evict"
kubectl logs deployment/capw-controller-manager -n <capi namespace> -c manager | grep -i "evict"
machine_controller.go:751] evicting pod <namespace>/<pod within workload cluster>
machine_controller.go:751] error when evicting pods/"<pod within workload cluster>" -n "<namespace>" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
While connected to the affected Workload Cluster's context, the following symptoms are present:
kubectl get nodes
NAME            STATUS
<worker node>   Ready,SchedulingDisabled
kubectl get pdb -A
NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
<pdb-name>   #               #                 0
kubectl get pods -A -o wide | grep <deleting node in SchedulingDisabled state>
NAMESPACE         NAME                        READY
<pod namespace>   <pod associated with PDB>   #/#
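To confirm which pods a given PDB covers, the PDB's label selector can be compared against the labels of the pods still running on the draining node. A minimal sketch, assuming the PDB and pod names observed above:
# Show the PDB's selector, minAvailable/maxUnavailable, and allowed disruptions
kubectl describe pdb <pdb-name> -n <pod namespace>
# Show the labels on the blocking pod to confirm it matches the PDB's selector
kubectl get pod <pod associated with PDB> -n <pod namespace> --show-labels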
The error "Cannot evict pod as it would violate the pod's disruption budget." indicates that there is a PodDisruptionBudget (PDB) applied to a workload/pod in the workload cluster which is blocking the node from draining and gracefully deleting.
When a Machine object is marked for replacement, the controllers automatically cordon the Kubernetes node, then drain it by evicting its workloads/pods to other available nodes/Machines.
If a PodDisruptionBudget (PDB) associated with one of those workloads/pods has zero Allowed Disruptions, it will block the node's drain, leaving the node in Ready,SchedulingDisabled status, because the PDB is configured so that a certain number of replicas of that pod must never be down at any time.
PDBs are usually associated with worker nodes in a workload cluster rather than control plane nodes, because applications that require PDBs are expected to run only on worker nodes. Upgrade logic upgrades the control plane nodes first; worker nodes are not upgraded until all control plane nodes have been upgraded successfully. As a result, a cluster upgrade blocked by a PDB typically becomes stuck after all control plane nodes have reached the desired version and the first new worker node on the desired version has been created.
Min Available PDB Example:
In this scenario, the pod is expected to have only 1 replica in the cluster. However, the PDB will never allow that replica to be brought down, as doing so would violate the minimum availability of 1 running replica. This causes a rolling redeployment or cluster upgrade to become stuck, because the PDB prevents the single replica from being evicted and moved to another node in the cluster. A sample manifest for a PDB of this type is shown after the output below.
kubectl get pdb -n <pod namespace>
NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
<pdb name>   1               N/A               0
kubectl get deployment -n <pod namespace>
NAME                READY   UP-TO-DATE   AVAILABLE
<deployment name>   1/1     1            1
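For reference, a PDB that produces this behavior could look similar to the following manifest. This is a hypothetical example; the name, namespace, and label selector are placeholders and will differ per application:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <pdb name>
  namespace: <pod namespace>
spec:
  minAvailable: 1        # never allow fewer than 1 running replica of the selected pods
  selector:
    matchLabels:
      app: <label of the single-replica application>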
Max Unavailable PDB Example:
In this scenario, the PDB allows the corresponding pod to have one missing replica, but it will not allow more than one replica to be down at a time.
Until the missing replica comes back up on another healthy node in the cluster, this PDB will prevent any other replicas of the pod from being drained from any deleting/draining nodes.
During this time, a rolling redeployment or cluster upgrade can become stuck until enough pod replicas are Running in the cluster again. A sample manifest for a PDB of this type is shown after the output below.
kubectl get pdb -n <pod namespace>
NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
<pdb name>   N/A             1                 0
kubectl get deployment -n <pod namespace>
NAME                READY   UP-TO-DATE   AVAILABLE
<deployment name>   2/3     2            2
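For reference, a PDB that produces this behavior could look similar to the following manifest. This is a hypothetical example; the name, namespace, and label selector are placeholders and will differ per application:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <pdb name>
  namespace: <pod namespace>
spec:
  maxUnavailable: 1      # allow at most 1 replica of the selected pods to be down at a time
  selector:
    matchLabels:
      app: <label of the multi-replica application>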
Note: The above PDB examples assume deployments are being used for the corresponding pod replicas. There are many Kubernetes objects that can be used to manage and deploy replicas for a pod/application, such as: deployments, daemonsets, replicasets, statefulsets, and jobs.
Rather than deleting the node object manually, it is best to determine what is blocking the Kubernetes drain and resolve it so that the node can be cleaned up gracefully.
Manually deleting a node object differs from deleting a Machine object in that it is not graceful: it does not cordon the node (cordoning prevents new pods from scheduling onto the node) and it does not gracefully evict the workloads to other available nodes. This can cause issues with volumes attached to pods that were running on the deleted node.
Although the purpose of draining is to move pods off of the node onto another healthy node in the cluster, system pods such as the container network interface (CNI) pods (Antrea or Calico) and the vsphere-csi pods need to remain on the deleting node so the system can finish the drain and deletion process. The vsphere-csi system pods are required to detach volumes from the draining node.
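To list every pod still running on the draining node, including the system pods mentioned above, a field selector can be used. A sketch, assuming <draining node name> is the node stuck in Ready,SchedulingDisabled:
kubectl get pods -A -o wide --field-selector spec.nodeName=<draining node name>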
The pod disruption budget (PDB) will need to be temporarily removed from the cluster until cluster node roll-out/replacement completes for all nodes in the cluster.
It is recommended to first take a backup of the pod disruption budget.
These steps will need to be repeated for all pod disruption budgets with Allowed Disruptions of 0 in the cluster.
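One way to list every PDB in the cluster that currently has zero Allowed Disruptions is to filter on the PDB status field. A sketch using kubectl's JSONPath support; the output is the namespace and PDB name, tab-separated:
kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'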
If nodes are still failing to drain after taking down the PDBs temporarily, do not delete VMware system pods to force the drain.
The following command can be used to confirm whether volumes are still attached to the draining node:
kubectl get volumeattachments -A -o wide | grep <draining node name>
There is a known issue where gatekeeper pods deployed and managed by Tanzu Mission Control (TMC) may have PDBs configured to stop the gatekeeper pods from draining.
Temporarily remove the PDB from the cluster until cluster node roll-out/replacement completes for all nodes in the cluster.
Note: Before making any changes to the existing PDBs, please consult with the application/workload owner.
kubectl get pdb <pdb-name> -n <namespace> -o yaml > <pdb-name>-backup.yaml
cat <pdb-name>-backup.yaml | less
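Optionally, the backup manifest can be validated before the PDB is deleted. This is a client-side check only and makes no changes to the cluster:
kubectl apply -f <pdb-name>-backup.yaml --dry-run=client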
Note: If the backup file is saved on a control plane node that the rolling redeployment or upgrade has not yet replaced, the backup can be lost when that node is replaced. Save the backup file outside of the cluster's nodes.
kubectl delete pdb <pdb-name> -n <namespace>
Once node roll-out/replacement has completed for all nodes in the cluster, restore the PDB from the backup and confirm that it exists:
kubectl apply -f <pdb-name>-backup.yaml
kubectl get pdb <pdb-name> -n <namespace>
Official Kubernetes Documentation: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
To avoid future issues, the total number of pod replicas or Pod Disruption Budget (PDB) can be edited to be more tolerant:
Note: Before making any changes to the existing PDBs, please consult with the application/workload owner.
kubectl edit pdb <pdb-name> -n <namespace>
# Decrease the .spec.minAvailable value or increase the .spec.maxUnavailable value
kubectl get pdb <pdb-name> -n <namespace>
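As an alternative to interactive editing, the same adjustments can be made non-interactively. These are illustrative sketches; the values shown are examples only and should be agreed upon with the application/workload owner:
# If the PDB uses .spec.minAvailable, lower it (example value):
kubectl patch pdb <pdb-name> -n <namespace> --type merge -p '{"spec":{"minAvailable":0}}'
# If the PDB uses .spec.maxUnavailable, raise it (example value):
kubectl patch pdb <pdb-name> -n <namespace> --type merge -p '{"spec":{"maxUnavailable":2}}'
# Alternatively, increase the replica count of the workload behind the PDB:
kubectl scale deployment <deployment name> -n <namespace> --replicas=3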