In a vSphere Supervisor environment, a workload cluster may be stuck in an unhealthy state because nodes fail to drain while they are being updated or after they have failed a machine health check from the Supervisor.
Investigating from the Supervisor cluster context:
kubectl get machines -n <workload cluster namespace>
Describing the cordoned control plane node displays a message indicating that the cluster-auth-pinniped-kube-cert-agent pod has not yet been removed:
Status: False
Type: ControllerManagerPodHealthy
Last Transition Time: YYYY-MM-DDTHH:MM:SSZ
Message: Drain not completed yet (started at 2026-03-10T14:38:50Z):
Pod vmware-system-tmc/cluster-auth-pinniped-kube-cert-agent-#########-######: deletionTimestamp set, but still not removed from the Node
Reason: Draining
Severity: Info
Status: False
Type: DrainingSucceeded
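To read only the drain message instead of the full describe output, the condition can be extracted with a JSONPath query. This is a minimal sketch, assuming the Machine object exposes the DrainingSucceeded condition under .status.conditions as in the output above. The function name drain_message is hypothetical, and the command is meant for the Supervisor cluster context.

```shell
# Hypothetical helper: print the message of the DrainingSucceeded condition
# for one Machine. Assumes the condition is listed under .status.conditions.
# Usage: drain_message <workload cluster namespace> <machine name>
drain_message() {
  kubectl get machine "$2" -n "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="DrainingSucceeded")].message}'
}
```

An empty result would suggest that the Machine is not currently draining or that the condition is reported under a different path in the installed version.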
Investigating from the workload cluster's context:
kubectl get nodes
NAME                       STATUS                     ROLES           AGE   VERSION
<worker node A>            Ready                      <none>          ##h   <previous VKR version>
<old control plane node>   Ready,SchedulingDisabled   control-plane   ##h   <previous VKR version>
<new control plane node>   Ready                      control-plane   ##h   <desired VKR version>
kubectl get pods -A -o wide | grep -Ev "Run|Completed"
NAMESPACE            NAME                                         READY   STATUS              RESTARTS   AGE   IP     NODE
pinniped-concierge   pinniped-concierge-<id>                      0/1     Pending             #          ##h   <IP>   <old node name>
pinniped-concierge   pinniped-concierge-kube-cert-agent-<id>      0/1     ImagePullBackOff    #          ##h   <IP>   <old node name>
vmware-system-tmc    cluster-auth-pinniped-kube-cert-agent-<id>   0/1     ContainerCreating   #          ##h   <IP>   <old node name>
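The text filter above matches "Run" or "Completed" anywhere in a line, so a pod whose name happens to contain either string would be hidden as well. A stricter variant can compare only the STATUS column. The sketch below assumes the default column order of kubectl get pods -A -o wide (STATUS as the fourth field); the function name unhealthy_pods is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods -A -o wide` output on stdin and
# print the header line plus every pod whose STATUS column (field 4) is
# neither Running nor Completed.
unhealthy_pods() {
  awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}

# Usage (workload cluster context):
#   kubectl get pods -A -o wide | unhealthy_pods
```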
vSphere Supervisor
VKS 3.3 and higher
Starting in VKS 3.3, the behavior of draining nodes in a workload cluster has changed.
If a node does not drain within the workload cluster's configured node drain time-out, the upgrade will not continue.
In a workload cluster with an identity provider (IDP) enabled, the pinniped-concierge component can be prevented from draining properly.
In this scenario, the pinniped-concierge-kube-cert-agent pod is unable to drain successfully and continues to restart on the draining node, which causes the operation to fail.
Manually deleting nodes will not allow the upgrade to proceed.
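To see which drain time-out applies to a stuck node, the value recorded on its Machine object can be read from the Supervisor cluster context. This is a sketch only: it assumes the Cluster API v1beta1 field .spec.nodeDrainTimeout is populated on the Machine, and the function name machine_drain_timeout is hypothetical.

```shell
# Hypothetical helper: print the nodeDrainTimeout recorded on one Machine.
# Assumes the Cluster API v1beta1 field .spec.nodeDrainTimeout; an empty
# result means the field is not set on that Machine.
# Usage: machine_drain_timeout <workload cluster namespace> <machine name>
machine_drain_timeout() {
  kubectl get machine "$2" -n "$1" -o jsonpath='{.spec.nodeDrainTimeout}'
}
```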
Create a MachineDrainRule in the affected workload cluster's namespace. This must be performed in the Supervisor Cluster context.
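List the pinniped pods to identify the namespace of the kube-cert-agent pod that is blocking the drain (this is the namespace referred to as Step 2 in the YAML below):
kubectl get pods -A | grep -i pinniped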
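When several pinniped pods are listed, the candidate namespaces can be narrowed to those that actually contain a kube-cert-agent pod. The sketch below assumes the default column order of kubectl get pods -A (namespace first, pod name second); the function name cert_agent_namespaces is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# each distinct namespace (field 1) that contains a pod whose name (field 2)
# includes "kube-cert-agent". The header line is skipped.
cert_agent_namespaces() {
  awk 'NR > 1 && $2 ~ /kube-cert-agent/ { print $1 }' | sort -u
}

# Usage:
#   kubectl get pods -A | cert_agent_namespaces
```

If more than one namespace is printed, the namespace shown in the machine's drain message is the one to compare against.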
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDrainRule
metadata:
  name: vks-pod-drain-skip-pinniped
spec:
  drain:
    behavior: Skip
  pods:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: <namespace from Step 2>
    selector:
      matchLabels:
        kube-cert-agent.pinniped.dev: v3
kubectl apply -f <MachineDrainRule.yaml> -n <workload cluster namespace>
kubectl get machinedrainrule -n <workload cluster namespace>
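Because YAML indentation is significant, it may be convenient to generate the manifest rather than type it by hand. The sketch below prints the same MachineDrainRule with the pod namespace filled in from an argument. The function name render_drain_rule is hypothetical; the field layout follows the manifest above.

```shell
# Hypothetical helper: print the MachineDrainRule manifest for the given pod
# namespace (the namespace that contains the kube-cert-agent pod).
# Usage: render_drain_rule <pod namespace>
render_drain_rule() {
  ns=$1
  cat <<EOF
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDrainRule
metadata:
  name: vks-pod-drain-skip-pinniped
spec:
  drain:
    behavior: Skip
  pods:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: ${ns}
    selector:
      matchLabels:
        kube-cert-agent.pinniped.dev: v3
EOF
}

# Usage (Supervisor cluster context):
#   render_drain_rule <pod namespace> > MachineDrainRule.yaml
#   kubectl apply -f MachineDrainRule.yaml -n <workload cluster namespace>
```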