Workload Cluster Upgrade Stuck on vSphere Kubernetes Service (VKS) 3.3 and higher due to Pinniped-Concierge Pods

Article ID: 410900

Products

VMware vSphere Kubernetes Service

Issue/Introduction

In a vSphere Supervisor environment, a workload cluster may be stuck in an unhealthy state because nodes that are being updated, or that have failed a machine health check from the Supervisor, cannot be drained.

Investigating from the Supervisor cluster context:

  • A workload cluster node is stuck in Deleting state:
    kubectl get machines -n <workload cluster namespace>
  • Describing the cordoned control plane node displays a message indicating that the cluster-auth-pinniped-kube-cert-agent pod has not yet been removed:

    Status:                False
    Type:                  ControllerManagerPodHealthy
    Last Transition Time:  YYYY-MM-DDTHH:MM:SSZ
    Message:               Drain not completed yet (started at 2026-03-10T14:38:50Z):
      Pod vmware-system-tmc/cluster-auth-pinniped-kube-cert-agent-#########-######: deletionTimestamp set, but still not removed from the Node
    Reason:                Draining
    Severity:              Info
    Status:                False
    Type:                  DrainingSucceeded
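
The two checks above can also be scripted. The following is a minimal sketch, not an official procedure: the function names deleting_machines and drain_message are hypothetical, and it assumes the Machine objects expose status.phase and a DrainingSucceeded condition as shown in the describe output above. Run the functions from the Supervisor cluster context.

```shell
# List the names of Machines in the Deleting phase.
# $1 = workload cluster namespace.
deleting_machines() {
  kubectl get machines -n "$1" \
    -o jsonpath='{range .items[?(@.status.phase=="Deleting")]}{.metadata.name}{"\n"}{end}'
}

# Print only the message of the DrainingSucceeded condition for one Machine.
# $1 = workload cluster namespace, $2 = Machine name.
drain_message() {
  kubectl get machine "$2" -n "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="DrainingSucceeded")].message}'
}
```

For example, call deleting_machines with the workload cluster namespace, then pass each returned name to drain_message to see which pod is blocking the drain.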

 

Investigating from the workload cluster's context:

  • The Deleting node shows a status of Ready,SchedulingDisabled:
    In the example below, an updated control plane node was created successfully, but the old control plane node is stuck draining.
    kubectl get nodes
    
    NAME                                 STATUS                     ROLES           AGE   VERSION
    <worker node A>                      Ready                      <none>          ##h   <previous VKR version>
    <old control plane node>             Ready,SchedulingDisabled   control-plane   ##h   <previous VKR version>
    <new control plane node>             Ready                      control-plane   ##h   <desired VKR version>

     

  • A 'pinniped-concierge-kube-cert-agent' pod is recreated every 20 seconds on the affected node, preventing the drain operation from completing:
    The pinniped pods below are an example. Names and namespaces may vary by environment.
    kubectl get pods -A -o wide | grep -Ev "Run|Completed"
    
    NAMESPACE            NAME                                         READY   STATUS              RESTARTS   AGE   IP     NODE
    pinniped-concierge   pinniped-concierge-<id>                      0/1     Pending             #          ##h   <IP>   <old node name>
    pinniped-concierge   pinniped-concierge-kube-cert-agent-<id>      0/1     ImagePullBackOff    #          ##h   <IP>   <old node name>
    vmware-system-tmc    cluster-auth-pinniped-kube-cert-agent-<id>   0/1     ContainerCreating   #          ##h   <IP>   <old node name>
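
To narrow the pod listing to the node that is stuck draining, a field selector on spec.nodeName can be used instead of filtering by status. This is a minimal sketch; the function names pods_on_node and watch_pods_on_node are hypothetical. Run the functions from the workload cluster's context.

```shell
# List every pod scheduled on one node.
# $1 = node name (for example, the old control plane node that is draining).
pods_on_node() {
  kubectl get pods -A -o wide --field-selector "spec.nodeName=$1"
}

# Same listing, but keep watching for changes so that the repeated
# recreation of the kube-cert-agent pod on that node becomes visible.
watch_pods_on_node() {
  kubectl get pods -A -o wide --watch --field-selector "spec.nodeName=$1"
}
```

With the watch variant, a kube-cert-agent pod that keeps reappearing under a new name on the cordoned node matches the symptom described above.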

 

Environment

vSphere Supervisor
VKS 3.3 and higher

Cause

Starting in VKS 3.3, the behavior of draining nodes in a workload cluster has changed.

If a node does not drain within the workload cluster's configured node drain timeout, the upgrade will not continue.
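
To see which timeout applies to a stuck node, the value can be read from its Machine object. This is a minimal sketch under an assumption not confirmed by this article: that the timeout is surfaced as spec.nodeDrainTimeout on the Cluster API v1beta1 Machine. The function name machine_drain_timeout is hypothetical. Run it from the Supervisor cluster context.

```shell
# Print the node drain timeout recorded on one Machine.
# $1 = workload cluster namespace, $2 = Machine name.
# Assumption: the value lives at spec.nodeDrainTimeout; an empty result
# means the field is not set on this Machine.
machine_drain_timeout() {
  kubectl get machine "$2" -n "$1" -o jsonpath='{.spec.nodeDrainTimeout}{"\n"}'
}
```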

In a workload cluster with IDP enabled, the pinniped-concierge component can fail to drain properly.

In this scenario, the pinniped-concierge-kube-cert-agent pod cannot be drained and is continually recreated on the draining node, which causes the drain operation to fail.

Manually deleting nodes will not help the upgrade proceed.

Resolution

Workaround:

Create a MachineDrainRule in the affected workload cluster's namespace. This must be performed in the Supervisor Cluster context.

  1. Connect to the affected workload cluster's context.

  2. Note the namespace of the pinniped pods that are being recreated in the workload cluster:
    kubectl get pods -A | grep -i pinniped
  3. Connect to the Supervisor Cluster context.

  4. Create a file with the below MachineDrainRule contents:
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachineDrainRule
    metadata:
      name: vks-pod-drain-skip-pinniped
    spec:
      drain:
        behavior: Skip
      pods:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: <namespace from Step 2>
        selector:
          matchLabels:
            kube-cert-agent.pinniped.dev: v3
  5. Apply the above YAML file in the workload cluster's namespace:
    kubectl apply -f <MachineDrainRule.yaml> -n <workload cluster namespace>
  6. Confirm that the MachineDrainRule was created in the desired workload cluster namespace:
    kubectl get machinedrainrule -n <workload cluster namespace>
  7. The affected node should complete the drain operation and the node update/redeployment will succeed.
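
Two optional checks can support the steps above. This is a minimal sketch; the function names labeled_cert_agent_pods and watch_machines are hypothetical. The first function confirms that the recreating pods actually carry the label used in the MachineDrainRule selector, and the second follows the Machine phases after the rule is applied.

```shell
# Run in the workload cluster context: list the pods that carry the label
# selected by the MachineDrainRule, together with their namespace and node.
labeled_cert_agent_pods() {
  kubectl get pods -A -o wide -l 'kube-cert-agent.pinniped.dev=v3'
}

# Run in the Supervisor context: watch Machine phases so the stuck Machine
# can be seen leaving the Deleting phase. $1 = workload cluster namespace.
watch_machines() {
  kubectl get machines -n "$1" --watch
}
```

If labeled_cert_agent_pods returns no pods, or the pods are in a namespace other than the one entered in the rule, the rule will not match and the drain is likely to remain blocked.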