Deleting an EAM agency caused etcd and kernel version mismatches along with system Pods in CLBO



Article ID: 405536


Updated On:

Products

Tanzu Kubernetes Runtime
VMware vSphere Kubernetes Service

Issue/Introduction

The user followed Troubleshooting vSphere Supervisor Control Plane VMs and deleted a Supervisor control plane VM, which triggered recreation of that VM with different kernel and etcd versions and left the Supervisor Cluster unable to upgrade to the next version. The cluster must be healthy before an upgrade can proceed, but a few of its resources may fail after the replacement Supervisor VM is created with the newer kernel and etcd versions.
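
Kernel drift across the Supervisor control plane nodes can be confirmed from the node list; this is a generic check, and the node names are whatever your environment reports:

    kubectl get nodes -o wide
    # the KERNEL-VERSION column differs on the recreated control plane node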

    • vsphere-csi-controller pods keep crashing on those nodes with older kernel versions.

      kubectl get pods -n vmware-system-csi -o wide

      NAME                                      READY   STATUS             RESTARTS      AGE    IP           NODE               NOMINATED NODE   READINESS GATES
      vsphere-csi-controller-xxx-abc12          7/7     Running            20 (10h ago)  6d1h   10.244.0.2   node-kernel-new     <none>           <none>   
      vsphere-csi-controller-xxx-def34          7/7     CrashLoopBackOff   12 (10h ago)  6d1h   10.244.0.5   node-kernel-old-1   <none>           <none>   
      vsphere-csi-controller-xxx-ghi56          7/7     CrashLoopBackOff   27 (10h ago)  6d1h   10.244.0.6   node-kernel-old-2   <none>           <none>   
      vsphere-csi-webhook-yyy-jkl78             1/1     Running            0             6d1h   10.244.0.6   node-kernel-old-2   <none>           <none>
      vsphere-csi-webhook-yyy-mno90             1/1     Running            0             6d1h   10.244.0.5   node-kernel-old-1   <none>           <none>
      vsphere-csi-webhook-yyy-pqr12             1/1     Running            4 (6d1h ago)  6d1h   10.244.0.2   node-kernel-new     <none>           <none>

    • etcdctl cluster status shows a version mismatch (a command sketch follows the table):
      +------------------------+-------------------+---------+-----------+
      |      ENDPOINT          |        ID         | VERSION | IS LEADER |
      +------------------------+-------------------+---------+-----------+
      | supervisor-node-01:2379| id-xxxxxx0001     |  3.5.7  |   false   |
      | supervisor-node-02:2379| id-xxxxxx0002     |  3.5.7  |   true    |
      | supervisor-node-03:2379| id-xxxxxx0003     |  3.5.11 |   false   |
      +------------------------+-------------------+---------+-----------+
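
      A minimal sketch of producing this status table from one of the Supervisor control plane VMs; the endpoint and certificate paths below are the usual kubeadm defaults and are assumptions, adjust them to your environment:

      ETCDCTL_API=3 etcdctl \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        endpoint status --cluster -w table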

  • The issues below occur once the Supervisor upgrade of the cluster with the mismatched kernel versions completes:
    • The tanzu-cluster-api-bootstrap-kubeadm and svc-velero.vsphere.vmware.com packages are in a Reconcile failed state:

      kubectl get pkgi -A | egrep 'bootstrap|svc-tkg.vsphere'

      svc-tkg-domain-xxxx                  tanzu-cluster-api-bootstrap-kubeadm         cluster-api-bootstrap-kubeadm.tanzu.vmware.com       1.9.3+vmware.0                 Reconcile failed      5m
      vmware-system-supervisor-services   svc-tkg.vsphere.vmware.com                  tkg.vsphere.vmware.com                               3.3.0                          Reconcile succeeded   274d

    • kubectl describe pkgi tanzu-cluster-api-bootstrap-kubeadm -n <namespace> shows:
        Useful Error Message: kapp: Error: waiting on reconcile deployment/capi-kubeadm-bootstrap-controller-manager (apps/v1) namespace: svc-tkg-domain-cl006:
        Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded
        (message: ReplicaSet "capi-kubeadm-bootstrap-controller-manager-c57465bd5" has timed out progressing.)

    • kubectl get po -o wide -A | grep -v Running

      NAMESPACE        NAME                                                       READY   STATUS                RESTARTS   AGE     IP           NODE
      svc-tkg-domain-cl006   capi-kubeadm-bootstrap-controller-manager-xxxxx            0/2     Pending               0          41h     <none>       <none>
      svc-tmc-cl006    tmc-agent-installer-xxxxxxxx-xxxx                          0/1     Completed             0         50s     xxx.xxx.x.x   <masked-node>
      velero           backup-driver-xxxxxxxx-xxxxx                               0/1     CrashLoopBackOff      769       2d17h   xxx.xxx.x.x   <masked-node>
      velero           velero-xxxxxxxxxxxx-xxxxx                                  0/1     Init:CrashLoopBackOff 763       2d17h   xxx.xxx.x.x   <masked-node>

    • kubectl describe pod capi-kubeadm-bootstrap-controller-manager-<suffix> -n <namespace> shows:
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2025-07-23T12:21:31Z"
          message: '0/8 nodes are available: 3 node(s) didn''t have free ports for the requested
            pod ports, 5 node(s) didn''t match Pod''s node affinity/selector. preemption:
            0/8 nodes are available: 3 No preemption victims found for incoming pod, 5 Preemption
            is not helpful for scheduling..'
          reason: Unschedulable
          status: "False"
          type: PodScheduled
        phase: Pending
        qosClass: Burstable
      hostNetwork: true
      ports:
      - containerPort: 9875   # manager
      - containerPort: 9441   # manager
      - containerPort: 8085   # manager
      - containerPort: 9845   # kube-rbac-proxy

      Because of hostNetwork: true, the above containerPorts are treated like hostPorts — they must be free on the node, otherwise the pod won't schedule.
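
      To see what is already holding those host ports on the affected nodes (for example, a stale replica left over from the VM recreation), one option is to list hostNetwork pods per node, or to check the ports directly from a shell on the control plane VM. This is a sketch; the jsonpath filter is an assumption about your client version and the port list is taken from the spec above:

      # pods using the host network (and therefore host ports), with the node they run on
      kubectl get pods -A -o jsonpath='{range .items[?(@.spec.hostNetwork==true)]}{.spec.nodeName}{"\t"}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'

      # or, on the affected control plane VM, check which process owns the ports
      ss -lntp | grep -E ':9875|:9441|:8085|:9845'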

    • Velero pods keep crashing
      kubectl get pod -n velero -o wide | grep -v Running  
      NAMESPACE   NAME                             READY   STATUS                RESTARTS   AGE     IP           NODE
      velero      backup-driver-xxxxxxxx-xxxxx      0/1     CrashLoopBackOff      769       2d17h   xxx.xxx.x.x   <masked-node>
      velero      velero-xxxxxxxxxxxx-xxxxx         0/1     Init:CrashLoopBackOff 763       2d17h   xxx.xxx.x.x   <masked-node>
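
      The previous container logs usually show why the Velero pods keep restarting; a generic sketch, pod names are placeholders:

      kubectl logs -n velero <backup-driver-pod> --previous
      kubectl logs -n velero <velero-pod> --all-containers --previous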


Environment

vSphere with Tanzu 8.x

Cause

Deletion of the EAM agency forcibly deletes the Supervisor control plane VM. The auto-recreated VM may use a different kernel version, causing environment drift. This method of “fixing” Supervisor issues is unsupported and potentially cluster-breaking.

Do not delete EAM agencies without explicit VMware Support guidance. The versions involved and the existing health of the cluster determine whether it can be recovered; manual deletion can render the cluster unrecoverable.

Resolution

  • vsphere-csi-controller pods failed on nodes running older kernel versions. These nodes attempted to pull newer CSI container images, resulting in image pull errors and pod initialization failures. This mismatch between image expectations and node runtime environment led to persistent CrashLoopBackOff states.

    To fix this:

    1. Export and back up the current CSI manifest; at this point it references the new (for example, 8.0 U3) images that are currently running:

    kubectl get deployment vsphere-csi-controller -n vmware-system-csi -o yaml > vsphere-csi-deployment-backup.yaml

    2. Verify that the new control plane VM (CPVM) still has the old CSI images in its local registry.

    3. Edit the CSI manifest so the deployment uses the OLD (original-version) CSI images, because that is the version the other components that interact with the CSI pods expect. All other pods are still running the old (for example, 8.0 U2) images while only the CSI pods run the 8.0 U3 images; even if everything appears to work at the moment, forward compatibility between those components and the newer CSI pods is not guaranteed.

    4. Once the CSI manifest is edited to use the old images, the pods should start normally because the new CPVM has those images in its local registry (see the sketch after these steps).
        If the CSI pods come up healthy, run the Supervisor upgrade precheck and proceed with the upgrade.
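
    A minimal sketch of steps 2-4, assuming the deployment lives in the vmware-system-csi namespace and using kubectl set image as one way to pin the image without a full manifest edit; the container and image names are placeholders for whatever your environment actually runs:

    # on the new CPVM: confirm the old CSI images are still present in the local container runtime
    crictl images | grep csi

    # pin the controller container back to the original image
    kubectl -n vmware-system-csi set image deployment/vsphere-csi-controller \
      vsphere-csi-controller=<old-csi-controller-image>:<old-tag>

    # watch the pods come back up
    kubectl -n vmware-system-csi get pods -w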

        Once the Supervisor upgrade completes, a few resources, such as the capi-kubeadm-bootstrap-controller-manager pods, may still fail:

  • Resource conflicts (for example, hostNetwork/hostPort pressure) can cause one replica of capi-kubeadm-bootstrap-controller-manager to fail to schedule.
    The other control plane nodes should ideally be available for scheduling.
    Delete the Pending pod, along with any Terminating or CrashLoopBackOff pods, as shown in the sketch below, then scale the deployment down to zero and back up again.
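
    A sketch of clearing the stuck replicas; the namespace and pod names are placeholders:

    # find the Pending replica
    kubectl get pods -n <namespace> --field-selector=status.phase=Pending

    # delete the stuck pod so the scheduler retries once the host ports are free
    kubectl delete pod <capi-kubeadm-bootstrap-controller-manager-pod> -n <namespace>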

  • Delete Velero pods to free resource locks
    kubectl delete pod -n velero <velero-pod-name>

  • Scale the Velero deployment down, then back up:
    kubectl scale deployment velero --replicas=0 -n velero
    kubectl scale deployment velero --replicas=2 -n velero
  • Scale down capi-kubeadm-bootstrap-controller-manager, then scale up:
    kubectl scale deployment capi-kubeadm-bootstrap-controller-manager --replicas=0 -n <namespace>

    kubectl scale deployment capi-kubeadm-bootstrap-controller-manager --replicas=2 -n <namespace>
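
    Optionally verify that both rollouts complete after scaling back up (a sketch):

    kubectl rollout status deployment/velero -n velero
    kubectl rollout status deployment/capi-kubeadm-bootstrap-controller-manager -n <namespace>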

Additional Information

Warning: 

  • Deleting an EAM agency will DELETE the supervisor control plane VM and a new one will be created. THIS IS NOT A VALID TROUBLESHOOTING METHOD.
  • Do not delete EAM agencies without EXPRESS guidance from a VMware support engineer.
  • Depending on versions and the existing health of the Supervisor cluster, it is entirely possible to render the entire cluster unrecoverable.
  • If VMware by Broadcom Technical Support finds evidence of manual EAM Agency deletion, they may mark the cluster as unsupported and require a redeploy of the entire Supervisor cluster solution.