Deleting an EAM agency caused etcd and kernel version mismatches along with system Pods in CLBO



Article ID: 405536


Updated On:

Products

Tanzu Kubernetes Runtime
VMware vSphere Kubernetes Service

Issue/Introduction

The user followed Troubleshooting vSphere Supervisor Control Plane VMs and deleted a Supervisor control plane VM, which triggered recreation of that VM with different kernel and etcd versions and left the Supervisor Cluster unable to upgrade to the next version. The cluster must be healthy before an upgrade can proceed, but a few of its resources may fail after the replacement Supervisor VM is created with the newer kernel and etcd versions.
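
Kernel drift across the Supervisor control plane nodes can be confirmed from the node list; this is a generic check, and the node names are whatever your environment reports:

    kubectl get nodes -o wide
    # the KERNEL-VERSION column differs on the recreated control plane node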

    • vsphere-csi-controller pods keep crashing on those nodes with older kernel versions.

      kubectl get pods -n vmware-system-csi -o wide

      NAME                                      READY   STATUS             RESTARTS      AGE    IP           NODE               NOMINATED NODE   READINESS GATES
      vsphere-csi-controller-xxx-abc12          7/7     Running            20 (10h ago)  6d1h   10.244.0.2   node-kernel-new     <none>           <none>   
      vsphere-csi-controller-xxx-def34          7/7     CrashLoopBackOff   12 (10h ago)  6d1h   10.244.0.5   node-kernel-old-1   <none>           <none>   
      vsphere-csi-controller-xxx-ghi56          7/7     CrashLoopBackOff   27 (10h ago)  6d1h   10.244.0.6   node-kernel-old-2   <none>           <none>   
      vsphere-csi-webhook-yyy-jkl78             1/1     Running            0             6d1h   10.244.0.6   node-kernel-old-2   <none>           <none>
      vsphere-csi-webhook-yyy-mno90             1/1     Running            0             6d1h   10.244.0.5   node-kernel-old-1   <none>           <none>
      vsphere-csi-webhook-yyy-pqr12             1/1     Running            4 (6d1h ago)  6d1h   10.244.0.2   node-kernel-new     <none>           <none>

    • etcdctl cluster status shows a version mismatch (a command sketch follows the table):
      +------------------------+-------------------+---------+-----------+
      |      ENDPOINT          |        ID         | VERSION | IS LEADER |
      +------------------------+-------------------+---------+-----------+
      | supervisor-node-01:2379| id-xxxxxx0001     |  3.5.7  |   false   |
      | supervisor-node-02:2379| id-xxxxxx0002     |  3.5.7  |   true    |
      | supervisor-node-03:2379| id-xxxxxx0003     |  3.5.11 |   false   |
      +------------------------+-------------------+---------+-----------+
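
      A minimal sketch of producing this status table from one of the Supervisor control plane VMs; the endpoint and certificate paths below are the usual kubeadm defaults and are assumptions, adjust them to your environment:

      ETCDCTL_API=3 etcdctl \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        endpoint status --cluster -w table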

  • The issues below occur once the Supervisor upgrade of the cluster with the mismatched kernel versions completes:
    • The tanzu-cluster-api-bootstrap-kubeadm and svc-velero.vsphere.vmware.com packages are in a Reconcile failed state:

      kubectl get pkgi -A | egrep 'bootstrap|svc-tkg.vsphere'

      svc-tkg-domain-xxxx                  tanzu-cluster-api-bootstrap-kubeadm         cluster-api-bootstrap-kubeadm.tanzu.vmware.com       1.9.3+vmware.0                 Reconcile failed      5m
      vmware-system-supervisor-services   svc-tkg.vsphere.vmware.com                  tkg.vsphere.vmware.com                               3.3.0                          Reconcile succeeded   274d

    • kubectl describe pkgi tanzu-cluster-api-bootstrap-kubeadm -n <namespace> shows:
        Useful Error Message: kapp: Error: waiting on reconcile deployment/capi-kubeadm-bootstrap-controller-manager (apps/v1) namespace: svc-tkg-domain-cl006:
        Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded
        (message: ReplicaSet "capi-kubeadm-bootstrap-controller-manager-c57465bd5" has timed out progressing.)

    • kubectl get po -o wide -A | grep -v Running

      NAMESPACE        NAME                                                       READY   STATUS                RESTARTS   AGE     IP           NODE
      svc-tkg-domain-cl006   capi-kubeadm-bootstrap-controller-manager-xxxxx            0/2     Pending               0          41h     <none>       <none>
      svc-tmc-cl006    tmc-agent-installer-xxxxxxxx-xxxx                          0/1     Completed             0         50s     xxx.xxx.x.x   <masked-node>
      velero           backup-driver-xxxxxxxx-xxxxx                               0/1     CrashLoopBackOff      769       2d17h   xxx.xxx.x.x   <masked-node>
      velero           velero-xxxxxxxxxxxx-xxxxx                                  0/1     Init:CrashLoopBackOff 763       2d17h   xxx.xxx.x.x   <masked-node>

    • kubectl describe pod capi-kubeadm-bootstrap-controller-manager-<suffix> -n <namespace> shows:
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2025-07-23T12:21:31Z"
          message: '0/8 nodes are available: 3 node(s) didn''t have free ports for the requested
            pod ports, 5 node(s) didn''t match Pod''s node affinity/selector. preemption:
            0/8 nodes are available: 3 No preemption victims found for incoming pod, 5 Preemption
            is not helpful for scheduling..'
          reason: Unschedulable
          status: "False"
          type: PodScheduled
        phase: Pending
        qosClass: Burstable
      hostNetwork: true
      ports:
      - containerPort: 9875   # manager
      - containerPort: 9441   # manager
      - containerPort: 8085   # manager
      - containerPort: 9845   # kube-rbac-proxy

      Because of hostNetwork: true, the above containerPorts are treated like hostPorts — they must be free on the node, otherwise the pod won't schedule.
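
      To see what is already holding those host ports on the affected nodes (for example, a stale replica left over from the VM recreation), one option is to list hostNetwork pods per node, or to check the ports directly from a shell on the control plane VM. This is a sketch; the jsonpath filter is an assumption about your client version and the port list is taken from the spec above:

      # pods using the host network (and therefore host ports), with the node they run on
      kubectl get pods -A -o jsonpath='{range .items[?(@.spec.hostNetwork==true)]}{.spec.nodeName}{"\t"}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'

      # or, on the affected control plane VM, check which process owns the ports
      ss -lntp | grep -E ':9875|:9441|:8085|:9845'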

    • Velero pods keep crashing
      kubectl get pod -n velero -o wide | grep -v Running  
      NAMESPACE   NAME                             READY   STATUS                RESTARTS   AGE     IP           NODE
      velero      backup-driver-xxxxxxxx-xxxxx      0/1     CrashLoopBackOff      769       2d17h   xxx.xxx.x.x   <masked-node>
      velero      velero-xxxxxxxxxxxx-xxxxx         0/1     Init:CrashLoopBackOff 763       2d17h   xxx.xxx.x.x   <masked-node>
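
      The previous container logs usually show why the Velero pods keep restarting; a generic sketch, pod names are placeholders:

      kubectl logs -n velero <backup-driver-pod> --previous
      kubectl logs -n velero <velero-pod> --all-containers --previous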


Environment

vSphere with Tanzu 8.x

Cause

Deletion of the EAM agency forcibly deletes the Supervisor control plane VM. The auto-recreated VM may use a different kernel version, causing environment drift. This method of “fixing” Supervisor issues is unsupported and potentially cluster-breaking.

Do not delete EAM agencies without explicit VMware Support guidance. The versions involved and the existing health of the cluster determine whether it can be recovered; manual deletion can render the cluster unrecoverable.

Resolution

  • vsphere-csi-controller pods failed on nodes running older kernel versions. These nodes attempted to pull newer CSI container images, resulting in image pull errors and pod initialization failures. This mismatch between image expectations and node runtime environment led to persistent CrashLoopBackOff states.

    To fix this:

    1. Export and back up the current CSI manifest; at this point it references the new (for example, 8.0 U3) images that are currently running:

    kubectl get deployment vsphere-csi-controller -n vmware-system-csi -o yaml > vsphere-csi-deployment-backup.yaml

    2. Verify that the new control plane VM (CPVM) still has the old CSI images in its local registry.

    3. Edit the CSI manifest so the deployment uses the OLD (original-version) CSI images, because that is the version the other components that interact with the CSI pods expect. All other pods are still running the old (for example, 8.0 U2) images while only the CSI pods run the 8.0 U3 images; even if everything appears to work at the moment, forward compatibility between those components and the newer CSI pods is not guaranteed.

    4. Once the CSI manifest is edited to use the old images, the pods should start normally because the new CPVM has those images in its local registry (see the sketch after these steps).
        If the CSI pods come up healthy, run the Supervisor upgrade precheck and proceed with the upgrade.
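
    A minimal sketch of steps 2-4, assuming the deployment lives in the vmware-system-csi namespace and using kubectl set image as one way to pin the image without a full manifest edit; the container and image names are placeholders for whatever your environment actually runs:

    # on the new CPVM: confirm the old CSI images are still present in the local container runtime
    crictl images | grep csi

    # pin the controller container back to the original image
    kubectl -n vmware-system-csi set image deployment/vsphere-csi-controller \
      vsphere-csi-controller=<old-csi-controller-image>:<old-tag>

    # watch the pods come back up
    kubectl -n vmware-system-csi get pods -w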

        Once the Supervisor upgrade completes, a few resources, such as the capi-kubeadm-bootstrap-controller-manager pods, may still fail:

  • Resource conflicts (for example, hostNetwork/hostPort pressure) can cause one replica of capi-kubeadm-bootstrap-controller-manager to fail to schedule.
    The other control plane nodes should ideally be available for scheduling.
    Delete the Pending pod, along with any Terminating or CrashLoopBackOff pods, as shown in the sketch below, then scale the deployment down to zero and back up again.
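
    A sketch of clearing the stuck replicas; the namespace and pod names are placeholders:

    # find the Pending replica
    kubectl get pods -n <namespace> --field-selector=status.phase=Pending

    # delete the stuck pod so the scheduler retries once the host ports are free
    kubectl delete pod <capi-kubeadm-bootstrap-controller-manager-pod> -n <namespace>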

  • Delete Velero pods to free resource locks
    kubectl delete pod -n velero <velero-pod-name>

  • Scale the Velero deployment down, then back up:
    kubectl scale deployment velero --replicas=0 -n velero
    kubectl scale deployment velero --replicas=2 -n velero
  • Scale down capi-kubeadm-bootstrap-controller-manager, then scale up:
    kubectl scale deployment capi-kubeadm-bootstrap-controller-manager --replicas=0 -n <namespace>

    kubectl scale deployment capi-kubeadm-bootstrap-controller-manager --replicas=2 -n <namespace>
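
    Optionally verify that both rollouts complete after scaling back up (a sketch):

    kubectl rollout status deployment/velero -n velero
    kubectl rollout status deployment/capi-kubeadm-bootstrap-controller-manager -n <namespace>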

Additional Information

Warning: 

  • Deleting an EAM agency will DELETE the supervisor control plane VM and a new one will be created. THIS IS NOT A VALID TROUBLESHOOTING METHOD.
  • Do not delete EAM agencies without EXPRESS guidance from a VMware support engineer.
  • Depending on versions and the existing health of the Supervisor cluster, it is entirely possible to render the entire cluster unrecoverable.
  • If VMware by Broadcom Technical Support finds evidence of manual EAM Agency deletion, they may mark the cluster as unsupported and require a redeploy of the entire Supervisor cluster solution.