A user followed Troubleshooting vSphere Supervisor Control Plane VMs and deleted a Supervisor control plane VM, triggering recreation of that VM with different kernel and etcd versions and leaving the Supervisor Cluster unable to upgrade to the next version. To proceed with an upgrade the cluster status must be healthy, but several of the cluster's resources may fail once the recreated Supervisor VM comes up with the newer kernel and etcd versions.
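The kernel drift between the recreated and original control plane nodes can be confirmed from the Supervisor cluster context, since the wide node listing includes a KERNEL-VERSION column:
kubectl get nodes -o wide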
vsphere-csi-controller pods keep crashing on the nodes that still run the older kernel version.
kubectl get pods -n vmware-system-csi -o wide
NAME                               READY   STATUS             RESTARTS       AGE    IP           NODE                NOMINATED NODE   READINESS GATES
vsphere-csi-controller-xxx-abc12   7/7     Running            20 (10h ago)   6d1h   10.244.0.2   node-kernel-new     <none>           <none>
vsphere-csi-controller-xxx-def34   7/7     CrashLoopBackOff   12 (10h ago)   6d1h   10.244.0.5   node-kernel-old-1   <none>           <none>
vsphere-csi-controller-xxx-ghi56   7/7     CrashLoopBackOff   27 (10h ago)   6d1h   10.244.0.6   node-kernel-old-2   <none>           <none>
vsphere-csi-webhook-yyy-jkl78      1/1     Running            0              6d1h   10.244.0.6   node-kernel-old-2   <none>           <none>
vsphere-csi-webhook-yyy-mno90      1/1     Running            0              6d1h   10.244.0.5   node-kernel-old-1   <none>           <none>
vsphere-csi-webhook-yyy-pqr12      1/1     Running            4 (6d1h ago)   6d1h   10.244.0.2   node-kernel-new     <none>           <none>
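To see why a given controller pod on an older-kernel node is crashing, the previous container logs and pod events can be checked; for example (the pod name is a placeholder taken from the output above):
kubectl logs vsphere-csi-controller-xxx-def34 -n vmware-system-csi --all-containers --previous
kubectl describe pod vsphere-csi-controller-xxx-def34 -n vmware-system-csi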
+-------------------------+-----------------+---------+-----------+
|        ENDPOINT         |       ID        | VERSION | IS LEADER |
+-------------------------+-----------------+---------+-----------+
| supervisor-node-01:2379 | id-xxxxxx0001   | 3.5.7   | false     |
| supervisor-node-02:2379 | id-xxxxxx0002   | 3.5.7   | true      |
| supervisor-node-03:2379 | id-xxxxxx0003   | 3.5.11  | false     |
+-------------------------+-----------------+---------+-----------+
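The etcd member versions above can be collected from one of the Supervisor control plane VMs with etcdctl; a hedged example, assuming the standard kubeadm certificate layout (paths may differ on Supervisor control plane VMs):
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key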
kubectl get pkgi -A | egrep 'bootstrap|svc-tkg.vsphere'
svc-tkg-domain-xxxx                 tanzu-cluster-api-bootstrap-kubeadm   cluster-api-bootstrap-kubeadm.tanzu.vmware.com   1.9.3+vmware.0   Reconcile failed      5m
vmware-system-supervisor-services   svc-tkg.vsphere.vmware.com            tkg.vsphere.vmware.com                           3.3.0            Reconcile succeeded   274d
kubectl describe pkgi tanzu-cluster-api-bootstrap-kubeadm -n <namespace> shows:
Useful Error Message: kapp: Error: waiting on reconcile deployment/capi-kubeadm-bootstrap-controller-manager (apps/v1) namespace: svc-tkg-domain-cl006: Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "capi-kubeadm-bootstrap-controller-manager-c57465bd5" has timed out progressing.))
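The ProgressDeadlineExceeded condition reported by kapp can also be confirmed against the deployment itself; for example, using the namespace from the error message above:
kubectl rollout status deployment/capi-kubeadm-bootstrap-controller-manager -n svc-tkg-domain-cl006
kubectl get rs -n svc-tkg-domain-cl006 | grep capi-kubeadm-bootstrap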
kubectl get po -o wide -A | grep -v Running
NAMESPACE              NAME                                              READY   STATUS                  RESTARTS   AGE     IP            NODE
svc-tkg-domain-cl006   capi-kubeadm-bootstrap-controller-manager-xxxxx   0/2     Pending                 0          41h     <none>        <none>
svc-tmc-cl006          tmc-agent-installer-xxxxxxxx-xxxx                 0/1     Completed               0          50s     xxx.xxx.x.x   <masked-node>
velero                 backup-driver-xxxxxxxx-xxxxx                      0/1     CrashLoopBackOff        769        2d17h   xxx.xxx.x.x   <masked-node>
velero                 velero-xxxxxxxxxxxx-xxxxx                         0/1     Init:CrashLoopBackOff   763        2d17h   xxx.xxx.x.x   <masked-node>
kubectl describe pod capi-kubeadm-bootstrap-controller-manager-<suffix> -n capi-kubeadm-bootstrap-system shows:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-07-23T12:21:31Z"
    message: '0/8 nodes are available: 3 node(s) didn''t have free ports for the requested pod ports,
      5 node(s) didn''t match Pod''s node affinity/selector. preemption: 0/8 nodes are available:
      3 No preemption victims found for incoming pod, 5 Preemption is not helpful for scheduling..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
The pod spec requests host networking and fixed container ports:
hostNetwork: true
ports:
- containerPort: 9875   # manager
- containerPort: 9441   # manager
- containerPort: 8085   # manager
- containerPort: 9845   # kube-rbac-proxy
kubectl get pod -n velero -o wide | grep -v Running
NAMESPACE   NAME                           READY   STATUS                  RESTARTS   AGE     IP            NODE
velero      backup-driver-xxxxxxxx-xxxxx   0/1     CrashLoopBackOff        769        2d17h   xxx.xxx.x.x   <masked-node>
velero      velero-xxxxxxxxxxxx-xxxxx      0/1     Init:CrashLoopBackOff   763        2d17h   xxx.xxx.x.x   <masked-node>
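Because the pod requests hostNetwork with fixed container ports, the "didn't have free ports" message usually means another host-network pod on the three control plane nodes already binds one of those ports, often a leftover replica of the same controller. A hedged way to check (these commands are not part of the original output):
kubectl get pods -A -o wide | grep capi-kubeadm-bootstrap-controller-manager
kubectl get pods -A -o jsonpath='{range .items[?(@.spec.hostNetwork==true)]}{.spec.nodeName}{"\t"}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | sort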
vSphere with Tanzu 8.x
Deletion of the EAM agency forcibly deletes the Supervisor control plane VM. The auto-recreated VM may use a different kernel version, causing environment drift. This method of “fixing” Supervisor issues is unsupported and potentially cluster-breaking.
Do not delete EAM agencies without explicit VMware Support guidance. The component versions and overall health state of the Supervisor determine whether it can be recovered; manual deletion can render the cluster unrecoverable.
The vsphere-csi-controller pods failed on the nodes running the older kernel version. Those nodes attempted to pull newer CSI container images, resulting in image pull errors and pod initialization failures, and the mismatch between the expected images and the node runtime environment left the pods in a persistent CrashLoopBackOff state.
1. Export and back up the current CSI controller manifest; this captures the newer (for example, 8.0 U3) images the deployment is currently running:
kubectl get deployment vsphere-csi-controller -n vmware-system-csi -o yaml > vsphere-csi-deployment-backup.yaml
2. If the velero pods are stuck in CrashLoopBackOff, delete them so they are recreated:
kubectl delete pod -n velero <velero-pod-name>
3. If the velero pods do not recover, scale the velero deployment down and back up:
kubectl scale deployment velero --replicas=0 -n velero
kubectl scale deployment velero --replicas=2 -n velero
4. If the capi-kubeadm-bootstrap-controller-manager pods fail to schedule, scale that deployment down, then back up:
kubectl scale deployment capi-kubeadm-bootstrap-controller-manager --replicas=0 -n <namespace>
kubectl scale deployment capi-kubeadm-bootstrap-controller-manager --replicas=2 -n <namespace>
Warning: