A pod on a TKGI cluster is stuck in the ContainerCreating state because its persistent volume cannot be attached. Describing the affected pod shows errors like the following.
$ kubectl describe pod pod1 -n ns1
Name: pod1
Namespace: ns1
Node: 07a5e858-####-3732d04ff5b2/10.##.##.7
Status: Pending
IP:
IPs: <none>
Containers:
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned pod1 to 07a5e858-####-3732d04ff5b2
Warning FailedAttachVolume 65s (x9 over 17m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-71fd3122-####-ef4978785b69" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"4230df9e-####-96acfb4a1bb8". Error: virtual machine wasn't found
Tanzu Kubernetes Grid Integrated Edition
The error means that vsphere-csi-controller could not find the virtual machine that backs the assigned node, so it could not attach the persistent volume. However, the details of Kubernetes node 07a5e858-####-3732d04ff5b2 show that the node is in fact linked to the vSphere virtual machine with providerID 4230df9e-####-96acfb4a1bb8.
$ kubectl get node 07a5e858-####-3732d04ff5b2 -o yaml
apiVersion: v1
kind: Node
metadata:
annotations:
csi.volume.kubernetes.io/nodeid: '{"csi.vsphere.vmware.com":"4230df9e-####-96acfb4a1bb8"}'
......
name: 07a5e858-####-3732d04ff5b2
uid: 24db1d73-####-9d9527fc6bb9
spec:
providerID: vsphere://4230df9e-####-96acfb4a1bb8
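To compare the two identifiers directly, the providerID can be stripped of its vsphere:// prefix and checked against the UUID quoted in the FailedAttachVolume event. The following is a minimal sketch; provider_uuid and same_vm are hypothetical helper names, not part of kubectl, and the node name and UUID in the usage comment are placeholders.

```shell
# Hypothetical helper: strip the "vsphere://" scheme from a providerID.
provider_uuid() {
  printf '%s\n' "${1#vsphere://}"
}

# Hypothetical helper: succeed when the providerID refers to the given VM UUID.
same_vm() {
  [ "$(provider_uuid "$1")" = "$2" ]
}

# Example against a live cluster (placeholders in angle brackets):
#   pid="$(kubectl get node <node-name> -o jsonpath='{.spec.providerID}')"
#   same_vm "$pid" "<uuid-from-event>" && echo "node and event refer to the same VM"
```

If the two values match, as they do in this case, the node registration is correct and the lookup failure has to come from where the CSI controller is searching.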
The vsphere-cloud-controller-manager job logs on the master instance also show that the virtual machine for the node with providerID 4230df9e-####-96acfb4a1bb8 was found in vCenter datacenter "DATACENTER_1".
master/0f8c3ec8-####-e3b59f19008d: stdout | I0501 03:58:51.965215 833372 search.go:208] Found node 4230df9e-####-96acfb4a1bb8 as vm=VirtualMachine:vm-277281 in vc=vc.example.net and datacenter=DATACENTER_1
However, the vsphere-csi-controller logs show that it searched for the virtual machine with providerID 4230df9e-####-96acfb4a1bb8 in a different datacenter, "DATACENTER_2", and failed with the reported error.
{"level":"warn","time":"2026-05-03T03:08:04.428441154Z","caller":"vsphere/virtualmachine.go:165","msg":"Couldn't find VM given uuid 4230df9e-####-96acfb4a1bb8 on DC Datacenter [Datacenter: Datacenter:datacenter-3 @ /DATACENTER_2, VirtualCenterHost: 10.##.##.62] with err: virtual machine wasn't found, continuing search","TraceId":"bd8f396d-####-bc884ff6b795"}
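When the controller log is long, the failed lookups can be reduced to the VM UUID and the datacenter that was searched. The following is a minimal sketch that assumes the log message has the same shape as the warning above; failed_vm_lookups is a hypothetical helper name.

```shell
# Hypothetical helper: read vsphere-csi-controller log lines on stdin and print
# each distinct "<vm-uuid> <datacenter-path>" pair from failed VM lookups.
failed_vm_lookups() {
  sed -n 's/.*find VM given uuid \([^ ]*\) on DC .* @ \([^,]*\), VirtualCenterHost.*/\1 \2/p' | sort -u
}

# Example, assuming the controller runs as an in-cluster deployment:
#   kubectl logs -n vmware-system-csi deployment/vsphere-csi-controller --all-containers \
#     | failed_vm_lookups
# If the controller runs as a BOSH job on the master instance, pipe that job's
# log into the function instead.
```

A datacenter path in this output that differs from the one reported by vsphere-cloud-controller-manager points to the mismatch described here.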
The issue therefore occurs because vsphere-csi-controller is searching the wrong datacenter for the virtual machine. Possible reasons are:
1) The datacenter configured in the Kubernetes Cloud Provider storage settings of the TKGI tile has been changed, but the change has not been applied to the TKGI cluster.
2) A separate vsphere-csi-controller deployment still exists on the TKGI cluster, and its datacenter value does not match the one configured in the Kubernetes Cloud Provider storage settings of the TKGI tile. Review the vsphere-config-secret secret in the vmware-system-csi namespace for the detailed settings.
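To review that secret, its data can be decoded and the datacenter setting pulled out. The following is a minimal sketch; it assumes the decoded configuration uses the INI-style datacenters key of the upstream vSphere CSI driver configuration, and configured_datacenters is a hypothetical helper name.

```shell
# Hypothetical helper: read a decoded CSI configuration on stdin and print the
# value of each "datacenters" key (assumed form: datacenters = "DC1, DC2").
configured_datacenters() {
  sed -n 's/^[[:space:]]*datacenters[[:space:]]*=[[:space:]]*"\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p'
}

# Example: decode every key of the secret and extract the datacenter value.
#   kubectl get secret vsphere-config-secret -n vmware-system-csi \
#     -o go-template='{{range $k, $v := .data}}{{$v | base64decode}}{{"\n"}}{{end}}' \
#     | configured_datacenters
```

The go-template form decodes all keys in the secret, so the sketch does not depend on the name of the key that holds the configuration file.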
To resolve the issue, follow the scenario that applies:
1) Verify that the datacenter configured in the Kubernetes Cloud Provider storage settings of the TKGI tile is correct. Then check the running manifest of the BOSH deployment for the TKGI cluster to confirm that the right datacenter is set. If it is not, the datacenter value was probably changed in the tile settings but not applied to the TKGI cluster. Run the "Upgrade all clusters" errand, or run "tkgi upgrade-cluster" for an individual cluster.
2) If a vsphere-csi-controller deployment exists on the cluster with the wrong datacenter set in the vsphere-config-secret secret, check whether the "vSphere CSI Driver Integration" feature is enabled in the TKGI tile settings (Storage pane). If it is enabled, the csi-* jobs should be deployed on each master instance of the BOSH deployment for the TKGI cluster. For example:
$ bosh -d service-instance_89ff476b-####-34ad78183c63 is --ps
Instance Process Process State AZ IPs Deployment
master/ba4ba28e-####-a75d5e9971c9 - running az1 10.##.##.186 service-instance_89ff476b-####-34ad78183c63
~ csi-attacher running - - -
~ csi-controller running - - -
~ csi-livenessprobe running - - -
~ csi-provisioner running - - -
~ csi-resizer running - - -
~ csi-snapshotter running - - -
~ csi-syncer running - - -
......
In that case, the vsphere-csi-controller deployment on the TKGI cluster is no longer needed and should be deleted.
If the "vSphere CSI Driver Integration" feature is not enabled and the vsphere-csi-controller deployment is needed to provide CNS functionality, update the vsphere-config-secret secret with the correct datacenter, then restart all pods in the vsphere-csi-controller deployment.
$ kubectl rollout restart deployment/vsphere-csi-controller -n vmware-system-csi
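Which branch of scenario 2 applies depends on whether the csi-* jobs run on the master instances. A small filter over the "bosh ... is --ps" output makes that check explicit. This is a sketch only; bosh_csi_processes is a hypothetical helper name, and it assumes that process rows begin with "~" as in the sample output above.

```shell
# Hypothetical helper: read `bosh -d <deployment> is --ps` output on stdin and
# print "<process> <state>" for every csi-* process row (rows starting with "~").
bosh_csi_processes() {
  awk '$1 == "~" && $2 ~ /^csi-/ { print $2, $3 }'
}

# Example (deployment name is a placeholder):
#   if bosh -d <deployment> is --ps | bosh_csi_processes | grep -q '^csi-controller running$'; then
#     echo "CSI runs as BOSH jobs; the in-cluster vsphere-csi-controller deployment is redundant"
#   else
#     echo "no BOSH-managed CSI controller; fix vsphere-config-secret and restart the deployment"
#   fi
```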