Pod is stuck in ContainerCreating state due to persistent volume attachment failure

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

A pod on a TKGI cluster is stuck in the ContainerCreating state because its persistent volume cannot be attached. Describing the affected pod shows errors like the following.

$ kubectl describe pod pod1 -n ns1
Name:             pod1
Namespace:        ns1
Node:             07a5e858-####-3732d04ff5b2/10.##.##.7
Status:           Pending
IP:               
IPs:              <none>

Containers:
......
Events:
  Type     Reason              Age                From                     Message
  ----     ------              ----               ----                     -------
  Normal   Scheduled           18m                default-scheduler        Successfully assigned pod1 to 07a5e858-####-3732d04ff5b2
  Warning  FailedAttachVolume  65s (x9 over 17m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-71fd3122-####-ef4978785b69" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"4230df9e-####-96acfb4a1bb8". Error: virtual machine wasn't found

Environment

Tanzu Kubernetes Grid Integrated Edition

Cause

The error indicates that vsphere-csi-controller could not find the virtual machine backing the assigned node, so it could not attach the persistent volume. However, the details of Kubernetes cluster node 07a5e858-####-3732d04ff5b2 show that it is in fact linked to the vSphere virtual machine with providerID 4230df9e-####-96acfb4a1bb8.

$ kubectl get node 07a5e858-####-3732d04ff5b2 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"csi.vsphere.vmware.com":"4230df9e-####-96acfb4a1bb8"}'
  ......
  name: 07a5e858-####-3732d04ff5b2
  uid: 24db1d73-####-9d9527fc6bb9
spec:
  providerID: vsphere://4230df9e-####-96acfb4a1bb8
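
The same comparison can be made without reading the full node YAML. The following is a minimal sketch: the helper name provider_uuid is hypothetical, and NODE_NAME in the usage comment is a placeholder for the node shown in the pod events.

```shell
# Strip the "vsphere://" scheme from a providerID so the remaining VM UUID
# can be compared with the UUID in the FailedAttachVolume message.
provider_uuid() {
  printf '%s\n' "${1#vsphere://}"
}

# Usage (requires kubectl access to the cluster):
#   provider_uuid "$(kubectl get node NODE_NAME -o jsonpath='{.spec.providerID}')"
```

If the printed UUID matches the one in the error, the node-to-VM link is intact and the lookup itself is what fails.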

The vsphere-cloud-controller-manager job logs on the master instance also show that the virtual machine for the node with providerID 4230df9e-####-96acfb4a1bb8 was found in vCenter datacenter "DATACENTER_1".

master/0f8c3ec8-####-e3b59f19008d: stdout | I0501 03:58:51.965215  833372 search.go:208] Found node 4230df9e-####-96acfb4a1bb8 as vm=VirtualMachine:vm-277281 in vc=vc.example.net and datacenter=DATACENTER_1

However, the vsphere-csi-controller logs show that it searched for the virtual machine with providerID 4230df9e-####-96acfb4a1bb8 in a different datacenter, "DATACENTER_2", and failed with the reported error.

{"level":"warn","time":"2026-05-03T03:08:04.428441154Z","caller":"vsphere/virtualmachine.go:165","msg":"Couldn't find VM given uuid 4230df9e-####-96acfb4a1bb8 on DC Datacenter [Datacenter: Datacenter:datacenter-3 @ /DATACENTER_2, VirtualCenterHost: 10.##.##.62] with err: virtual machine wasn't found, continuing search","TraceId":"bd8f396d-####-bc884ff6b795"}
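
To see at a glance which datacenters the controller searched, the warning lines can be reduced to UUID and datacenter pairs. This is a rough sketch that assumes the log format shown above. The helper name csi_search_misses is hypothetical, and the container name in the usage comment is an assumption based on the upstream vSphere CSI manifests.

```shell
# Read vsphere-csi-controller log lines on stdin and print one
# "<vm-uuid> <datacenter-path>" pair per distinct failed lookup.
# The "." in "Couldn.t" matches the apostrophe and avoids shell quoting issues.
csi_search_misses() {
  sed -n 's/.*Couldn.t find VM given uuid \([^ ]*\) on DC Datacenter \[Datacenter: [^ ]* @ \([^,]*\),.*/\1 \2/p' | sort -u
}

# Usage (assumes the controller runs as a deployment on the cluster):
#   kubectl logs deployment/vsphere-csi-controller -n vmware-system-csi \
#     -c vsphere-csi-controller | csi_search_misses
```

A datacenter in this output that differs from the one reported by vsphere-cloud-controller-manager points to the mismatch described here.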

The issue therefore occurs because vsphere-csi-controller is searching the wrong datacenter for the virtual machine. Possible reasons are:

1) The datacenter configured in the Kubernetes Cloud Provider storage settings of the TKGI tile has been changed, but the change has not been applied to the TKGI cluster.

2) A separate vsphere-csi-controller deployment still exists on the TKGI cluster, and its datacenter value does not match the one configured in the Kubernetes Cloud Provider storage settings of the TKGI tile. Review the vsphere-config-secret secret in the vmware-system-csi namespace for the detailed settings.
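
For the second reason, the datacenter that the deployment uses can be read from the secret. The sketch below assumes that the secret stores its configuration under the key csi-vsphere.conf and that the file has a datacenters entry, as in the standard vSphere Container Storage Plug-in setup. Verify both on your cluster. The helper name extract_datacenters is hypothetical.

```shell
# Read csi-vsphere.conf content on stdin and print the value of every
# "datacenters" entry, without the surrounding double quotes.
extract_datacenters() {
  sed -n 's/^[[:space:]]*datacenters[[:space:]]*=[[:space:]]*"\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p'
}

# Usage (the key name csi-vsphere.conf is an assumption):
#   kubectl get secret vsphere-config-secret -n vmware-system-csi \
#     -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d | extract_datacenters
```

Compare the printed value with the datacenter configured in the TKGI tile.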

Resolution

To resolve the issue, follow the step that matches your scenario:

1) Verify that the datacenter configured in the Kubernetes Cloud Provider storage settings of the TKGI tile is correct. Then check the running manifest of the BOSH deployment for the TKGI cluster to confirm that the right datacenter is set. If it is not, the datacenter value was probably changed in the tile settings but not applied to the TKGI cluster. Run the "Upgrade all clusters" errand, or run the "tkgi upgrade-cluster" command for an individual cluster.

2) If a vsphere-csi-controller deployment exists on the cluster with the wrong datacenter set in the vsphere-config-secret secret, check whether the "vSphere CSI Driver Integration" feature is enabled in the TKGI tile settings (Storage pane). If it is enabled, the csi-* jobs should be deployed on each master instance of the BOSH deployment for the TKGI cluster. For example:

$ bosh -d service-instance_89ff476b-####-34ad78183c63 is --ps

Instance                                           Process                           Process State  AZ   IPs            Deployment    
master/ba4ba28e-####-a75d5e9971c9                  -                                 running        az1  10.##.##.186   service-instance_89ff476b-####-34ad78183c63  
~                                                  csi-attacher                      running        -    -              -  
~                                                  csi-controller                    running        -    -              -  
~                                                  csi-livenessprobe                 running        -    -              -  
~                                                  csi-provisioner                   running        -    -              -  
~                                                  csi-resizer                       running        -    -              -  
~                                                  csi-snapshotter                   running        -    -              -  
~                                                  csi-syncer                        running        -    -              -  
......

In that case, the vsphere-csi-controller deployment on the TKGI cluster is no longer needed and should be deleted.

If the "vSphere CSI Driver Integration" feature is not enabled and the vsphere-csi-controller deployment is needed to provide the CNS function, update the vsphere-config-secret secret with the correct datacenter, then restart all pods in the vsphere-csi-controller deployment.

$ kubectl rollout restart deployment/vsphere-csi-controller -n vmware-system-csi
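
One possible way to update the secret before the restart above is sketched here. It assumes the secret holds a single key named csi-vsphere.conf with a datacenters entry. The helper name set_datacenters and the file names are hypothetical, so review the resulting file before applying it.

```shell
# Read csi-vsphere.conf content on stdin and write it to stdout with every
# "datacenters" value replaced by the first argument.
# The new value must not contain "|" or "&", which are special to this sed call.
set_datacenters() {
  new_dc=$1
  sed "s|^\([[:space:]]*datacenters[[:space:]]*=[[:space:]]*\).*|\1\"${new_dc}\"|"
}

# Usage (key name and file names are assumptions):
#   kubectl get secret vsphere-config-secret -n vmware-system-csi \
#     -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d > csi-vsphere.conf.orig
#   set_datacenters DATACENTER_1 < csi-vsphere.conf.orig > csi-vsphere.conf
#   kubectl create secret generic vsphere-config-secret \
#     --from-file=csi-vsphere.conf -n vmware-system-csi \
#     --dry-run=client -o yaml | kubectl apply -f -
```

If the configuration lists more than one vCenter section, this helper rewrites the datacenters entry in all of them. In that case, edit the file by hand instead.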

Additional Information

References:

Deploying Cloud Native Storage (CNS) on vSphere

Create a Kubernetes Secret for vSphere Container Storage Plug-in

Persistent volumes cannot attach to a new node if previous node is deleted