ContainerCreating or Pending state.
The vsphere-csi-controller pod logs show "Node not found" errors that reference the node UUID:
Readonly:false Secrets:map[] VolumeContext:map[storage.kubernetes.io/csiProvisionerIdentity:1769022798126-6281-csi.vsphere.vmware.com type:vSphere CNS Block Volume] ###_NoUnkeyedLiteral:{} ###_unrecognized:[] ###_sizecache:0}","TraceId":"d#######-####-####-####-########13"}
{"level":"error","time":"YYYY-MM-DDTHH:MM:SS","caller":"node/manager.go:172","msg":"Node not found with nodeName 4######-####-####-####-########b"
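When the controller log is long, it can help to filter it for these entries programmatically. The following is a minimal sketch, not an official tool: it assumes the log has been saved as text with one complete JSON object per line, using the "level", "time", "caller" and "msg" fields seen in the excerpt above. The function name find_node_not_found and the sample values are hypothetical and only serve the illustration.

```python
import json

# Message prefix taken from the log excerpt above.
PREFIX = "Node not found with nodeName "


def find_node_not_found(lines):
    """Return (time, caller, node_id) tuples for 'Node not found' error entries.

    Assumption: each log line is a complete JSON object. Lines that are
    not valid JSON (for example truncated entries) are skipped.
    """
    hits = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON or truncated: ignore
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        if entry.get("level") == "error" and msg.startswith(PREFIX):
            node_id = msg[len(PREFIX):].strip()
            hits.append((entry.get("time"), entry.get("caller"), node_id))
    return hits


# Hypothetical sample input; the node identifier is a placeholder.
sample = [
    '{"level":"error","time":"YYYY-MM-DDTHH:MM:SS",'
    '"caller":"node/manager.go:172",'
    '"msg":"Node not found with nodeName example-node-uuid"}',
    '{"level":"info","time":"YYYY-MM-DDTHH:MM:SS",'
    '"caller":"example.go:1","msg":"unrelated"}',
    "not json",
]

for hit in find_node_not_found(sample):
    print(hit)
```

The node identifiers returned this way can then be compared with the UUIDs of the worker nodes that currently exist, to confirm that the controller is referring to a node it no longer knows.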
VMware vCenter Server
VMware vSphere Kubernetes Service
This issue is caused by an inherent limitation of PCI Passthrough devices: they bind a VM directly to the physical hardware of a specific ESXi host, so the VM cannot be live migrated.
When the VM is abruptly powered off to bypass the hang, the vSphere Container Storage Plug-in (CSI) does not receive a graceful detach signal. When the VM is powered on again, the CSI controller still references the old node UUID and believes that the volume is still attached to the old node.
Follow the considerations and best practices below for GPU-enabled workloads:
Refer to the following documentation for more information on how to configure NVIDIA GPU devices on Kubernetes workloads: Configuring NVIDIA GPU devices for Kubernetes Cluster.
If the environment requires PCI Passthrough on the worker nodes, consider the following operational procedure when putting the ESXi host into Maintenance Mode: