This article discusses a specific issue in the vSphere CSI driver, where duplicated detach volume calls after a worker node deletion lead to stale node VM information persisting in the cache. This stale information causes volume provisioning failures and pod scheduling issues.
Symptoms:
Customers using vSphere CSI driver versions 2.7.2 and earlier may experience persistent volume claim (PVC) provisioning and pod scheduling failures in Kubernetes. The symptoms manifest as PVCs that remain unbound and pods that fail to schedule with events reporting unbound immediate PersistentVolumeClaims, negatively impacting application deployment and scaling within the cluster.
Product Version: TKGi 1.16
The root cause lies in how the driver maintains its node cache. When the CSI driver receives an informer event for a node deletion, it removes the node's information from the cache, as expected. In certain scenarios, however, the driver then receives duplicated detach volume calls, and during these calls the node information is mistakenly re-added to the cache. When the node VM is later deleted from the vCenter inventory, the cache is not updated and retains the deleted node VM's information. Subsequent volume creation then fails because the driver relies on this stale entry.
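The sequence above can be sketched as a minimal in-memory cache model. All names here (nodeCache, detachVolume, the node and UUID strings) are illustrative and do not correspond to the driver's actual identifiers; the sketch only shows how a re-add during a duplicated detach can reintroduce an entry that the delete event had correctly removed:

```go
package main

import "fmt"

// nodeCache is a simplified stand-in for the driver's node VM cache,
// mapping node names to VM UUIDs. Purely illustrative.
type nodeCache map[string]string

// detachVolume models the pre-fix behavior: if the node is missing from
// the cache, it is re-discovered and re-added. That is harmless for a
// cache miss, but wrong when the node was removed because it was deleted.
func detachVolume(c nodeCache, node, uuid string) {
	if _, ok := c[node]; !ok {
		c[node] = uuid // buggy re-add of a deleted node's info
	}
	// ... detach logic using c[node] would follow here ...
}

func main() {
	c := nodeCache{}
	c["worker-1"] = "vm-uuid-1" // node registered normally
	delete(c, "worker-1")       // informer delete event removes it

	// A duplicated detach call arrives after the deletion:
	detachVolume(c, "worker-1", "vm-uuid-1")

	_, stale := c["worker-1"]
	fmt.Println(stale) // true: the stale entry is back in the cache
}
```

Once the VM itself is removed from the vCenter inventory, nothing corrects this entry, which is why later volume provisioning fails.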
The issue has been addressed in vSphere CSI driver version 2.7.3 and TKGi version 1.16.6 by implementing a change that prevents the addition of node information to the cache during attach or detach operations. This adjustment ensures the cache remains accurate, reflecting the true state of the Kubernetes cluster and resolving the provisioning and scheduling issues.
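The shape of the fix can be sketched the same way: attach/detach operations only read from the cache and surface an error on a miss, so informer events remain the sole writer. Again, the names are hypothetical, not the driver's real API:

```go
package main

import "fmt"

// cache maps node names to VM UUIDs; only informer event handlers
// (not shown) are allowed to mutate it after the fix.
type cache map[string]string

// detachVolumeFixed models the post-fix behavior: a cache miss is
// reported as an error instead of triggering a re-add.
func detachVolumeFixed(c cache, node string) error {
	if _, ok := c[node]; !ok {
		return fmt.Errorf("node %q not found in cache", node)
	}
	// ... detach logic using the cached VM info would follow here ...
	return nil
}

func main() {
	c := cache{"worker-1": "vm-uuid-1"}
	delete(c, "worker-1") // informer delete event removes the node

	err := detachVolumeFixed(c, "worker-1")
	fmt.Println(err != nil) // true: duplicated detach errors out

	_, stale := c["worker-1"]
	fmt.Println(stale) // false: the cache stays accurate
}
```

Because the duplicated detach can no longer repopulate the cache, the cache keeps reflecting the true state of the cluster.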
For users on affected versions (CSI driver 2.7.2 and earlier, TKGi versions before 1.16.6), restarting the CSI controller clears the stale node VM information from the cache and can serve as a temporary workaround:
sudo monit restart csi-controller
However, upgrading to the fixed versions of the CSI driver and TKGi is recommended for a permanent solution.