vCenter Disconnection Causes CNS CSI Driver Race Condition — Pods Stuck in Pending State

Article ID: 433510

Products

VMware vDefend Firewall

Issue/Introduction

After a prolonged vCenter disconnection, reconnecting vCenter to SSP causes the control plane and worker nodes to restart. Once the SSP cluster recovers, multiple pods in the nsxi-platform namespace may remain in Pending status because their PersistentVolumeClaims (PVCs) cannot bind — the CSI provisioner is unable to enumerate shared datastores due to a stale VM reference left in the vCenter inventory.

Symptoms

Pending pods (kubectl get pods -A | grep Pending):

nsxi-platform   contextcorrelator-50e1fc9cd9aac642-exec-1    0/1   Pending   0   5d22h
nsxi-platform   overflowcorrelator-7ead0c9ce40c4adb-exec-1   0/1   Pending   0   3d22h
nsxi-platform   overflowcorrelator-7ead0c9ce40c4adb-exec-2   0/1   Pending   0   3d22h
nsxi-platform   overflowcorrelator-7ead0c9ce40c4adb-exec-3   0/1   Pending   0   3d22h
nsxi-platform   rawflowcorrelator-9017709ce4089d9e-exec-1    0/1   Pending   0   3d22h
nsxi-platform   rawflowcorrelator-9017709ce4089d9e-exec-2    0/1   Pending   0   3d22h
nsxi-platform   rawflowcorrelator-9017709ce4089d9e-exec-3    0/1   Pending   0   3d22h

PVC event log:

Waiting for a volume to be created either by the external provisioner
'csi.vsphere.vmware.com' or manually by the system administrator.
If volume creation is delayed, please verify that the provisioner is
running and correctly registered.

CSI controller log (kubectl logs <csi-controller-pod> -n vmware-system-csi):

The object 'vim.VirtualMachine:vm-223490' has already been deleted
or has not been completely created
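
To see which VM references the controller is failing on, the log can be filtered for this error. The function below is a minimal sketch: stale_vm_refs is a hypothetical helper name, not part of any VMware tooling, and it assumes the error text and the vim.VirtualMachine:vm-<number> reference appear on the same log line, as in the excerpt above.

```shell
# Hypothetical helper: reads CSI controller log text on stdin and prints each
# distinct VM reference that appears in a "has already been deleted" error.
# Assumes the error text and the VM reference are on the same log line.
stale_vm_refs() {
  grep 'has already been deleted' \
    | grep -oE 'vim\.VirtualMachine:vm-[0-9]+' \
    | sort -u
}

# Usage against a live cluster (placeholder pod name as used in this article):
#   kubectl logs <csi-controller-pod> -n vmware-system-csi | stale_vm_refs
```

Empty output suggests the stale-reference error is not present in the log that was read, in which case the cause may be different from the one described in this article.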

PVC and PV status:

  • kubectl get pvc -A | grep Pending shows the affected PVCs in Pending state.
  • kubectl get pv shows that no PV has been created for these PVCs.
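
The PVC events for every affected claim can be collected in one pass instead of describing each PVC by hand. This is a sketch only: pending_pvcs is a hypothetical helper name, and it assumes the default column layout of kubectl get pvc -A --no-headers, where STATUS is the third field.

```shell
# Hypothetical helper: reads `kubectl get pvc -A --no-headers` output on stdin
# and prints "<namespace> <name>" for every PVC whose STATUS column (field 3)
# is exactly "Pending".
pending_pvcs() {
  awk '$3 == "Pending" { print $1, $2 }'
}

# Usage against a live cluster: show the events of each Pending PVC.
#   kubectl get pvc -A --no-headers | pending_pvcs |
#     while read -r ns name; do kubectl describe pvc "$name" -n "$ns"; done
```

Matching on the STATUS field rather than using grep avoids false matches on PVC names that happen to contain the word Pending.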

Environment

  • SSP 5.0 (impacted CNS driver)
  • SSP 5.1.x (updated CNS driver, but the issue can still occur)

Cause

This is a known CNS/CSI driver race condition triggered by the following sequence of events:

  • A prolonged vCenter disconnection causes the SSP control plane and worker nodes to restart upon reconnection.
  • During recovery, one or more VM objects (e.g., vim.VirtualMachine:vm-223490) are left in a deleted or incompletely created state in the vCenter inventory.
  • When the CSI provisioner attempts to enumerate shared datastores across all cluster VMs to fulfill a PVC request, it encounters the stale VM reference and returns an internal error.
  • This causes the entire datastore discovery to fail, blocking all new PVC provisioning until the stale reference is cleared and the CSI controller is restarted.

Resolution

Restart the CSI driver controller deployment and all CSI node driver pods. This forces the provisioner to re-enumerate the vCenter inventory, clearing the stale VM reference and unblocking PVC provisioning.

Step 1 — Restart the CSI Controller

  • Perform a rolling restart of the CSI controller deployment:
kubectl rollout restart deployment <csi-controller-name> -n vmware-system-csi
  • Monitor the rollout until it completes and all pods are Running:
kubectl rollout status deployment <csi-controller-name> -n vmware-system-csi

Step 2 — Restart the CSI Node Driver Pods

  • Delete all CSI node driver pods to force a restart (the DaemonSet recreates them):
kubectl delete pods <csi-node-pod-1> <csi-node-pod-2> ... -n vmware-system-csi
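
The node driver pod names for the delete command can be listed with a small filter. This is a sketch under assumptions: pods_matching is a hypothetical helper name, and the substring csi-node is a guess at how the node driver pods are named, so check the actual pod names in the vmware-system-csi namespace first.

```shell
# Hypothetical helper: reads `kubectl get pods -n vmware-system-csi --no-headers`
# output on stdin and prints the pod names (field 1) that contain the given
# substring. The substring "csi-node" used below is an assumption; verify it
# against the real pod names before relying on it.
pods_matching() {
  awk -v pat="$1" 'index($1, pat) > 0 { print $1 }'
}

# Usage against a live cluster: print the candidate node driver pods for review.
#   kubectl get pods -n vmware-system-csi --no-headers | pods_matching csi-node
```

Review the printed names before passing them to the delete command, so that the controller pods restarted in Step 1 are not deleted by mistake.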

Step 3 — Verification

  • Confirm no pods remain in Pending state:
kubectl get pods -A | grep Pending
  • Confirm PVCs have moved out of Pending state:
kubectl get pvc -A | grep Pending
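
Binding may take some time after the restart, so the two checks above can be polled rather than rerun by hand. The sketch below uses a hypothetical helper name, count_pending, and assumes the default kubectl column layout: STATUS is field 4 for get pods -A and field 3 for get pvc -A.

```shell
# Hypothetical helper: reads `kubectl get ... --no-headers` output on stdin and
# prints how many rows have "Pending" in the given column.
# STATUS is column 4 for `get pods -A` and column 3 for `get pvc -A`.
count_pending() {
  awk -v col="$1" '$col == "Pending" { n++ } END { print n + 0 }'
}

# Usage against a live cluster: poll every 10 seconds until no PVC is Pending.
#   while [ "$(kubectl get pvc -A --no-headers 2>/dev/null | count_pending 3)" -gt 0 ]; do
#     sleep 10
#   done
```

If the count does not drop, recheck the CSI controller log for the vim.VirtualMachine error before repeating the restart.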

Once PVCs bind successfully, the Spark executor pods (contextcorrelator, overflowcorrelator, rawflowcorrelator) will be scheduled and transition to Running automatically.

Additional Information

Note: This issue presents similarly to KB 388841 (Rawflowcorrelator pod stuck in Pending state), but the root causes and resolutions are distinct.

The issue in KB 388841 is caused by a failed PVC creation during Spark driver startup due to slow network conditions, and is resolved by restarting the Spark driver pod.

The issue in this article (KB 433510) is caused by a CSI provisioner failure due to a stale VM reference following a vCenter disconnection event. Restarting the Spark driver will not resolve it.

Confirm the cause by checking CSI controller logs for the vim.VirtualMachine error before proceeding with either resolution.