Spark App Pods in Pending State — PVC Creation Failing Due to VMFS vclock Corruption

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

On the SSP side, pods remain in Pending status while waiting for PVC creation.

This is a vSphere/ESXi infrastructure issue, not an SSP software defect. However, it directly impacts SSP and any Tanzu Kubernetes cluster that relies on CNS/VMFS storage: PVC creation fails, which causes pod scheduling failures across multiple SSP workloads.

SSP pods affected — stuck in Pending state:

nsxi-platform       overflowcorrelator-8ec88e9c96637f93-exec-1                        0/1     Pending     0              2d      <none>          <none>                           <none>           <none>
nsxi-platform       overflowcorrelator-8ec88e9c96637f93-exec-2                        0/1     Pending     0              2d      <none>          <none>                           <none>           <none>
nsxi-platform       overflowcorrelator-8ec88e9c96637f93-exec-3                        0/1     Pending     0              2d      <none>          <none>                           <none>           <none>
nsxi-platform       rawflowcorrelator-f2b8549c966379f5-exec-1                         0/1     Pending     0              2d      <none>          <none>                           <none>           <none>
nsxi-platform       rawflowcorrelator-f2b8549c966379f5-exec-2                         0/1     Pending     0              2d      <none>          <none>                           <none>           <none>
nsxi-platform       rawflowcorrelator-f2b8549c966379f5-exec-3                         0/1     Pending     0              2d      <none>          <none>                           <none>           <none>
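To list the affected pods from a saved pod listing, a small filter can help. The following is a minimal sketch, assuming the listing was captured as text (for example from `kubectl get pods -A -o wide`) and that the columns appear in the order namespace, name, ready, status, as in the output above. The function name `pending_pods` is hypothetical.

```python
def pending_pods(listing: str) -> list[tuple[str, str]]:
    """Return (namespace, pod) pairs whose STATUS column is Pending."""
    found = []
    for line in listing.splitlines():
        cols = line.split()
        # Assumed column order: NAMESPACE NAME READY STATUS RESTARTS AGE ...
        if len(cols) >= 4 and cols[3] == "Pending":
            found.append((cols[0], cols[1]))
    return found


# Two rows copied from the listing above, whitespace condensed.
sample = """\
nsxi-platform  overflowcorrelator-8ec88e9c96637f93-exec-1  0/1  Pending  0  2d  <none>  <none>  <none>  <none>
nsxi-platform  rawflowcorrelator-f2b8549c966379f5-exec-1   0/1  Pending  0  2d  <none>  <none>  <none>  <none>
"""

for namespace, pod in pending_pods(sample):
    print(f"{namespace}/{pod}")
```

A header row or a pod in any other state is skipped, because its fourth column is not the literal string Pending.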

Pod scheduling error:

Warning FailedScheduling: 0/8 nodes are available:
pod has unbound immediate PersistentVolumeClaims

Error Messages

ESXi hostd log — VSLM disk creation failure:

Vslm Failure: VslmCreateDisk failed for fcd on datastore
/vmfs/volumes/<datastore-id>/
with type vim.fault.DatabaseError
Fault cause: vim.fault.DatabaseError

ESXi vmkernel log — ATS lock failure:

DLX: vol 'GLC-xxxxx-<id>', lock at <offset>:
Lock type: 10C00001. [Req mode 1]
try lock error: Atomic test and set of disk block
returned false for equality

CSI controller log — CNS volume creation failure:

failed to create disk <pvc-name> with error:
failed to create volume with fault:
CnsFault error: VSLM task failed
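When triaging collected logs, it can be useful to confirm that all three signatures above appear together. The following is a minimal sketch that counts matching lines per layer, using only the strings quoted in this article. The names `SIGNATURES`, `classify`, and `summarize` are hypothetical, and the sketch assumes the log lines have already been read into a list of strings.

```python
# Substrings taken from the error messages quoted in this article.
SIGNATURES = {
    "hostd (VSLM)": ("VslmCreateDisk failed", "vim.fault.DatabaseError"),
    "vmkernel (ATS lock)": ("Atomic test and set of disk block",),
    "CSI controller (CNS)": ("CnsFault error: VSLM task failed",),
}


def classify(line: str) -> list[str]:
    """Return the layers whose signature substrings occur in the line."""
    return [
        layer
        for layer, needles in SIGNATURES.items()
        if any(needle in line for needle in needles)
    ]


def summarize(lines) -> dict[str, int]:
    """Count how many lines match each layer's signature."""
    counts = {layer: 0 for layer in SIGNATURES}
    for line in lines:
        for layer in classify(line):
            counts[layer] += 1
    return counts


sample_lines = [
    "Vslm Failure: VslmCreateDisk failed for fcd on datastore",
    "with type vim.fault.DatabaseError",
    "try lock error: Atomic test and set of disk block",
    "CnsFault error: VSLM task failed",
    "an unrelated log line",
]
print(summarize(sample_lines))
```

Substring matching is deliberately loose here. A nonzero count in all three layers is consistent with the failure chain described in this article, but it does not by itself prove vclock corruption.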

Environment

SSP 5.0
SSP 5.1.0
SSP 5.1.1
Any Tanzu Kubernetes cluster using vSphere CSI driver with VMFS datastore

Cause

The core issue is VMFS-level corruption of the vclock file on the vDefend datastore, which caused the file to stop advancing its tick counter. The vclock mechanism uses a file rename operation to atomically increment its counter. When that rename operation failed at the VMFS layer, the FCD catalog database became unable to register new disk operations.

As a result, every new createDisk call from the CSI driver failed with vim.fault.DatabaseError (Database temporarily unavailable or has network problems). Existing PVCs continued to work normally because they do not require new catalog entries; only new PVC provisioning was blocked.

The vclock file exhibited ghost-file behavior: it appeared in directory listings, but all file operations (rm, mv) failed with No such file or directory, which indicates VMFS-level corruption of the catalog entry.
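The ghost-file symptom described above (an entry that is listed but cannot be operated on) can be expressed as a read-only check. The following is a minimal sketch using only the Python standard library. The function name `ghost_entries` is hypothetical, the directory to inspect is left to the operator, and whether it is appropriate to run any script on an affected host should be confirmed with Broadcom Support first.

```python
import os


def ghost_entries(directory: str) -> list[str]:
    """Return names that appear in the directory listing but cannot be stat'ed.

    Read-only: nothing is created, renamed, or removed.
    """
    ghosts = []
    for name in os.listdir(directory):
        try:
            # lstat does not follow symlinks, so a dangling symlink
            # is not misreported as a ghost entry.
            os.lstat(os.path.join(directory, name))
        except FileNotFoundError:
            # Listed, yet "No such file or directory" on access.
            ghosts.append(name)
    return ghosts
```

On a healthy directory the function returns an empty list. A file that is deleted by another process between the listing and the lstat call would also be reported, so a positive result should be rechecked before drawing conclusions.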

Resolution

This issue requires collaboration across multiple teams. Do not attempt the vclock remediation steps without involving the appropriate teams:

Storage Admin — responsible for the underlying datastore health, ATS configuration, or vSAN storage layer
vSphere Admin — responsible for ESXi host operations including hostd restarts, datastore reconciliation, and MOB page operations
Broadcom Support — should be engaged to oversee the vclock file remediation procedure and confirm the fix is appropriate for the specific environment before any changes are made

Engage Broadcom Support by opening a support request and providing ESXi support bundles, hostd logs, vmkernel logs, and catalog logs covering the period when the issue started. Do not perform the vclock recreation steps in a production environment without Broadcom Support guidance.