vsan-health crashes with error "Failing pending CNS tasks during startup"

Article ID: 390765

Products

VMware vSphere Kubernetes Service

Issue/Introduction

CNS deleteVolume API does not delete the volume from the CNS DB but sets the mark_for_delete flag to true.
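
For example, a volume removed through deleteVolume keeps its row in the CNS database with the flag set; a query along the lines of the sketch below (it reuses the cns.volume_info table and columns referenced in the Resolution section, and the exact statement is an assumption) would still return it.

    select volume_id, mark_for_delete from cns.volume_info where mark_for_delete=true limit 10;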

Symptoms:

  • vSAN health for the cluster cannot be retrieved from vCenter.

  • The vsan-health service is not in a running state on vCenter.

  • Starting the service succeeds, but it crashes again within a couple of minutes.

  • When you check vsanvcmgmtd.log under /var/log/vmware/vsan-health on the vCenter Server, you may see events similar to the following:

2025-03-07T04:32:17.390Z info vsanvcmgmtd[580209] [vSAN@6876 sub=PHM::PhmInventoryListener opId=vsan-wfu-2b03] ProcessUpdate called
2025-03-07T04:32:17.390Z info vsanvcmgmtd[580209] [vSAN@6876 sub=PHM::PhmInventoryListener opId=vsan-wfu-2b03] ProcessUpdate: Update kind: 'enter' or 'leave'. Ignoring the update
2025-03-07T04:32:17.390Z info vsanvcmgmtd[580113] [vSAN@6876 sub=CnsDb] Loaded 34195 volumes out of 34195 volumes from DB.
2025-03-07T04:32:17.391Z info vsanvcmgmtd[580113] [vSAN@6876 sub=pcs[0]] Registered listener '[CnsDatastoreListener:0x000055a5374f3d60]'
2025-03-07T04:32:17.391Z info vsanvcmgmtd[580211] [vSAN@6876 sub=CnsTask] Fail Cns InProgress Tasks
2025-03-07T04:32:17.391Z info vsanvcmgmtd[580211] [vSAN@6876 sub=PropertyCollectorService] CNS: Gathering CNS Tasks
2025-03-07T04:32:17.394Z info vsanvcmgmtd[580217] [vSAN@6876 sub=pcs[0]] Started listerner '[CnsDatastoreListener:0x000055a5374f3d60]'
2025-03-07T04:32:17.394Z info vsanvcmgmtd[580211] [vSAN@6876 sub=vmomi.soapStub[5]] SOAP request returned HTTP failure; <<io_obj p:0x00007fa3a40dbb40, h:27, <TCP '127.0.0.1 : 37400'>, <TCP '127.0.0.1 : 1080'>>, /sdk>, method: GetRecentTask; code: 500(Internal Server Error); fault: (vim.fault.NotAuthenticated) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    object = 'vim.TaskManager:8cbe3917-25c7-4cdc-a28f-53ece89a068e:TaskManager',
-->    privilegeId = "",
-->    missingPrivileges = <unset>
-->    msg = "Received SOAP response fault from [<<io_obj p:0x00007fa3a40dbb40, h:27, <TCP '127.0.0.1 : 37400'>, <TCP '127.0.0.1 : 1080'>>, /sdk>]: GetRecentTask
--> The session is not authenticated."
--> }
2025-03-07T04:32:17.410Z info vsanvcmgmtd[580221] [vSAN@6876 sub=CnsCatalogSvc opId=vsan-wfu-2b03] CNS: CatalogService is initialized successfully
2025-03-07T04:32:17.410Z info vsanvcmgmtd[580211] [vSAN@6876 sub=VpxdCnx] Login to the destination, SessionKey: 5237f67a-ae2b-fdd1-51c8-463a17b87ff5
2025-03-07T04:32:17.410Z info vsanvcmgmtd[580211] [vSAN@6876 sub=VpxdCnx] Recovered session, sid: 1, recoverRequestOnly:false
2025-03-07T04:32:17.416Z info vsanvcmgmtd[580113] [vSAN@6876 sub=pcs[0]] Registered listener '[CnsHostListener:0x000055a537768580]'
2025-03-07T04:32:17.420Z info vsanvcmgmtd[580244] [vSAN@6876 sub=CnsCatalogSvc] Find file service cluster vim.ClusterComputeResource:domain-c138 for datastore ds:///vmfs/volumes/vsan:5253xxxxxxxx6ec0-afcc089xxxxxxxx1/
2025-03-07T04:32:17.421Z info vsanvcmgmtd[580211] [vSAN@6876 sub=PropertyCollectorService] CNS: Finish gathering CNS Tasks. Total=16, CNS=1
2025-03-07T04:32:17.421Z info vsanvcmgmtd[580211] [vSAN@6876 sub=CnsTask] Total 1 old CNS tasks are found
2025-03-07T04:32:17.425Z info vsanvcmgmtd[580244] [vSAN@6876 sub=PyCppVmomi] Initialized python thread state 0x00007fa3943c3290.
2025-03-07T04:32:17.428Z info vsanvcmgmtd[580222] [vSAN@6876 sub=pcs[0]] Started listerner '[CnsHostListener:0x000055a537768580]'
2025-03-07T04:32:17.430Z info vsanvcmgmtd[580211] [vSAN@6876 sub=CnsTask] Old task=(vim.TaskInfo) {
-->    key = "task-47743075",
-->    task = 'vim.Task:8cbe3917-25c7-4cdc-a28f-53ece89a068e:task-47743075',
-->    descriptionId = "com.vmware.cns.tasks.updatevolume",
-->    entity = 'vim.Folder:8cbxxxxx-25xx-4xxc-xx8f-53ecxxxxxx8e:group-d1',
-->    entityName = "Datacenters",
-->    state = "running",
-->    cancelled = false,
-->    cancelable = false,
-->    error = (vmodl.fault.SystemError) {
-->       reason = "Failing pending CNS tasks during startup",
-->       msg = "",
-->    },
-->    progress = 0,
-->    reason = (vim.TaskReasonUser) {
-->       userName = "com.vmware.cns"
-->    },
-->    queueTime = "2025-03-07T03:35:49.30836Z",
-->    startTime = "2025-03-07T03:35:49.31687Z",
-->    eventChainId = 74107978,
-->    activationId = "3ae1c86d",
--> }
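
The service state can be confirmed from the vCenter Server appliance shell before and after a restart attempt; the check below assumes the vmon-cli tooling used later in this article also supports --status on your build.

    vmon-cli --status vsan-health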

Environment

Cloud Native Storage

vSphere with Tanzu

Cause

CNS periodic sync, which runs at a 1-minute interval, removes deleted volumes from the CNS DB. For each datastore, FCD on the hostd side maintains a folder called the catalog, which serves as the source of truth for CNS. The catalog records all volume operations (create, delete, update) occurring on the datastore. CNS periodic sync reads all of these records from the catalog and then processes them.

In this case, the system is creating and deleting volumes at such a high frequency that the catalog keeps growing while CNS is still reading it. Because new records keep arriving, CNS never reaches the processing stage of periodic sync, so stale volumes are never removed from the CNS DB and their number keeps increasing.

When the CNS DB contains too many volumes and CSI calls QueryVolume, vsanvcmgmtd crashes because the SQL query issued by QueryVolume ends up with too many volume IDs in its WHERE clause.
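
For illustration only, the failing pattern has roughly the shape sketched below; the actual statement is generated internally by vsanvcmgmtd, so the selected columns and volume IDs here are placeholders.

    -- Illustrative sketch only; the real query is built internally by vsanvcmgmtd.
    -- With tens of thousands of stale volumes, the IN list grows until the call fails.
    select * from cns.volume_info where volume_id in ('vol-0001', 'vol-0002', /* ...tens of thousands of IDs... */ 'vol-34195');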

Resolution

The issue of QueryVolume causing vsanvcmgmtd to crash has been fixed in 9.0.

The workaround is to reduce the load on CNS by temporarily scaling the CSI controller down to 0:

  1. Scale the CSI controller deployment to 0.

     kubectl -n vmware-system-csi scale deployment vsphere-csi-controller --replicas=0

  2. Start vsan-health if it is stopped, and wait for CNS periodic sync to catch up reading all the catalog records and remove the stale volumes from the DB.

     vmon-cli --start vsan-health

  3. Monitor the number of stale volumes using the query below (see the example after this list for one way to run it on the vCenter Server appliance).

     select count(volume_id) from cns.volume_info v where v.mark_for_delete=true;

  4. Scale the CSI controller back up.

     kubectl -n vmware-system-csi scale deployment vsphere-csi-controller --replicas=3
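
The steps above can be sanity-checked with the commands below. This is a sketch that assumes the standard vCenter Server Appliance layout (the embedded psql under /opt/vmware/vpostgres/current/bin and the VCDB database) and that kubectl is run from wherever you normally manage the vmware-system-csi namespace; adjust the paths and names if your environment differs.

     # Confirm the CSI controller deployment scaled as expected (steps 1 and 4)
     kubectl -n vmware-system-csi get deployment vsphere-csi-controller

     # Follow CNS activity in vsanvcmgmtd.log while periodic sync catches up (step 2)
     tail -f /var/log/vmware/vsan-health/vsanvcmgmtd.log | grep -i cns

     # Run the monitoring query from step 3 against the vCenter database
     /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -c "select count(volume_id) from cns.volume_info v where v.mark_for_delete=true;"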

After the workaround, if the workload remains heavy, the number of stale volumes in the CNS DB will eventually start increasing again and the same situation will recur. It is better to split the workload across datastores so that CNS can keep up with reading the records from each catalog.