info vsanvcmgmtd[23472] [vSAN@6876 sub=CnsTask opID=<ID>] A com.vmware.cns.tasks.detachvolume task is created: task-<ID>
info vsanvcmgmtd[23472] [vSAN@6876 sub=FcdService opID=<ID>] Volume <ID> is attached to vm vm-<ID>
info vsanvcmgmtd[23472] [vSAN@6876 sub=WorkflowManager opID=<ID>] Detach volume task conflicting with resource <ID>. <number of pending tasks> tasks are already in queue
info vsanvcmgmtd[23472] [vSAN@6876 sub=VsanTaskSvc opID=<ID>] ADD public task 'task-<ID>', total: 930721
info vsanvcmgmtd[23472] [vSAN@6876 sub=AdapterServer opID=<ID>] Finished 'detach' on 'cns-volume-manager' (60 ms): done
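These entries come from the vsanvcmgmtd log on the vCenter Server Appliance. As a rough sketch (the log path shown is the usual location on recent VCSA builds and is an assumption here), the queued CNS task entries can be located with:

grep 'com.vmware.cns.tasks' /var/log/vmware/vsan-health/vsanvcmgmtd.log

A steadily growing "tasks are already in queue" count in the WorkflowManager lines is the signature of the backlog described in the Cause section below.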
delete = true,
clusterId = "<ID of the cluster object in the vCenter Server>",
entityType = "PERSISTENT_VOLUME_CLAIM",
namespace = "ai",
referredEntity = (vim.cns.KubernetesEntityReference) [
   (vim.cns.KubernetesEntityReference) {
      entityType = "PERSISTENT_VOLUME",
      entityName = "<Name of the persistent volume>",
      clusterId = "<ID of the cluster object in the vCenter Server>",
Name:            pvc-<ID>
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
                 volume.kubernetes.io/provisioner-deletion-secret-name:
                 volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers:      [kubernetes.io/pv-protection external-attacher/csi-vsphere-vmware-com]
StorageClass:    <storage class name>
Status:          Released
Events:
  Type     Reason              Age                     From                                                 Message
  ----     ------              ----                    ----                                                 -------
  Warning  VolumeFailedDelete  #m#s (x495 over 2d19h)  csi.vsphere.vmware.com_vsphere-csi-controller-<ID>  rpc error: code = Internal desc = persistentVolumeClaim: <Volume Handle ID of the PV> on namespace: <namespace inside the supervisor cluster where the guest cluster is deployed> in supervisor cluster was not deleted. Error: persistentVolumeClaim <namespace>/<volume handle ID> is not deleted within 240 seconds: message: unable to fetch PersistentVolumeClaim <namespace>/<volume handle ID> with err: client rate limiter Wait returned an error: context deadline exceeded.
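The persistent volume details above can typically be collected from the guest cluster with kubectl describe against the released PV, for example (the PV name is a placeholder):

kubectl describe pv pvc-<ID>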
<namespace>   persistentvolumeclaim/<PVC Name/Volume Handle ID>   Terminating   pvc-<ID>   <capacity>   <provisioning type>

{"level":"error","time":"<date>T<time>","caller":"cnsnodevmattachment/cnsnodevmattachment_controller.go:584","msg":"failed to detach disk: \"<ID>\" to nodevm: VirtualMachine:vm-<ID> [VirtualCenterHost: <vCenter Server Hostname>, UUID: <ID>, Datacenter: Datacenter [Datacenter: Datacenter:<ID>, VirtualCenterHost: <vCenter Server Hostname>]] for CnsNodeVmAttachment request with name: \"<CnsNodeVmAttachmentID>\" on namespace: \"<namespace>\". Err: time out for task Task:task-<ID> before response from CNS","TraceId":"<ID>","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer/cnsoperator/controller/cnsnodevmattachment.(*ReconcileCnsNodeVMAttachment).Reconcile.func1\n

VCDB=# select * from vpx_task where task_id=<ID obtained from CNS logs in step 2>;
 task_id | name | descriptionid | entity_id | entity_type | entity_name | locked_data | complete_state | cancelled | cancellable | error_data | result_data | progress | reason_data | queue_time | start_time | complete_time | event_chain_id | username | vm_id | host_id | computeresource_id | datacenter_id | resourcepool_id | folder_id | alarm_id | scheduledtask_id | change_tag_id | parent_task_id | root_task_id | description | activation_id | continuous | no_of_reattempts | preserved_session_uuid | activation_method_name | activation_method_arguments
---------+------+-----------------------------------+-----------+-------------+---------------+-------------+----------------+-----------+-------------+------------+-------------+----------+-------------+-----------------+------------+---------------+----------------+----------------+--------+---------+--------------------+---------------+-----------------+-----------+----------+------------------+---------------+----------------+--------------+-------------+---------------+------------+------------------+------------------------+------------------------+-----------------------------
 <ID> | | com.vmware.cns.tasks.detachvolume | <ID> | 0 | <Nodepool-ID> | | queued | 0 | 0 | | | | <obj xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:vim25" versionId="8.0.3.0" xsi:type="TaskReasonUser"><userName>com.vmware.cns</userName></obj> | <date and time> | | | <session_id> | com.vmware.cns | 195978 | 1141 | 1003 | 3 | | | | | | | | | <ID> | 0 | 0 | |
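As a hedged sketch, the vpx_task query above can be run from the vCenter Server Appliance shell using the embedded vPostgres client; the psql path and the VCDB database name are the usual defaults and may differ in your environment:

/opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB

Then run the select against vpx_task as shown above, substituting the task ID obtained from the CNS logs.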
VMware vSphere CSI Driver
VMware vSphere Kubernetes Service
When volumes are deleted in CNS, they are not removed immediately; instead, the 'mark_for_delete' flag is set to true and the CNS service relies on its periodic sync to complete the deletion. However, when the vsan-health service is under very heavy load, with creates and deletes both arriving every few seconds, the periodic sync keeps looping while fetching catalog changes. As a result, CNS fetches the changes over and over, never reaches the processing stage, and the marked volumes are never purged.
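To illustrate the soft-delete behavior, the backlog of volumes waiting on the periodic sync can be inspected in the CNS schema referenced in the workaround below; a minimal sketch, assuming the cns.volume_info table layout used there:

select mark_for_delete, count(*) from cns.volume_info group by mark_for_delete;

A large and growing count for mark_for_delete = true indicates the periodic sync is not keeping up.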
This issue is addressed in vSphere 9.0 and above.
If the vCenter Server is on version 8.x or earlier, the workaround below reduces the load on the CNS service so it can catch up on reading all the records and remove the stale volumes from the database.
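Before scaling down, it can help to note the current state of the controller deployment so the replica count restored in the final step matches the environment (a hedged check; the namespace and deployment name are taken from the commands below):

kubectl -n vmware-system-csi get deployment vsphere-csi-controller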
kubectl -n vmware-system-csi scale deployment vsphere-csi-controller --replicas=0
select count(volume_id) from cns.volume_info where mark_for_delete=true;
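The same count can be run non-interactively from the vCenter Server Appliance shell; the psql path and the VCDB database name are assumptions based on the default embedded vPostgres layout. Per the workaround description, re-run it periodically and wait for the count to drain toward zero before scaling the controller back up:

/opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB -c "select count(volume_id) from cns.volume_info where mark_for_delete=true;"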
kubectl -n vmware-system-csi scale deployment vsphere-csi-controller --replicas=3
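After scaling back up, confirm the controller deployment reports the expected replicas and the pods return to Running before resuming normal provisioning (pod names are prefixed with the deployment name):

kubectl -n vmware-system-csi get deployment vsphere-csi-controller
kubectl -n vmware-system-csi get pods | grep vsphere-csi-controller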