"Attach container volumes" tasks on vSphere client fail with error "The object or item referred to could not be found"

Products

VMware vSphere Kubernetes Service

Issue/Introduction

Volume attachment tasks (attach tasks on the vCenter Server) appear continuously in the vSphere Client and fail with the error "The object or item referred to could not be found".
Per the vsan-health logs, the CNS service is busy processing other tasks in the queue and cannot complete the attach task as a result. This is seen for multiple task IDs.

info vsanvcmgmtd[] [vSAN@6876 sub=CnsTask opID=3255d46f] A com.vmware.cns.tasks.attachvolume task is created: task-<ID>
info vsanvcmgmtd[] [vSAN@6876 sub=WorkflowManager opID=<ID>] Attach volume task conflicting with resource <ID>. 0 tasks are already in queue
info vsanvcmgmtd[] [vSAN@6876 sub=VsanTaskSvc opID=<ID>] ADD public task 'task-<ID>', total:
info vsanvcmgmtd[] [vSAN@6876 sub=AdapterServer opID=<ID>] Finished 'attach' on 'cns-volume-manager' (4 ms): done
info vsanvcmgmtd[] [vSAN@6876 sub=AdapterServer opID=3255d470] Invoking 'attach' on 'cns-volume-manager' session '<ID>' active 1/1
info vsanvcmgmtd[] [vSAN@6876 sub=CnsVolMgr opID=3255d470] Attaching volume with spec: (vim.cns.VolumeAttachDetachSpec) [
--> (vim.cns.VolumeAttachDetachSpec) {
--> volumeId = (vim.cns.VolumeId) {
--> id = "<volume-handle-id>"
--> },
--> vm = 'vim.VirtualMachine:vm-<ID>'
--> }
--> ]
info vsanvcmgmtd[] [vSAN@6876 sub=CnsTask opID=<ID>] A com.vmware.cns.tasks.attachvolume task is created: task-<ID>
info vsanvcmgmtd[] [vSAN@6876 sub=WorkflowManager opID=<ID>] Attach volume task conflicting with resource vm-<ID>. <number of pending tasks> tasks are already in queue
info vsanvcmgmtd[] [vSAN@6876 sub=VsanTaskSvc opID=<ID>] ADD public task 'task-<ID>', total: <number of pending tasks>
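To see which resources the queued attach tasks are piling up on, the "conflicting with resource" lines can be tallied per resource. The following is a minimal sketch that assumes the log line format shown above; the function name `count_attach_conflicts` is hypothetical, and the log file location in the usage comment is an assumption that may differ in your environment.

```shell
# Tally "Attach volume task conflicting with resource" lines per resource.
# Reads log text on stdin and prints "<count> <resource>", highest count first.
count_attach_conflicts() {
  grep -o 'Attach volume task conflicting with resource [^ ]*' \
    | awk '{ sub(/\.$/, "", $NF); n[$NF]++ } END { for (r in n) print n[r], r }' \
    | sort -rn
}

# Usage (path is an assumption, adjust to where the vsan-health log lives):
#   count_attach_conflicts < /var/log/vmware/vsan-health/vsanvcmgmtd.log
```

A resource that dominates the output is a likely point of contention.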
Inside the VCDB, all the volume attach tasks are stuck in the "queued" state. The attach volume tasks appear in the vpx_task table as shown below.

VCDB=# select * from vpx_task where task_id=<ID obtained from CNS logs in step 2>;
task_id | name | descriptionid | entity_id | entity_type | entity_name | locked_data | complete_state | cancelled | cancellable | error_data | result_data | progress | reason_data | queue_time | start_time | complete_time | event_chain_id | username | vm_id | host_id | computeresource_id | datacenter_id | resourcepool_id | folder_id | alarm_id | scheduledtask_id | change_tag_id | parent_task_id | root_task_id | description | activation_id | continuous | no_of_reattempts | preserved_session_uuid | activation_method_name | activation_method_arguments
-----------+------+-----------------------------------+-----------+-------------+--------------------------------------------+-------------+----------------+-----------+-------------+------------+-------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+------------+---------------+----------------+----------------+--------+---------+--------------------+---------------+-----------------+-----------+----------+------------------+---------------+----------------+--------------+-------------+---------------+------------+------------------+------------------------+------------------------+-----------------------------
<ID>| | com.vmware.cns.tasks.detachvolume | <ID> | 0 | <Nodepool-ID> | | queued | 0 | 0 | | | | <obj xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:vim25" versionId="8.0.3.0" xsi:type="TaskReasonUser"><userName>com.vmware.cns</userName></obj> | <date and time> | | | <session_id>| com.vmware.cns | 195978 | 1141 | 1003 | 3 | | | | | | | | | <ID> | 0 | 0 | |
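Instead of looking up task IDs one at a time, the queued CNS tasks can be counted per task type. This sketch only builds the SQL statement, using the descriptionid and complete_state columns visible in the output above. The function name `cns_queued_tasks_sql` is hypothetical, and the psql invocation in the usage comment is an assumption about how the VCDB is reached in your environment.

```shell
# Print a SQL statement that counts queued CNS tasks per task type in vpx_task.
# Column names (descriptionid, complete_state) are taken from the output above.
cns_queued_tasks_sql() {
  cat <<'SQL'
SELECT descriptionid, count(*) AS queued
FROM vpx_task
WHERE complete_state = 'queued'
  AND descriptionid LIKE 'com.vmware.cns.tasks.%'
GROUP BY descriptionid
ORDER BY queued DESC;
SQL
}

# Usage (connection options are an assumption):
#   cns_queued_tasks_sql | psql -U postgres -d VCDB
```

The query is read-only, so it can be rerun to check whether the backlog is shrinking.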
Per the vpxd log, the disk to be attached is not found, and the attachDisk call fails with a SOAP response fault returned from the vpxa service.

error vpxd[] [Originator@6876 sub=Default opID=<ID>] [VpxLRO] -- ERROR task-<ID>-- <ID>(<ID>) -- vm-<ID>-- vim.VirtualMachine.attachDisk: :vim.fault.NotFound
--> Result:
--> (vim.fault.NotFound) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>
--> msg = "Received SOAP response fault from [<<io_obj p:0x00007f94a89d26c8, h:36, <UNIX ''>, <UNIX '/var/run/envoy-hgw/hgw-pipe'>>, /hgw/host-<ID>/vpxa>]: retrieveVStorageObjectPathAndCrypto
--> Received SOAP response fault from [<<io_obj p:0x0000005cba549810, h:13, <TCP '127.0.0.1 : 50660'>, <TCP '127.0.0.1 : 8307'>>, /sdk>]: retrieveVStorageObjectPathAndCrypto
--> The object or item referred to could not be found."
--> }
--> Args:
-->
--> Arg diskId:
--> (vim.vslm.ID) {
--> id = "<ID>"
--> }
--> Arg datastore:
--> 'vim.Datastore:datastore-<ID>'
--> Arg controllerKey:
--> 1000
--> Arg unitNumber:)
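To check how widespread the attachDisk failures are, the matching vpxd error lines can be grouped by VM. This is a rough sketch that assumes each error is logged on a single line containing both the VM reference and the fault name, and that VM references are numeric (vm-<number>). The function name `count_attachdisk_notfound` is hypothetical.

```shell
# Count vim.VirtualMachine.attachDisk NotFound failures per VM.
# Reads vpxd log text on stdin and prints "<count> <vm>", highest count first.
count_attachdisk_notfound() {
  grep 'vim.VirtualMachine.attachDisk: :vim.fault.NotFound' \
    | grep -o 'vm-[0-9][0-9]*' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}

# Usage: count_attachdisk_notfound < <path to the vpxd log>
```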

Environment

vSphere Kubernetes Service

Cause

When many com.vmware.cns.tasks.attachvolume or detachvolume operations are initiated simultaneously, the CNS service queues these requests to prevent resource contention.

However, if the task volume exceeds the processing capacity or if a specific resource (such as a Virtual Machine) becomes a point of contention, the tasks remain in a queued state within the vpx_task table of the vCenter Database (VCDB). The "object or item referred to could not be found" error typically occurs as a secondary failure when the vSphere Client or a calling service (like the CSI driver) times out or attempts to reference a task/object that has been purged or superseded while the original operation was still blocked in the CNS queue.

Resolution

The workaround below reduces the load on the CNS service, giving it time to catch up on processing all the records.

Scale the CSI controller deployment down to 0 replicas.

kubectl -n vmware-system-csi scale deployment vsphere-csi-controller --replicas=0
Wait for the periodic sync in CNS to catch up on reading all the records and remove the stale volumes from the database. The number of stale volumes can be monitored using the query below.

select count(volume_id) from cns.volume_info where mark_for_delete=true;
Once the number of volumes marked for deletion drops to zero, scale the CSI controller back up (3 replicas in this example).

kubectl -n vmware-system-csi scale deployment vsphere-csi-controller --replicas=3
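The waiting step above can be automated with a small polling loop. The sketch below is not part of the documented workaround: `wait_until_zero` is a hypothetical helper that repeatedly runs a caller-supplied command printing a count and returns once that count reaches zero. The psql options in the usage comment are assumptions about how the database is reached.

```shell
# Poll a command that prints a count until it reports 0.
#   $1: command string to evaluate (must print a non-negative integer)
#   $2: seconds between polls (default 60)
#   $3: maximum number of polls (default 120)
# Returns 0 when the count reaches 0, 1 on timeout, 2 on unexpected output.
wait_until_zero() {
  count_cmd=$1
  interval=${2:-60}
  max_polls=${3:-120}
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    count=$(eval "$count_cmd" | tr -d '[:space:]')
    case $count in
      ''|*[!0-9]*) echo "unexpected output: '$count'" >&2; return 2 ;;
    esac
    if [ "$count" -eq 0 ]; then
      return 0
    fi
    echo "stale volumes remaining: $count" >&2
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# Usage (psql connection options are an assumption):
#   wait_until_zero 'psql -U postgres -d VCDB -At -c "select count(volume_id) from cns.volume_info where mark_for_delete=true;"' \
#     && kubectl -n vmware-system-csi scale deployment vsphere-csi-controller --replicas=3
```

Chaining the scale-up with `&&` means the controller is only restored after the count has actually reached zero; on timeout or unexpected output the deployment is left scaled down for manual review.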