This article covers a specific corner case where CNS tasks are hung in vCenter Server after a volume is detached from a virtual machine.
Note: All of the symptoms below must match to conclude that the task is indeed hung. Steps 6 and 8 must match to confirm the issue. Steps 1-5 and 7 show how to locate the relevant log messages.
Find the Task ID in the vsanvcmgmtd.log file. For example, for Task ID 1722, the event below is recorded along with the corresponding deletesnapshot task.
Log file: /var/log/vmware/vsan-health/vsanvcmgmtd.log
YYYY-MM-DDT16:55:59.333Z INFO vsanvcmgmtd 52040 [vc@4413 sub="CnsTask" opId="2cbc21c4"] A com.vmware.cns.tasks.deletesnapshot task is created: task-1722
Make a note of the opId from the above log entry, for example opId="2cbc21c4".
Using the opId 2cbc21c4, find the log entry below in vsanvcmgmtd.log. The entry indicates the number of tasks in the queue, for example "1 tasks are already in queue".
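The lookup from Task ID to opId can be scripted. The following is a minimal sketch, assuming GNU grep and sed in the vCenter Server shell and uncompressed log files; opid_for_task is a hypothetical helper name, not a VMware tool.

```shell
# Sketch: print the opId recorded when a given CNS task was created.
# opid_for_task is a hypothetical helper, not part of vCenter Server.
# Usage: opid_for_task <task-id> <log-file>...
opid_for_task() {
    task_id="$1"
    shift
    # -h drops file-name prefixes so the output is the same for one or many files.
    # The end-of-line anchor keeps task-1722 from also matching task-17220.
    grep -h "task is created: task-${task_id}[[:space:]]*\$" "$@" |
        sed -n 's/.*opId="\([^"]*\)".*/\1/p'
}
```

For the entry shown above, `opid_for_task 1722 /var/log/vmware/vsan-health/vsanvcmgmtd.log` would print 2cbc21c4.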
Log file: /var/log/vmware/vsan-health/vsanvcmgmtd.log
YYYY-MM-DDT16:55:59.338Z INFO vsanvcmgmtd 52040 [vc@4413 sub="WorkflowManager" opId="2cbc21c4"] Delete Snapshot task conflicting with resource vm-1087. 1 tasks are already in queue
Identify the VM MoID from the above log entry, for example vm-1087.
Using the VM MoID from step 5, search the vsanvcmgmtd logs for the latest events containing "resource vm-###". The log event MUST indicate more than 0 tasks in the queue.
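Extracting the VM MoID and the queue count for a known opId can be sketched as follows. This assumes the "conflicting with resource" message format shown above; conflict_for_opid is a hypothetical helper name.

```shell
# Sketch: print "<vm-moid> <queued-task-count>" for the conflict entry
# logged under a given opId. conflict_for_opid is a hypothetical helper.
# Usage: conflict_for_opid <opId> <log-file>...
conflict_for_opid() {
    op_id="$1"
    shift
    # Select entries for this opId, then keep only the MoID and the count.
    grep -h "opId=\"${op_id}\"" "$@" |
        sed -n 's/.* conflicting with resource \(vm-[0-9]*\)\. \([0-9]*\) tasks are already in queue.*/\1 \2/p'
}
```

With the sample entry above, `conflict_for_opid 2cbc21c4 vsanvcmgmtd.log` would print "vm-1087 1".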
$> grep "resource vm-1087" vsanvcmgmtd-3.log
YYYY-MM-DDT15:54:20.578Z INFO vsanvcmgmtd 52067 [vc@4413 sub="WorkflowManager" opId="2cbc1b5a"] Detach volume task conflicting with resource vm-1087. 0 tasks are already in queue
YYYY-MM-DDT15:54:20.892Z INFO vsanvcmgmtd 52121 [vc@4413 sub="WorkflowManager" opId="2cbc1b58"] invoking next workflow Detach volume pending on resource vm-1087
YYYY-MM-DDT15:54:28.791Z INFO vsanvcmgmtd 52024 [vc@4413 sub="WorkflowManager" opId="2cbc1b78"] Update volume task conflicting with resource vm-1087. 0 tasks are already in queue
YYYY-MM-DDT15:54:28.847Z INFO vsanvcmgmtd 51854 [vc@4413 sub="WorkflowManager" opId="2cbc1b6f"] invoking next workflow Update volume pending on resource vm-1087
YYYY-MM-DDT16:55:59.143Z INFO vsanvcmgmtd 52131 [vc@4413 sub="WorkflowManager" opId="2cbc21c1"] Delete Snapshot task conflicting with resource vm-1087. 0 tasks are already in queue
YYYY-MM-DDT16:55:59.338Z INFO vsanvcmgmtd 52040 [vc@4413 sub="WorkflowManager" opId="2cbc21c4"] Delete Snapshot task conflicting with resource vm-1087. 1 tasks are already in queue
YYYY-MM-DDT16:55:59.845Z INFO vsanvcmgmtd 52132 [vc@4413 sub="WorkflowManager" opId="2cbc21be"] invoking next workflow Delete Snapshot pending on resource vm-1087
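The queue check across rotated log files can be sketched as below. It assumes every entry starts with an ISO 8601 timestamp in the same format, so a plain sort orders entries chronologically, and that the files are uncompressed; latest_queue_depth is a hypothetical helper name.

```shell
# Sketch: print the queue count from the most recent "conflicting with
# resource" entry for one VM. latest_queue_depth is a hypothetical helper.
# Usage: latest_queue_depth <vm-moid> <log-file>...
latest_queue_depth() {
    moid="$1"
    shift
    # The "\." after the MoID keeps vm-1087 from also matching vm-10870.
    grep -h "conflicting with resource ${moid}\. [0-9]* tasks are already in queue" "$@" |
        LC_ALL=C sort |
        tail -n 1 |
        sed -n 's/.* \([0-9]*\) tasks are already in queue.*/\1/p'
}
```

A result greater than 0, as in `[ "$(latest_queue_depth vm-1087 vsanvcmgmtd*.log)" -gt 0 ]`, corresponds to the condition described in the step above.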
YYYY-MM-DDT16:55:59.845Z INFO vsanvcmgmtd 52132 [vc@4413 sub="WorkflowManager" opId="2cbc21be"] invoking next workflow Delete Snapshot pending on resource vm-1087
opId 2cbc21be - Invoking 'detach' on 'cns-volume-manager'
YYYY-MM-DDT16:55:58.846Z INFO vsanvcmgmtd 52041 [vc@4413 sub="AdapterServer" opId="2cbc21be"] Invoking 'detach' on 'cns-volume-manager' session '528a5549-302b-2590-8b4f-a1d85110e303' active 1/1
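Confirming that a given opId invoked 'detach' on 'cns-volume-manager' can be sketched as a fixed-string search. This assumes the AdapterServer message format shown above; detach_invoked is a hypothetical helper name.

```shell
# Sketch: succeed (exit status 0) if the given opId logged a 'detach'
# invocation on 'cns-volume-manager'. detach_invoked is a hypothetical helper.
# Usage: detach_invoked <opId> <log-file>...
detach_invoked() {
    op_id="$1"
    shift
    # -F treats the pattern as a literal string, so "]" and quotes need no escaping.
    # -q prints nothing and reports the result through the exit status.
    grep -qF "opId=\"${op_id}\"] Invoking 'detach' on 'cns-volume-manager'" "$@"
}
```

For example, `detach_invoked 2cbc21be vsanvcmgmtd*.log && echo "detach found"` prints the message only when a matching entry exists.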
vSphere 8.0 Update 3 and later
In rare circumstances, detaching a volume triggers a status change that prematurely clears the VM's task queue, which leaves CNS tasks in a hung state.
This issue will be fixed in a future release of vCenter Server.
To work around the issue, manually restart the vsan-health service on vCenter Server.
Note: Restarting the vsan-health service forces queued tasks to fail in vCenter Server. As a result, ongoing Kubernetes tasks may be forced to fail as well.
Restart the vsan-health service using the below command:
$ service-control --stop vmware-vsan-health && service-control --start vmware-vsan-health
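The restart can also be wrapped so that the service state is printed afterwards. This is a minimal sketch: restart_vsan_health is a hypothetical wrapper, and the SERVICE_CONTROL variable is an assumption introduced here only so the sequence can be dry-run (for example with SERVICE_CONTROL=echo); it is not a vCenter Server setting.

```shell
# Sketch: stop and start the vsan-health service, then print its status.
# restart_vsan_health and SERVICE_CONTROL are hypothetical names for illustration.
restart_vsan_health() {
    sc="${SERVICE_CONTROL:-service-control}"
    svc="vmware-vsan-health"
    # Abort if either step fails so a failed stop is not followed by a start.
    "$sc" --stop "$svc" || return 1
    "$sc" --start "$svc" || return 1
    # Show the resulting state so a failed start is visible immediately.
    "$sc" --status "$svc"
}
```

Running `SERVICE_CONTROL=echo restart_vsan_health` prints the three commands' arguments without touching the service, which can help review the sequence before running it on vCenter Server.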