Snapshotted Cloud Native Storage volume disappears from UI after attempting delete

Products

VMware vCenter Server

Issue/Introduction

Symptoms:

The following error typically appears in /var/log/vmware/vsan-health/vsanvcmgmtd.log:

2023-02-08T21:49:12.482Z error vsanvcmgmtd[44176] [vSAN@6876 sub=Workflow opId=462ac6db] Workflow current action has fault (vim.fault.CnsFault) {
faultCause = (vmodl.MethodFault) null,
faultMessage = <unset>,
reason = "Failed to delete volume. Error: (vmodl.fault.InvalidArgument) {
faultCause = (vmodl.MethodFault) null,
faultMessage = (vmodl.LocalizableMessage) [
    (vmodl.LocalizableMessage) {
     key = "com.vmware.vim.fcd.error.snapshotsNotAllowed",
     arg = (vmodl.KeyAnyValue) [
       (vmodl.KeyAnyValue) {
        key = "snapshot",
        value = "XXX"
       }
     ],
     message = "Cannot be performed on FCD with snapshots. Snapshot XXX relies on this FCD"
    }
],
invalidProperty = "id"
msg = "A specified parameter was not correct: id"
}"
msg = ""
}

The volume then disappears from the CNS UI and is no longer returned by the CNS query API.

To confirm, query the vCenter database and check that the mark_for_delete field is set to true for the missing volume:

select mark_for_delete from cns.volume_info where volume_id ='missing_pv_id';

The vpxd log shows the following:

2024-10-08T14:36:52.217Z info vpxd[08024] [Originator@6876 sub=Default opID=opId] [VpxLRO] -- ERROR task-XXX -- VStorageObjectManager -- vim.vslm.vcenter.VStorageObjectManager.deleteVStorageObjectEx: vmodl.fault.InvalidArgument:
Result:
(vmodl.fault.InvalidArgument) {
	faultCause = (vmodl.MethodFault) null,
	faultMessage = (vmodl.LocalizableMessage) [
		(vmodl.LocalizableMessage) {
			key = "com.vmware.vim.fcd.error.snapshotsNotAllowed",
			arg = (vmodl.KeyAnyValue) [
				(vmodl.KeyAnyValue) {
					key = "snapshot",
					value = "XXX"
				}
			],
			message = "Cannot be performed on FCD with snapshots. Snapshot XXX relies on this FCD"
		}
	],
	invalidProperty = "id"
	msg = "A specified parameter was not correct: id"
}
Args:
Arg id:
(vim.vslm.ID) {
	id = "80f6f9a8-XXX-YYY-ZZZ-d4a87c8d3bf3"
}
Arg datastore:
'vim.Datastore:datastore-XX'
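
To find which snapshots are blocking a delete without reading the logs by hand, the excerpts above can be scanned with a short script. This is a minimal sketch, not a supported tool: the function name `blocking_snapshots` is hypothetical, and the pattern assumes the message text always follows the "Snapshot <id> relies on this FCD" form shown in the excerpts.

```python
import re

# Fault key that appears in the log excerpts when a delete is blocked by a snapshot.
FAULT_KEY = "com.vmware.vim.fcd.error.snapshotsNotAllowed"

# Assumed message format, taken from the excerpts: "Snapshot <id> relies on this FCD".
SNAPSHOT_RE = re.compile(r"Snapshot (\S+) relies on this FCD")


def blocking_snapshots(log_text):
    """Return the snapshot IDs named in snapshotsNotAllowed faults.

    IDs are returned in order of first appearance, without duplicates.
    An empty list means the fault key was not found in the text.
    """
    if FAULT_KEY not in log_text:
        return []
    seen = []
    for match in SNAPSHOT_RE.finditer(log_text):
        snapshot_id = match.group(1)
        if snapshot_id not in seen:
            seen.append(snapshot_id)
    return seen
```

The function takes the log content as a string, so it can be fed the contents of vsanvcmgmtd.log or the vpxd log after reading the file.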

Environment

VMware vCenter Server 7.0.x

Cause

Because of the internal sync logic, a volume is marked as pending delete before the delete call to the lower layers returns successfully. If the volume has a snapshot, the lower layers throw an exception that is not handled correctly, and the volume remains stuck in the pending delete state.
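
The sequence described above can be illustrated with a toy model. This is not vCenter code: every name here (`lower_layer_delete`, `delete_volume`, the dictionary fields other than `mark_for_delete`) is hypothetical and only mirrors the order of events in the explanation.

```python
class SnapshotsNotAllowed(Exception):
    """Stands in for the fault raised when the volume still has snapshots."""


def lower_layer_delete(volume):
    # Hypothetical stand-in for the lower-layer delete call.
    if volume["snapshots"]:
        raise SnapshotsNotAllowed(volume["id"])
    volume["deleted"] = True


def delete_volume(volume):
    # The flag is set before the lower layer confirms the delete.
    volume["mark_for_delete"] = True
    try:
        lower_layer_delete(volume)
    except SnapshotsNotAllowed:
        # The failure is not handled: the flag is never cleared,
        # so the volume stays hidden although it still exists.
        pass


volume = {"id": "pv-1", "snapshots": ["snap-1"],
          "mark_for_delete": False, "deleted": False}
delete_volume(volume)
print(volume["mark_for_delete"], volume["deleted"])
```

In this model the volume ends up flagged for deletion but not deleted, which corresponds to the stuck state that the workaround below clears.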

Resolution

This issue is resolved in vCenter Server 8.0 and later versions.

Workaround:

To work around the issue, log in to the vCenter database and run the following SQL command:

update cns.volume_info set mark_for_delete=false where volume_id='missing_pv_id';

If the "mark_for_delete" flag is returned as "f", you may need to delete the snapshots associated with the given PVC as follows:

1. List all the snapshots of the FCD using the RetrieveSnapshotInfo() API.
     a. Go to the vCenter MOB: https://<vc-ip>/mob/?moid=VStorageObjectManager&method=retrieveSnapshotInfo
     b. Enter the FCD ID and the datastore MOB.
     c. Invoke the method.
     d. It should return all snapshot IDs of the FCD.
2. Delete the snapshots of the FCD one by one.
     a. Go to the vCenter MOB: https://<vc-ip>/mob/?moid=VStorageObjectManager&method=deleteSnapshot
     b. Enter the FCD ID, the datastore MOB, and the snapshot ID.
     c. Invoke the method.
     d. Wait for the task to succeed, then repeat for the next snapshot.

This brings the volume back from the pending delete state, and it appears again in the CNS UI and the CNS query API.
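
When the snapshot cleanup has to be repeated on several vCenter instances, the two MOB URLs from the steps above can be generated rather than typed. The sketch below only builds the URLs; it does not call vCenter. The function names and the example host `vc.example.com` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def mob_method_url(vc_host, method):
    """Build the MOB URL for a VStorageObjectManager method on the given vCenter."""
    query = urlencode({"moid": "VStorageObjectManager", "method": method})
    return f"https://{vc_host}/mob/?{query}"


def snapshot_cleanup_urls(vc_host):
    """Return the URLs for listing and deleting FCD snapshots, in the order they are used."""
    return [
        mob_method_url(vc_host, "retrieveSnapshotInfo"),
        mob_method_url(vc_host, "deleteSnapshot"),
    ]


if __name__ == "__main__":
    # vc.example.com is a placeholder host name.
    for url in snapshot_cleanup_urls("vc.example.com"):
        print(url)
```

The FCD ID, datastore, and snapshot ID are still entered on the MOB page itself, as in the manual steps.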