VCDB is out of sync with datastore catalog

Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

Use this KB to trigger a full sync of datastores to bring the VCDB in sync with a datastore catalog.

Symptoms:

- A recurring task in vCenter fails with the message "The object or item referred to could not be found."

- The task can be triggered after expanding, deleting, or re-attaching a PVC.

- The vsanvcmgmtd.log file contains entries similar to the following:

2023-05-30T13:38:11.751+02:00 info vsanvcmgmtd[32316] [vSAN@6876 sub=CnsVolMgr opId=bf0899c9] CNS: UpdateVolumeMetadata with spec: (vim.cns.VolumeMetadataUpdateSpec) [
-->  (vim.cns.VolumeMetadataUpdateSpec) {
-->    volumeId = (vim.cns.VolumeId) {
-->     id = "########-####-####-####-############"
-->    }
...
2023-05-30T13:38:12.073+02:00 error vsanvcmgmtd[41223] [vSAN@6876 sub=TaskService opId=bf0899c9] CNS: Async task vslm.Task:VslmTask-1075667 finished with fault (vim.fault.NotFound) {
-->  faultCause = (vmodl.MethodFault) null,
-->  faultMessage = <unset>
-->  msg = "The object or item referred to could not be found."
--> }

- The VCDB is inconsistent: it references the volume on a datastore other than the one where the volume actually resides. In this example, the VCDB references the volume on datastore "linux-005". However, taking the first four bytes of the volume ID, written as space-separated hex pairs ("29 7e 2c 21"), and searching all datastores for the vmdk descriptor file that backs the volume shows that it resides on datastore "linux-001".

root@vcenter [ ~ ]# /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -t -c "select row_to_json(vpx_storage_object_info) from vpx_storage_object_info where id='########-####-####-####-############'" | jq -r
{
"id": "########-####-####-####-############",
"name": "pvc-########-####-####-####-############",
"capacity": 1024,
"datastore_url": "ds:///vmfs/volumes/########-####-####-####-############/",
"create_time": "2023-02-08 12:34:38.012",
"v_clock": 232531,
"backing_object_id": "",
"disk_path": "[linux-005] prd-default-########-####-####-####-############/prd-default-########-####-####-####-############.vmdk",
"used_capacity": -1
}

[root@esx-03:~] IFS=$'\n'; for i in `find /vmfs/volumes -iname '*.vmdk' -type f |grep -vE "flat|sesparse|delta|rdm|ctk"`; do echo $i; grep 'fcd.uuid' $i ;done | grep -B1 "29 7e 2c 21"
/vmfs/volumes/########-####-####-####-############/fcd/prd-default-########-####-####-####-############.vmdk
ddb.fcd.uuid = "2x 7e 2c 2x 18 cd 40 3x-b1 50 c5 fc d5 d0 fa 2x"

[root@esx-03:~] localcli storage filesystem list | grep -i "########-####-####-####-############"
/vmfs/volumes/########-####-####-####-############ linux-001 ########-####-####-####-############ true VMFS-6 5497289703424 2949391581184
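
The search string passed to grep can be derived from the volume ID instead of being typed by hand. The following is a minimal sketch, assuming the ddb.fcd.uuid value always follows the layout seen in this article (16 space-separated hex bytes with a hyphen after the eighth byte). fcd_uuid_pattern is a hypothetical helper name, not a VMware tool.

```shell
# Hypothetical helper: convert a volume ID in 8-4-4-4-12 form into the
# byte-spaced form seen in the ddb.fcd.uuid line of a vmdk descriptor.
# The body runs in a subshell so its variables do not leak into the caller.
fcd_uuid_pattern() (
    # Lowercase the ID, drop the hyphens, then put a space after every byte.
    spaced=$(printf '%s\n' "$1" | tr 'A-F' 'a-f' | sed 's/-//g; s/../& /g; s/ $//')
    # Assumption: the descriptor joins bytes 1-8 and 9-16 with a hyphen.
    first=$(printf '%s\n' "$spaced" | cut -d' ' -f1-8)
    last=$(printf '%s\n' "$spaced" | cut -d' ' -f9-16)
    printf '%s-%s\n' "$first" "$last"
)
```

For the ID shown under Additional Information, fcd_uuid_pattern e598004e-7bad-49ae-851e-e37394e3b5f2 prints the same string as the ddb.fcd.uuid value found there. Piping the result through cut -d' ' -f1-4 yields the short four-byte prefix used in the grep commands in this article.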

- Reconciling the catalogs of the source and destination datastores does not help, because the VCDB holds a higher v_clock value (232531 in this example) than either datastore's catalog:

[root@esx-03:~] ls -l /vmfs/volumes/linux-005/catalog/vclock
total 0
-rwxr-xr-x 1 root root 0 Dec 22 11:56 vclock-756

[root@esx-03:~] ls -l /vmfs/volumes/linux-001/catalog/vclock
total 0
-rwxr-xr-x 1 root root 0 Dec 22 10:34 vclock-3959
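
The comparison between the VCDB v_clock and the catalog v_clock can be scripted. This is a rough sketch, assuming the catalog directory holds a single marker file named vclock-<number>, as in the listings above. catalog_vclock and vcdb_ahead are hypothetical helper names.

```shell
# Hypothetical helper: print the number encoded in a catalog marker file
# name or path, for example "vclock-3959" prints 3959.
catalog_vclock() (
    name=${1##*/}                    # strip any leading directory part
    printf '%s\n' "${name#vclock-}"  # strip the "vclock-" prefix
)

# Hypothetical helper: succeed when the VCDB v_clock ($1) is higher than
# the catalog v_clock ($2), which is the condition described above.
vcdb_ahead() {
    [ "$1" -gt "$2" ]
}
```

With the values from this example, vcdb_ahead 232531 "$(catalog_vclock vclock-756)" succeeds, which matches the condition under which reconciling the catalogs does not correct the VCDB entry.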

Environment

VMware vSphere 7.0 with Tanzu

Cause

The two datastores belong to the same datastore cluster, and Storage DRS is turned on, so Storage DRS migrated the volume from one datastore to the other. However, the vSphere Container Storage Plug-in does not support Storage DRS.

Resolution

- Turn off Storage DRS to avoid further discrepancies in the VCDB.

- Trigger a full sync from VSLM for each of the source and destination datastores to bring the VCDB in sync with the datastore catalog. Repeat the following steps for each datastore:

1. Copy the datastore URL from the vCenter UI: select the datastore and copy the URL from its summary.
2. Open the VSLM MOB (<vc_ip>/vslm/mob).
3. Go to content → StorageLifeCycleManager.
4. Click VslmSyncDatastore.
5. Enter the URL copied in step 1 as the datastoreUrl value.
6. Set the fullSync value to true.
7. Clear the contents of the fcdId argument.
8. Click Invoke Method.

- Confirm that the VCDB now references the volume on the datastore where it actually resides.

root@vcenter [ ~ ]# /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -t -c "select row_to_json(vpx_storage_object_info) from vpx_storage_object_info where id='########-####-####-####-############'" | jq -r
{
"id": "########-####-####-####-############",
"name": "pvc-########-####-####-####-############",
"capacity": 1024,
"datastore_url": "ds:///vmfs/volumes/########-####-####-####-############",
"create_time": "2023-02-08 12:34:38.012",
"v_clock": 3960,
"backing_object_id": "",
"disk_path": "[linux-001] fcd/prd-default-########-####-####-####-############.vmdk",
"used_capacity": -1
}
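
To check the result without reading the JSON by eye, the datastore name can be pulled out of the disk_path value. The following is a small sketch, assuming disk_path always starts with the datastore name in square brackets, as in the rows shown in this article. disk_path_datastore is a hypothetical helper name.

```shell
# Hypothetical helper: print the datastore name from a disk_path value
# such as "[linux-001] fcd/example.vmdk". Prints nothing if the value
# does not start with a bracketed name.
disk_path_datastore() {
    printf '%s\n' "$1" | sed -n 's/^\[\([^]]*\)].*/\1/p'
}
```

One way to use it is to append the filter .disk_path to the jq -r call in the query above, pass the resulting string to disk_path_datastore, and compare the output with the datastore name reported by localcli storage filesystem list.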

Additional Information

- The TKC nodes and additional mounted volumes such as etcd, containerd and kubelet could also be impacted by this issue.

- In the following example, Storage DRS migrated the volume backing the container from datastore "datastore01" to datastore "datastore00".

root@vcenter [ ~ ]# /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -t -c "select row_to_json(vpx_storage_object_info) from vpx_storage_object_info where id='e598004e-7bad-49ae-851e-e37394e3b5f2'" | jq -r
{
"id": "e598004e-7bad-49ae-851e-e37394e3b5f2",
"name": "pvc-d390c604-a5bc-4ace-b3aa-e1ffad4857ec",
"capacity": 16384,
"datastore_url": "ds:///vmfs/volumes/638edfad-5ffa3e83-94a9-1070fdb2fe96/",
"create_time": "2023-03-07 14:21:57.809",
"v_clock": 131,
"backing_object_id": "",
"disk_path": "[datastore01] worker-vm-01-kfh6k-7bdc95f96b-svnnh/worker-vm-01-kfh6k-7bdc95f96b-svnnh_3.vmdk",
"used_capacity": -1
}

[root@esx-01:~] IFS=$'\n'; for i in `find /vmfs/volumes -iname '*.vmdk' -type f |grep -vE "flat|sesparse|delta|rdm|ctk"`; do echo $i; grep 'fcd.uuid' $i ;done | grep -B1 "e5 98 00 4e"
/vmfs/volumes/638edf99-2a8a7eed-0a4c-1070fdb2fe96/worker-vm-01-kfh6k-7bdc95f96b-p2ws6/worker-vm-01-kfh6k-7bdc95f96b-p2ws6_3.vmdk
ddb.fcd.uuid = "e5 98 00 4e 7b ad 49 ae-85 1e e3 73 94 e3 b5 f2"

[root@esx-01:~] localcli storage filesystem list | grep -i "638edf99-2a8a7eed-0a4c-1070fdb2fe96"
/vmfs/volumes/638edf99-2a8a7eed-0a4c-1070fdb2fe96 datastore00 638edf99-2a8a7eed-0a4c-1070fdb2fe96 true VMFS-6 5497289703424 2949391581184

- After a full sync is triggered from VSLM for the source and destination datastores, the VCDB is in sync with the datastore catalog:

root@vcenter [ ~ ]# /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -t -c "select row_to_json(vpx_storage_object_info) from vpx_storage_object_info where id='e598004e-7bad-49ae-851e-e37394e3b5f2'" | jq -r
{
"id": "e598004e-7bad-49ae-851e-e37394e3b5f2",
"name": "pvc-d390c604-a5bc-4ace-b3aa-e1ffad4857ec",
"capacity": 16384,
"datastore_url": "ds:///vmfs/volumes/638edf99-2a8a7eed-0a4c-1070fdb2fe96/",
"create_time": "2023-03-07 14:21:57.809",
"v_clock": 3960,
"backing_object_id": "",
"disk_path": "[datastore00] worker-vm-01-kfh6k-7bdc95f96b-p2ws6/worker-vm-01-kfh6k-7bdc95f96b-p2ws6_3.vmdk",
"used_capacity": -1
}