Running a backup of Kubernetes objects with Velero and the CSI driver, using the Data Manager for snapshot creation
The snapshot feature is configured to take snapshots of specific namespaces and their corresponding PVs (vSphere volumes), following https://techdocs.broadcom.com/us/en/vmware-tanzu/standalone-components/tanzu-kubernetes-grid-integrated-edition/1-22/tkgi/velero-install-vsphere.html
In vCenter there is an error that a snapshot could not be deleted; however, searching for this object shows that no FCD (First Class Disk) with this name exists.
TKGi 1.22
Velero 1.15.2
vCenter 8.x
Under the velero namespace, for each scheduled backup of each namespace that contains a PVC, an object called snapshots is created. Per https://velero.io/docs/v1.15/csi-snapshot-data-movement/#backup-deletion:
When a backup is created, a snapshot is saved into the repository for the volume data. The snapshot is a reference to the volume data saved in the repository.
When deleting a backup, Velero calls the repository to delete the repository snapshot. So the repository snapshot disappears immediately after the backup is deleted. Then the volume data backed up in the repository turns to orphan, but it is not deleted by this time. The repository relies on the maintenance functionality to delete the orphan data.
It appears that the related objects in vCenter were removed, and when the backup deletion/cleanup schedule ran to remove the older backups, the related snapshots could not be deleted.
A series of deletesnapshots objects in Kubernetes were created that are older than 30 days (older than the last kept backup). The Data Manager keeps retrying to delete these objects but receives a 404 Not Found error, because the corresponding FCD no longer exists in vCenter.
To confirm this is the problem, temporarily stop the Data Manager (scale it down or temporarily delete the DaemonSet). If the errors in vCenter stop, this confirms the problem is related to the situation described above.
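A minimal sketch of temporarily stopping the Data Manager. The DaemonSet name datamgr-for-vsphere-plugin is an assumption based on the default velero-plugin-for-vsphere deployment; verify the actual name in your cluster first.

```shell
# List DaemonSets in the velero namespace to confirm the Data Manager DaemonSet name
kubectl -n velero get daemonsets

# Save the DaemonSet manifest so it can be restored later
# (name datamgr-for-vsphere-plugin is an assumption; adjust to your environment)
kubectl -n velero get daemonset datamgr-for-vsphere-plugin -o yaml > datamgr-ds.yaml

# Temporarily delete the DaemonSet, then watch whether the errors in vCenter stop
kubectl -n velero delete daemonset datamgr-for-vsphere-plugin

# Restore the Data Manager afterwards
kubectl apply -f datamgr-ds.yaml
```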
Messages visible in the Data Manager logs appear to be retriable, leading to an increasing number of errors in vCenter:
2025-06-16T12:49:07.738538162Z stdout F time="2025-06-16T12:49:07Z" level=info msg="There is a temporary catalog mismatch due to a race condition with one another concurrent DeleteSnapshot operation. And it will be resolved by the next consolidateDisks operation on the same VM. Will NOT retry" error=NotFound logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/ivd/ivd_protected_entity.go:439"
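The message above can be confirmed in the Data Manager pod logs. The label selector below is an assumption based on the default velero-plugin-for-vsphere deployment; adjust it to match your pods.

```shell
# Tail the Data Manager pod logs and filter for the NotFound / DeleteSnapshot messages
# (label selector name=datamgr-for-vsphere-plugin is an assumption; verify with
#  kubectl -n velero get pods --show-labels)
kubectl -n velero logs -l name=datamgr-for-vsphere-plugin --tail=200 | grep -i "DeleteSnapshot\|NotFound"
```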
List all deletesnapshots objects:
kubectl get deletesnapshots -A --sort-by=.metadata.creationTimestamp
Verify the output and confirm whether the objects pending deletion are older than the current backup retention:
kubectl get backups -A --sort-by=.metadata.creationTimestamp
For example, if Velero is configured for 30 days of retention, the oldest backup object should be no more than about 30 days old; deletesnapshots objects older than that are stale.
Verify the Velero snapshots objects and check whether they match the age of the backups:
kubectl get snapshots -A --sort-by=.metadata.creationTimestamp
The deletesnapshots objects should not persist for long: once a backup is removed, the corresponding snapshots objects are also deleted. If a Not Found error like the one above occurs, the Data Manager will keep retrying the deletion.
Describing the old deletesnapshots objects shows a status corresponding to the Not Found error.
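For example, to inspect the status of one of the stale objects (the object name is hypothetical; use a name from the earlier listing):

```shell
# Describe a stale deletesnapshots object; the status section should show
# the Not Found error for the missing FCD
kubectl -n velero describe deletesnapshot <deletesnapshot-name>
```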
Deleting these stale objects and restarting the Data Manager should clear the errors.
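A sketch of the cleanup, assuming the stale objects live in the velero namespace and the Data Manager DaemonSet is named datamgr-for-vsphere-plugin (both assumptions; the object names are hypothetical placeholders taken from the earlier listing):

```shell
# Delete the stale deletesnapshots objects identified earlier
kubectl -n velero delete deletesnapshot <stale-object-1> <stale-object-2>

# Restart the Data Manager pods so the retry loop is cleared
# (DaemonSet name is an assumption; verify with kubectl -n velero get daemonsets)
kubectl -n velero rollout restart daemonset datamgr-for-vsphere-plugin
```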
Please note there could be other possible reasons for the above error; if the symptoms do not match, please reach out to support for assistance.