Running a backup of Kubernetes objects with Velero and the CSI driver, using the Data Manager for snapshot creation
The snapshot feature is configured to take snapshots of specific namespaces and their corresponding PVs (vSphere volumes), following https://techdocs.broadcom.com/us/en/vmware-tanzu/standalone-components/tanzu-kubernetes-grid-integrated-edition/1-22/tkgi/velero-install-vsphere.html
In vCenter there is an error that a snapshot could not be deleted; however, searching for this object shows that no FCD (First Class Disk) with this name exists.
TKGi 1.22
Velero 1.15.2
vCenter 8.x
Under the velero namespace, for each scheduled backup of each namespace that contains a PVC, an object called snapshots is created. Per https://velero.io/docs/v1.15/csi-snapshot-data-movement/#backup-deletion:
When a backup is created, a snapshot is saved into the repository for the volume data. The snapshot is a reference to the volume data saved in the repository.
When deleting a backup, Velero calls the repository to delete the repository snapshot. So the repository snapshot disappears immediately after the backup is deleted. Then the volume data backed up in the repository turns to orphan, but it is not deleted by this time. The repository relies on the maintenance functionality to delete the orphan data.
It appears that the related objects in vCenter were removed, and when the backup deletion/cleanup schedule ran to remove the older backups, the related snapshots could not be deleted.
A series of deletesnapshots objects in Kubernetes were created that are older than 30 days (older than the last kept backup). The Data Manager keeps retrying to delete these objects but receives a 404 Not Found error, because the corresponding FCD no longer exists in vCenter.
To confirm this is the problem, temporarily stop the Data Manager (scale it down or temporarily delete the DaemonSet). If the errors in vCenter stop, this confirms the problem is related to the situation described above.
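A minimal sketch of temporarily stopping the Data Manager. The DaemonSet name datamgr-for-vsphere-plugin is an assumption based on the default velero-plugin-for-vsphere deployment; verify the actual name in your cluster first.

```shell
# List DaemonSets in the velero namespace to confirm the Data Manager DaemonSet name
kubectl -n velero get daemonsets

# Save the DaemonSet manifest so it can be restored later
# (name datamgr-for-vsphere-plugin is an assumption; adjust to your environment)
kubectl -n velero get daemonset datamgr-for-vsphere-plugin -o yaml > datamgr-ds.yaml

# Temporarily delete the DaemonSet, then watch whether the errors in vCenter stop
kubectl -n velero delete daemonset datamgr-for-vsphere-plugin

# Restore the Data Manager afterwards
kubectl apply -f datamgr-ds.yaml
```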
Messages visible in the Data Manager logs appear to be retriable, leading to an increasing number of errors in vCenter:
2025-06-16T12:49:07.738538162Z stdout F time="2025-06-16T12:49:07Z" level=info msg="There is a temporary catalog mismatch due to a race condition with one another concurrent DeleteSnapshot operation. And it will be resolved by the next consolidateDisks operation on the same VM. Will NOT retry" error=NotFound logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/ivd/ivd_protected_entity.go:439"
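The message above can be confirmed in the Data Manager pod logs. The label selector below is an assumption based on the default velero-plugin-for-vsphere deployment; adjust it to match your pods.

```shell
# Tail the Data Manager pod logs and filter for the NotFound / DeleteSnapshot messages
# (label selector name=datamgr-for-vsphere-plugin is an assumption; verify with
#  kubectl -n velero get pods --show-labels)
kubectl -n velero logs -l name=datamgr-for-vsphere-plugin --tail=200 | grep -i "DeleteSnapshot\|NotFound"
```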
List all deletesnapshots objects:
kubectl get deletesnapshots -A --sort-by=.metadata.creationTimestamp
Verify the output and confirm whether the objects pending deletion are older than the current backup retention:
kubectl get backups -A --sort-by=.metadata.creationTimestamp
For example, if Velero is configured for 30 days of retention, the oldest backup object should be no more than about 30 days old; deletesnapshots objects older than that are stale.
Verify the Velero snapshots objects and check whether they match the age of the backups:
kubectl get snapshots -A --sort-by=.metadata.creationTimestamp
The deletesnapshots objects should not persist for long: once a backup is removed, the corresponding snapshots objects are also deleted. If a Not Found error like the one above occurs, the Data Manager will keep retrying the deletion.
Describing the old deletesnapshots objects shows a status corresponding to the Not Found error.
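For example, to inspect the status of one of the stale objects (the object name is hypothetical; use a name from the earlier listing):

```shell
# Describe a stale deletesnapshots object; the status section should show
# the Not Found error for the missing FCD
kubectl -n velero describe deletesnapshot <deletesnapshot-name>
```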
Deleting these stale objects and restarting the Data Manager should clear the errors.
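A sketch of the cleanup, assuming the stale objects live in the velero namespace and the Data Manager DaemonSet is named datamgr-for-vsphere-plugin (both assumptions; the object names are hypothetical placeholders taken from the earlier listing):

```shell
# Delete the stale deletesnapshots objects identified earlier
kubectl -n velero delete deletesnapshot <stale-object-1> <stale-object-2>

# Restart the Data Manager pods so the retry loop is cleared
# (DaemonSet name is an assumption; verify with kubectl -n velero get daemonsets)
kubectl -n velero rollout restart daemonset datamgr-for-vsphere-plugin
```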
Please note there could be other possible reasons for the above error; if the symptoms do not match, please reach out to support for assistance.