Intermittent backup failures when backing up Persistent Volume Claims for Guest clusters

Article ID: 424245

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • Backup operations protecting Persistent Volume Claims associated with a Guest cluster fail intermittently.

  • A delay occurs when Commvault attempts to provision a temporary worker pod and Persistent Volume Claim to read data from a snapshot volume.

  • On checking, the PersistentVolumeClaims and the pod in the Guest cluster are stuck in Pending state.

    • kubectl get pvc -n <namespace>

NAME                                    STATUS       VOLUME                                            CAPACITY      ACCESS MODES       STORAGECLASS         AGE
######## - #### - #### - ############   Bound        pvc- ######## - #### - #### - ############         8Gi           RWO               storage-class        10d
######## - #### - #### - ############   Pending      pvc- ######## - #### - #### - ############        10Gi           RWO               storage-class        3m52s

    • kubectl get pods -n <namespace>

 NAME                                   READY         STATUS           RESTARTS   AGE         IP      NODE                        
######## - #### - #### - ############   0/1           Pending            0        3m52s     <none>   ######## - #### - #### - ############ 
######## - #### - #### - ############   1/1           Running            0        45h       <none>   ######## - #### - #### - ############ 
######## - #### - #### - ############   1/1           Running            0        45h       <none>   ######## - #### - #### - ############

 

  • The Guest cluster CSI provisioner logs show a volume provisioning failure:

/var/log/pods/vmware-system-csi_vsphere-csi-controller-########-########/csi-provisioner/0.log

YYYY-MM-DDTHH:MM:SS.672945273Z stderr F E1228 HH:MM:SS controller.go:957] error syncing claim "######## - #### - #### - ############ ": failed to provision volume with StorageClass "<Storage-class-name>": rpc error: code = Internal desc = failed to create volume on namespace: <namespace> in supervisor cluster. Error: persistentVolumeClaim ######## - #### - #### - ############  in namespace <namespace> not in phase Bound within 240 seconds. reason: failed to provision volume with StorageClass "Storage-class-name": rpc error: code = Internal desc = failed to create volume. Error: failed to get the compatible datastore for create volume from snapshot ######## - #### - #### - ############   with error: <nil>

Commvault's backup log shows:

1627535 18d7f4 MM/YY HH:MM:SS 74286 CK8sInfo::OpenVmdk() - Failed to create worker [<worker-name>] for app [### PersistentVolumeClaim ######## - #### - #### - ############ ].
1627535 18d7f4 MM/YY HH:MM:SS 74286 CK8sInfo::SetLastVMErrorFromQiError() - Setting Last VM Error: [329] Error: [0xEDDD0149:{K8sApp::CreateTARWorker(3514)/Int.329.0x149-Error creating worker pod. [Success] in namespace [netbox] details:[Events: 1) Pod:<pod-name>: FailedScheduling: running PreBind plugin "VolumeBinding": binding volumes: pod does not exist any more: pod "<pod-name>" not found. ]}]
1627535 18d7f4 MM/YY HH:MM:SS 74286 VSBkpWorker::BackupVMFileCollection() - Failed to open file collection object.

Environment

  • vSphere Kubernetes Service
  • vSphere Cloud Native Storage 

Cause

Backups fail because snapshot-based temporary PVCs that use a WaitForFirstConsumer (late-binding) StorageClass do not bind within the expected time window (e.g., 240 seconds), resulting in volume binding or pod scheduling errors.
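For reference, the binding mode is a property of the StorageClass itself; a claim that uses a late-binding class stays Pending until a pod consuming it is scheduled. The sketch below is illustrative only: the class name is hypothetical, while the provisioner is the vSphere CSI driver referenced elsewhere in this article.

```yaml
# Illustrative StorageClass -- the name is a made-up example.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-latebinding
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
# With WaitForFirstConsumer, PVC binding is deferred until a pod using the
# claim is scheduled; with Immediate, the volume binds at PVC creation time.
volumeBindingMode: WaitForFirstConsumer
```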

To validate the volume binding mode of the Guest cluster StorageClasses, execute the below command:

kubectl get sc

NAME                                      PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE
test-sc                                   csi.vsphere.vmware.com   Retain          Immediate 
vsan-default-storage-policy-latebinding   csi.vsphere.vmware.com   Delete          WaitForFirstConsumer
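To pick out the late-binding classes without scanning the table by eye, the VOLUMEBINDINGMODE column can be filtered. The sketch below runs awk over a captured copy of the listing above; on a live cluster, pipe `kubectl get sc` into the same awk filter instead of using the here-document.

```shell
# Print the names of StorageClasses whose volumeBindingMode is
# WaitForFirstConsumer. The here-document reproduces the sample listing
# from this article; on a live cluster, use:  kubectl get sc | awk '...'
awk '$4 == "WaitForFirstConsumer" { print $1 }' <<'EOF'
NAME                                      PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE
test-sc                                   csi.vsphere.vmware.com   Retain          Immediate
vsan-default-storage-policy-latebinding   csi.vsphere.vmware.com   Delete          WaitForFirstConsumer
EOF
# prints: vsan-default-storage-policy-latebinding
```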

Resolution

Reach out to the backup vendor for further assistance in resolving the reported issue.