Intermittent backup failures when backing up Persistent Volume Claims for Guest clusters

Article ID: 424245

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • Backup operations protecting Persistent Volume Claims associated with a Guest cluster fail intermittently.

  • A delay occurs when Commvault attempts to provision a temporary worker pod and Persistent Volume Claim to read data from a snapshot volume.

  • On checking, the PersistentVolumeClaims and the pod in the Guest cluster are stuck in Pending state.

    • kubectl get pvc -n <namespace>

NAME                                    STATUS       VOLUME                                            CAPACITY      ACCESS MODES       STORAGECLASS         AGE
######## - #### - #### - ############   Bound        pvc- ######## - #### - #### - ############         8Gi           RWO               storage-class        10d
######## - #### - #### - ############   Pending      pvc- ######## - #### - #### - ############        10Gi           RWO               storage-class        3m52s

    • kubectl get pods -n <namespace>

 NAME                                   READY         STATUS           RESTARTS   AGE         IP      NODE                        
######## - #### - #### - ############   0/1           Pending            0        3m52s     <none>   ######## - #### - #### - ############ 
######## - #### - #### - ############   1/1           Running            0        45h       <none>   ######## - #### - #### - ############ 
######## - #### - #### - ############   1/1           Running            0        45h       <none>   ######## - #### - #### - ############

 

  • The Guest cluster CSI provisioner logs show a volume provisioning failure:

/var/log/pods/vmware-system-csi_vsphere-csi-controller-########-########/csi-provisioner/0.log

YYYY-MM-DDTHH:MM:SS.672945273Z stderr F E1228 HH:MM:SS controller.go:957] error syncing claim "######## - #### - #### - ############ ": failed to provision volume with StorageClass "<Storage-class-name>": rpc error: code = Internal desc = failed to create volume on namespace: <namespace> in supervisor cluster. Error: persistentVolumeClaim ######## - #### - #### - ############  in namespace <namespace> not in phase Bound within 240 seconds. reason: failed to provision volume with StorageClass "Storage-class-name": rpc error: code = Internal desc = failed to create volume. Error: failed to get the compatible datastore for create volume from snapshot ######## - #### - #### - ############   with error: <nil>

Commvault's backup log shows:

1627535 18d7f4 MM/YY HH:MM:SS 74286 CK8sInfo::OpenVmdk() - Failed to create worker [<worker-name>] for app [### PersistentVolumeClaim ######## - #### - #### - ############ ].
1627535 18d7f4 MM/YY HH:MM:SS 74286 CK8sInfo::SetLastVMErrorFromQiError() - Setting Last VM Error: [329] Error: [0xEDDD0149:{K8sApp::CreateTARWorker(3514)/Int.329.0x149-Error creating worker pod. [Success] in namespace [netbox] details:[Events: 1) Pod:<pod-name>: FailedScheduling: running PreBind plugin "VolumeBinding": binding volumes: pod does not exist any more: pod "<pod-name>" not found. ]}]
1627535 18d7f4 MM/YY HH:MM:SS 74286 VSBkpWorker::BackupVMFileCollection() - Failed to open file collection object.

Environment

  • vSphere Kubernetes Service
  • vSphere Cloud Native Storage 

Cause

Backups fail because snapshot-based temporary PVCs that use a WaitForFirstConsumer (late-binding) StorageClass do not bind within the expected time window (e.g., 240 seconds), resulting in volume binding or pod scheduling errors.
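For reference, the binding mode is a property of the StorageClass itself; a claim that uses a late-binding class stays Pending until a pod consuming it is scheduled. The sketch below is illustrative only: the class name is hypothetical, while the provisioner is the vSphere CSI driver referenced elsewhere in this article.

```yaml
# Illustrative StorageClass -- the name is a made-up example.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-latebinding
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
# With WaitForFirstConsumer, PVC binding is deferred until a pod using the
# claim is scheduled; with Immediate, the volume binds at PVC creation time.
volumeBindingMode: WaitForFirstConsumer
```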

To validate the volume binding mode of the Guest cluster StorageClasses, execute the below command:

kubectl get sc

NAME                                      PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE
test-sc                                   csi.vsphere.vmware.com   Retain          Immediate 
vsan-default-storage-policy-latebinding   csi.vsphere.vmware.com   Delete          WaitForFirstConsumer
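To pick out the late-binding classes without scanning the table by eye, the VOLUMEBINDINGMODE column can be filtered. The sketch below runs awk over a captured copy of the listing above; on a live cluster, pipe `kubectl get sc` into the same awk filter instead of using the here-document.

```shell
# Print the names of StorageClasses whose volumeBindingMode is
# WaitForFirstConsumer. The here-document reproduces the sample listing
# from this article; on a live cluster, use:  kubectl get sc | awk '...'
awk '$4 == "WaitForFirstConsumer" { print $1 }' <<'EOF'
NAME                                      PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE
test-sc                                   csi.vsphere.vmware.com   Retain          Immediate
vsan-default-storage-policy-latebinding   csi.vsphere.vmware.com   Delete          WaitForFirstConsumer
EOF
# prints: vsan-default-storage-policy-latebinding
```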

Resolution

Reach out to the backup vendor for further assistance in resolving the reported issue.