The vSphere plugin for Velero fails to backup persistent volumes when non-vSphere persistent volumes are present

Article ID: 317074


Products

VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

Symptoms:
  • A Kubernetes cluster has persistent volumes with differing configurations. For example, some persistent volumes may be backed by vSphere CSI while others use NFS or another backing. (A sketch for identifying each volume's backing follows the Note below.)
  • A Velero backup that uses the vSphere plugin to take a volume snapshot of a vSphere CSI-backed persistent volume is created and appears to complete successfully.
  • The Velero upload process never exits the New phase. You see output similar to the following when running kubectl -n velero get uploads.veleroplugin.io -o yaml:

items:
- apiVersion: veleroplugin.io/v1
  kind: Upload
  metadata:
    creationTimestamp: "2020-05-05T14:32:11Z"
    generation: 1
    name: upload-154d2c24-49b0-40eb-a4c2-76d68ae43579
    namespace: velero
    resourceVersion: "213757"
    selfLink: /apis/veleroplugin.io/v1/namespaces/velero/uploads/upload-154d2c24-49b0-40eb-a4c2-76d68ae43579
    uid: 65e0a916-6575-433c-a2f5-651a23bd2892
  spec:
    backupTimestamp: "2020-05-05T14:32:11Z"
    snapshotID: ivd:c25cc856-0d9e-44ec-b542-cd51a8e54dba:154d2c24-49b0-40eb-a4c2-76d68ae43579
  status:
    nextRetryTimestamp: "2020-05-05T14:32:11Z"
    phase: New
    progress: {}

  • When you examine the pods running in the velero namespace, you see that the datamgr-for-vsphere-plugin pods are in a CrashLoopBackOff or Error state.
NAME                               READY   STATUS      RESTARTS   AGE
datamgr-for-vsphere-plugin-fn9pp   0/1     Error       4          30m
  • You see messages similar to the following when you review the logs for the datamgr-for-vsphere-plugin pods:
time="2020-05-05T00:22:20Z" level=info msg="Filtering out the upload request from nodes other than test-md-0-7c9d5d4cd-wcrcr" controller=upload generation=1 logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/controller/upload_controller.go:121" name=upload-08b98b68-e545-481c-9855-a697f7b94161 namespace=velero phase=New
2020/05/05 00:22:20 pei = ivd:5ac38391-4773-4c1f-9d88-d8fd7105a9fa:08b98b68-e545-481c-9855-a697f7b94161
E0505 00:22:20.391671       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 38 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1e39b60, 0x381a630)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
 

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will vary depending on your environment.
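
To determine which persistent volumes are vSphere CSI backed and which are not, you can inspect each volume's source. The following is a minimal sketch, assuming the cluster uses the standard vSphere CSI driver name csi.vsphere.vmware.com; volumes with another backing (such as NFS) show <none> in the CSI-DRIVER column:

# List each PV with its CSI driver (if any) and NFS server (if any)
kubectl get pv -o custom-columns=NAME:.metadata.name,CSI-DRIVER:.spec.csi.driver,NFS-SERVER:.spec.nfs.server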


Environment

VMware Tanzu Kubernetes Grid Plus 1.x
VMware Tanzu Kubernetes Grid 1.x
VMware Cloud Native Storage 1.x
VMware PKS 1.x

Cause

This issue can occur when non-vSphere persistent volumes are present in the cluster. The panic in the log excerpt above suggests that the data manager assumes every persistent volume it processes is a vSphere (ivd) volume, so a volume that lacks the expected vSphere fields can trigger the nil pointer dereference.

Resolution

This is a known issue affecting the vSphere plugin for Velero. There is currently no resolution.

Workaround:
You can work around this issue by having only vSphere CSI backed persistent volumes present in any cluster where you will run Velero backups that use the vSphere plugin to perform volume snapshots. Alternatively, you can perform Velero backups without taking volume snapshots, as shown in the sketch below.
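
For example, assuming a placeholder backup name, the Velero CLI can create a backup that skips volume snapshots:

velero backup create <backup-name> --snapshot-volumes=false

Note that with snapshots disabled, persistent volume data is not captured unless another mechanism, such as Velero's restic integration, is configured; Kubernetes resource definitions are still backed up to object storage.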

To recover the datamgr-for-vsphere-plugin pods that are in a CrashLoopBackOff or Error state, delete the stuck Velero uploads and recreate the pods by issuing commands similar to the following:

 
kubectl -n velero get uploads.veleroplugin.io
kubectl -n velero delete uploads.veleroplugin.io <upload name from previous command>
kubectl -n velero delete pod datamgr-for-vsphere-plugin-<suffix>
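
The datamgr-for-vsphere-plugin pods are typically managed by a DaemonSet, so Kubernetes recreates a deleted pod automatically. If every upload is stuck in the New phase and none of them need to be preserved, a sketch for clearing them all at once:

kubectl -n velero delete uploads.veleroplugin.io --all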


Additional Information

Velero Plugin for vSphere: https://github.com/vmware-tanzu/velero-plugin-for-vsphere