Velero backups show Failed status due to velero pod "OOMKilled" crash



Article ID: 369970



Products

VMware Tanzu Kubernetes Grid, VMware Tanzu Kubernetes Grid Service (TKGs), Tanzu Kubernetes Grid, VMware Tanzu Kubernetes Grid 1.x

Issue/Introduction

  • TKG cluster backups fail with no logs and no errors
  • Running velero backup describe <BACKUP_NAME> against the failed backup shows "Phase: Failed" with no events 
  • Running velero backup logs <BACKUP_NAME> returns: "An error occurred: file not found"
  • When the backup is run, the Velero pod will crash and restart
  • Describing the Velero pod with kubectl describe pod velero-<UNIQUE_POD_id> -n velero shows the pod as Running, but the Last State reports a previous termination:

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

  • Logging for the Velero pod reports no errors and appears to stop midway through the backup operation
  • With "apiVersion": "velero.io/v1", "kind": "Backup", Error: "failureReason": "found a backup with status "InProgress" during the server starting, mark it as "Failed". 
    • This failure means that the velero pod was restarted in the middle of the backups.
    • This can happen if the velero pod is being restarted due to OOMKilled.
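
The failureReason can be read directly from the Backup resource. This is a minimal sketch, assuming the default velero namespace; substitute your backup name:

    kubectl get backup <BACKUP_NAME> -n velero -o jsonpath='{.status.phase}{"\n"}{.status.failureReason}{"\n"}'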

Environment

This problem can occur in any TKG, TKGm, or TKGS environment. It is a symptom of insufficient resource configuration on the Velero deployment/pod and is not specific to the Kubernetes provider.

Cause

This condition occurs because the Velero pod does not have enough memory to satisfy the demands of the backup. When the pod reaches its memory limit, the OOM killer terminates it, causing the pod to crash and restart.

Resolution

Modify the Velero deployment and increase the memory Limit:

 

Example:

    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:     500m
      memory:  128Mi

 

Change to:

 

    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     500m
      memory:  128Mi
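
One way to apply this change is to patch the Velero deployment directly. The sketch below assumes the default deployment name and namespace of velero; adjust both if your installation differs. Running kubectl edit deployment velero -n velero achieves the same result interactively.

    kubectl -n velero patch deployment velero \
      --patch '{"spec":{"template":{"spec":{"containers":[{"name":"velero","resources":{"limits":{"memory":"1Gi"}}}]}}}}'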

 

The limit is workload dependent and may require more than 1Gi of memory. Testing and tuning of backups will be required to identify exactly how much memory is needed for backups to complete successfully.
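
To help size the limit, the Velero pod's memory consumption can be watched while a backup is running. The commands below are a minimal sketch and assume metrics-server is installed in the cluster and that Velero runs in the default velero namespace:

    # Observe memory usage while a backup is in progress
    kubectl top pod -n velero

    # Watch for restarts indicating the pod is still being OOMKilled
    kubectl get pods -n velero -w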

 

NOTE: Modifying the memory limit in the deployment will trigger a rollout of the Velero pod.
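
After changing the limit, the rollout and the new resource settings can be verified. This is a minimal sketch assuming the default velero deployment name and namespace:

    kubectl rollout status deployment/velero -n velero
    kubectl -n velero get deployment velero -o jsonpath='{.spec.template.spec.containers[0].resources}'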