Velero backups show Failed status due to velero pod "OOMKilled" crash



Article ID: 369970



Products

VMware Tanzu Kubernetes Grid, VMware Tanzu Kubernetes Grid Service (TKGs), Tanzu Kubernetes Grid, VMware Tanzu Kubernetes Grid 1.x

Issue/Introduction

  • TKG cluster backups fail with no logs and no errors
  • Running velero backup describe <BACKUP_NAME> against the failed backup shows "Phase: Failed" with no events 
  • Running velero backup logs <BACKUP_NAME> returns: "An error occurred: file not found"
  • When the backup is run, the Velero pod will crash and restart
  • Describing the Velero pod with kubectl describe pod velero-<UNIQUE_POD_id> -n velero shows the pod as Running, but the Last State reports a previous termination:

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

  • Logging for the Velero pod reports no errors and appears to stop midway through the backup operation
  • With "apiVersion": "velero.io/v1", "kind": "Backup", Error: "failureReason": "found a backup with status "InProgress" during the server starting, mark it as "Failed". 
    • This failure means that the velero pod was restarted in the middle of the backups.
    • This can happen if the velero pod is being restarted due to OOMKilled.
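
The failureReason can be read directly from the Backup resource. This is a minimal sketch, assuming the default velero namespace; substitute your backup name:

    kubectl get backup <BACKUP_NAME> -n velero -o jsonpath='{.status.phase}{"\n"}{.status.failureReason}{"\n"}'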

Environment

This problem can occur in any TKG, TKGm, or TKGS environment. It is a symptom of insufficient resource configuration on the Velero deployment/pod and is not specific to the Kubernetes provider.

Cause

This condition occurs because the Velero pod does not have enough memory to satisfy the demands of the backup. When the pod reaches its memory limit, the OOM killer terminates it, causing the pod to crash and restart.

Resolution

Modify the Velero deployment and increase the memory Limit:

 

Example:

    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:     500m
      memory:  128Mi

 

Change to:

 

    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     500m
      memory:  128Mi
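
One way to apply this change is to patch the Velero deployment directly. The sketch below assumes the default deployment name and namespace of velero; adjust both if your installation differs. Running kubectl edit deployment velero -n velero achieves the same result interactively.

    kubectl -n velero patch deployment velero \
      --patch '{"spec":{"template":{"spec":{"containers":[{"name":"velero","resources":{"limits":{"memory":"1Gi"}}}]}}}}'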

 

The limit is workload dependent and may require more than 1Gi of memory. Testing and tuning of backups will be required to identify exactly how much memory is needed for backups to complete successfully.
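
To help size the limit, the Velero pod's memory consumption can be watched while a backup is running. The commands below are a minimal sketch and assume metrics-server is installed in the cluster and that Velero runs in the default velero namespace:

    # Observe memory usage while a backup is in progress
    kubectl top pod -n velero

    # Watch for restarts indicating the pod is still being OOMKilled
    kubectl get pods -n velero -w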

 

NOTE: Modifying the memory limit in the deployment will trigger a rollout of the Velero pod.
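
After changing the limit, the rollout and the new resource settings can be verified. This is a minimal sketch assuming the default velero deployment name and namespace:

    kubectl rollout status deployment/velero -n velero
    kubectl -n velero get deployment velero -o jsonpath='{.spec.template.spec.containers[0].resources}'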