velero backup describe <BACKUP_NAME>
against the failed backup shows "Phase: Failed
" with no events velero backup logs <BACKUP_NAME>
returns: "An error occurred: file not found
"kubectl describe pod velero-<UNIQUE_POD_id> -n velero
shows the pod running, but the following is shown:Last State: Terminated
Reason: OOMKilled
Exit Code: 137
This problem may occur on any TKG, TKGm, or TKGS environment. It is symptomatic of incorrect Velero deployment/pod configuration and is not bound to the Kubernetes provider.
This condition is presented because the Velero pod has insufficient memory to satisfy the backup demands. The memory limit on the Velero pod is hit and causes the pod to crash and restart due to the OOMKiller.
Modify the Velero deployment and increase the memory Limit:
Example:
Limits:
cpu: 1
memory: 512Mi
Requests:
cpu: 500m
memory: 128Mi
Change to:
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 500m
memory: 128Mi
The limit will be workload dependent and may required more than 1Gi of memory. Testing of backups and tuning will be required to identify exactly how much memory is required for successful backups.
NOTE: Modifying the memory limit in the deployment will cause a pod rollout for the Velero pod.