Large number of velero backups could lead to kube-api load and high memory consumption


Article ID: 390325


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

During an upgrade of a cluster, the control plane VMs suddenly started going into a failing state because the kube-api service ran out of memory, consuming all available memory on the control plane nodes.

During troubleshooting, a very large number of podvolumebackup objects was discovered in etcd (above 100,000), along with a large number of Velero backups completed in the last 30 days (more than 500 backups present).
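The object counts can be checked with kubectl. This is a generic sketch (it assumes cluster access and the default Velero namespace), not the exact commands used during this investigation:

```shell
# Count PodVolumeBackup objects across all namespaces
kubectl get podvolumebackups -A --no-headers | wc -l

# Count Velero backup objects (Velero's default install namespace
# is "velero"; adjust -n if Velero was installed elsewhere)
kubectl get backups.velero.io -n velero --no-headers | wc -l
```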

 

Environment

TKGi 1.19

TKGi 1.20 

Velero 1.13.x

Cause

This is a specific situation where each backup taken has about 250 PVCs associated with it, which creates 250 podvolumebackup objects.

Having 30 backups (one per day) results in 7,500 podvolumebackup objects. If the number of backups grows rapidly, the object count can quickly reach very high numbers, leading to slowness and high memory consumption.
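The arithmetic above can be sketched directly; the 500-backup figure matches what was observed in this case:

```shell
# 250 PVCs per backup, one backup per day retained for 30 days
pvcs_per_backup=250
backups_retained=30
echo $((pvcs_per_backup * backups_retained))   # prints 7500

# With 500 backups retained, the object count explodes:
echo $((pvcs_per_backup * 500))                # prints 125000
```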

node-agent pods run on each worker node and host the controller for podVolumeBackup CRs (custom resources).

When the API server or a node-agent pod restarts, the controller on each worker tries to rebuild its cache of the podVolumeBackup CRs from the API server.
As a result, a List call for podVolumeBackup CRs is issued from each worker, which causes a burst of requests and high memory usage on the API server.

 

 

Resolution

Reducing the number of unused backups will reduce the number of podVolumeBackup CRs, as podVolumeBackup objects are associated with a specific backup.
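A sketch of how unused backups can be removed with the Velero CLI; the backup name placeholder, schedule name, cron expression, and 720h TTL below are illustrative values, not settings from this case:

```shell
# Delete a specific unused backup; its associated podVolumeBackup
# objects are cleaned up once Velero reconciles the deletion
velero backup delete <backup-name> --confirm

# Going forward, give scheduled backups a TTL so they expire
# automatically (720h = 30 days; pick a retention that fits your needs)
velero schedule create daily-backup \
  --schedule "0 2 * * *" \
  --ttl 720h0m0s
```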

A future fix is planned to improve this behaviour and reduce the load on the kube-api server.

Additional Information

Future updates to Velero regarding this behaviour are tracked here:

https://github.com/vmware-tanzu/velero/issues/8764