Errors:
error running command=restic forget --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic 92f09770, stdout=, stderr=unable to create lock in backend: repository is already locked exclusively by PID 1398984 on velero-68dbc9f6d9-586m8 by nonroot (UID 65532, GID 65532)
lock was created at 2023-02-01 13:21:51 (4m8.270026172s ago)
storage ID 01187448
the `unlock` command can be used to remove stale locks
: exit status 1
kubectl describe resticrepository.velero.io shows the error below:
Message: error running command=restic prune --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=loading indexes... loading all snapshots... finding data that is still in use for 100 snapshots , stderr=: signal: killed
time="2023-02-01T13:24:08Z" level=info msg="Running maintenance on restic repository" logSource="pkg/controller/restic_repository_controller.go:192" resticRepo=velero/tbi-loki-wld03-wvp6p time="2023-02-01T13:24:08Z" level=info msg="Removing existing deletion requests for backup" backup=wld03-full-daily-20221227210027 controller=backup-deletion deletebackuprequest=velero/wld03-full-daily-20221227210027-7bxqv logSource="pkg/controller/backup_deletion_controller.go:414" time="2023-02-01T13:24:08Z" level=info msg="Backup has expired" backup=velero/wld03-full-daily-20221204210056 controller=gc expiration="2023-01-03 21:00:56 +0000 UTC" logSource="pkg/controller/gc_controller.go:145" time="2023-02-01T13:24:08Z" level=info msg="Creating a new deletion request" backup=velero/wld03-full-daily-20221204210056 controller=gc expiration="2023-01-03 21:00:56 +0000 UTC" logSource="pkg/controller/gc_controller.go:203" time="2023-02-01T13:24:08Z" level=info msg="deletion request 'wld03-full-daily-20221227210027-6jpgk' removed." backup=wld03-full-daily-20221227210027 controller=backup-deletion deletebackuprequest=velero/wld03-full-daily-20221227210027-7bxqv logSource="pkg/controller/backup_deletion_controller.go:431" time="2023-02-01T13:24:08Z" level=info msg="Backup has expired" backup=velero/wld03-full-daily-20221217210048 controller=gc expiration="2023-01-16 21:00:48 +0000 UTC" logSource="pkg/controller/gc_controller.go:145" time="2023-02-01T13:24:08Z" level=info msg="Creating a new deletion request" backup=velero/wld03-full-daily-20221217210048 controller=gc expiration="2023-01-16 21:00:48 +0000 UTC" logSource="pkg/controller/gc_controller.go:203" time="2023-02-01T13:24:08Z" level=warning msg="error pruning repository" error="error running command=restic prune --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=unable to create lock in backend: repository is already locked exclusively by PID 1398984 on velero-68dbc9f6d9-586m8 by nonroot (UID 65532, GID 65532)\nlock was created at 2023-02-01 13:21:51 (2m17.8354312s ago)\nstorage ID 01187448\nthe `unlock` command can be used to remove stale locks\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_manager.go:296" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryManager).exec" logSource="pkg/controller/restic_repository_controller.go:198" resticRepo=velero/tbi-loki-wld03-wvp6pProblem persists after restarting both velero and restic pods.
message: |-
  running Restic backup, stderr=unable to create lock in backend: repository is already locked exclusively by PID 374329 on velero-68dbc9f6d9-ddmcf by nonroot (UID 65532, GID 65532)
  lock was created at 2023-01-08 21:01:10 (45.062813009s ago)
  storage ID 34c11c53
  the `unlock` command can be used to remove stale locks
  : exit status 1
phase: Failed
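As the error messages themselves suggest, the stale lock can be cleared manually with restic's unlock command (by default it only removes locks whose owning process is gone, which is the case after a killed prune). This only clears the symptom, since the lock will reappear the next time prune is killed, but it lets pending backups proceed. A minimal sketch, assuming the restic binary is present in the velero server pod (the server is what runs restic prune, per the logs above) and that the pod environment already exposes the object-store credentials Velero passes to restic (for example via AWS_SHARED_CREDENTIALS_FILE); if it does not, restic fails to open the repository and nothing is changed:

$ kubectl -n velero exec -it deploy/velero -- \
    restic unlock \
    --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki \
    --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password \
    --cache-dir=/scratch/.cache/restic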
There is no known resolution for the underlying prune failure yet; the following is a workaround for the issue.
The value of spec.maintenanceFrequency may not matter here - it may already be set to a higher value, and since the prune commands are failing it is not being respected anyway - but you can always confirm the current value with:
$ kubectl -n velero get resticrepositories.velero.io -o yaml
(if no value is set, the default of 7 days applies)
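If you do want maintenance to run less often once prune works again, the field can also be set explicitly. A sketch, using the repository name that appears in the logs above (tbi-loki-wld03-wvp6p) and a 7-day frequency; adjust both to your environment:

$ kubectl -n velero patch resticrepositories.velero.io tbi-loki-wld03-wvp6p \
    --type merge -p '{"spec":{"maintenanceFrequency":"168h"}}'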
Since we suspect the prune command is being killed due to OOM (note the "signal: killed" in the prune error above), we can increase the resources for the Velero deployment:
$ kubectl -n velero edit deployment velero
The current values are probably the recommended defaults:

VeleroCPURequest = "500m"
VeleroMemoryRequest = "128Mi"
VeleroCPULimit = "1000m"
VeleroMemoryLimit = "512Mi"

Increase the limits, for example to the following values:
Containers:
  velero:
    Image:      extensions.aws-euw1.tmc.cloud.vmware.com/extensions/data-protection-images/velero-v1-9-5@sha256:f9930c7448f717a252895c266e9c662ce59f971f7a8849176f7f3ed75964d87d
    Port:       8085/TCP
    Host Port:  0/TCP
    Command:
      /velero
    Args:
      server
    Limits:
      cpu:     4
      memory:  16Gi
    Requests:
      cpu:     500m
      memory:  128Mi
The velero deployment has multiple init containers, so make sure you change the limits on the main velero container - or on both the init and main containers, since the pod's effective limit is a maximum anyway. Also note that this change is made to the velero deployment, not to the restic daemonset.
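If you prefer a one-liner over kubectl edit, kubectl set resources can apply the same change; this is a sketch using the limit values from the example above, with -c velero restricting the change to the main container:

$ kubectl -n velero set resources deployment velero -c velero \
    --limits=cpu=4,memory=16Gi

Either way, the change rolls out a new velero pod; the next maintenance run should then be able to complete prune instead of being OOM-killed, after which the stale-lock errors should stop recurring.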