Errors:
error running command=restic forget --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic 92f09770, stdout=, stderr=unable to create lock in backend: repository is already locked exclusively by PID 1398984 on velero-68dbc9f6d9-586m8 by nonroot (UID 65532, GID 65532)
lock was created at 2023-02-01 13:21:51 (4m8.270026172s ago)
storage ID 01187448
the `unlock` command can be used to remove stale locks
: exit status 1
kubectl describe resticrepository.velero.io shows the error below:
Message: error running command=restic prune --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=loading indexes... loading all snapshots... finding data that is still in use for 100 snapshots , stderr=: signal: killed
time="2023-02-01T13:24:08Z" level=info msg="Running maintenance on restic repository" logSource="pkg/controller/restic_repository_controller.go:192" resticRepo=velero/tbi-loki-wld03-wvp6p time="2023-02-01T13:24:08Z" level=info msg="Removing existing deletion requests for backup" backup=wld03-full-daily-20221227210027 controller=backup-deletion deletebackuprequest=velero/wld03-full-daily-20221227210027-7bxqv logSource="pkg/controller/backup_deletion_controller.go:414" time="2023-02-01T13:24:08Z" level=info msg="Backup has expired" backup=velero/wld03-full-daily-20221204210056 controller=gc expiration="2023-01-03 21:00:56 +0000 UTC" logSource="pkg/controller/gc_controller.go:145" time="2023-02-01T13:24:08Z" level=info msg="Creating a new deletion request" backup=velero/wld03-full-daily-20221204210056 controller=gc expiration="2023-01-03 21:00:56 +0000 UTC" logSource="pkg/controller/gc_controller.go:203" time="2023-02-01T13:24:08Z" level=info msg="deletion request 'wld03-full-daily-20221227210027-6jpgk' removed." backup=wld03-full-daily-20221227210027 controller=backup-deletion deletebackuprequest=velero/wld03-full-daily-20221227210027-7bxqv logSource="pkg/controller/backup_deletion_controller.go:431" time="2023-02-01T13:24:08Z" level=info msg="Backup has expired" backup=velero/wld03-full-daily-20221217210048 controller=gc expiration="2023-01-16 21:00:48 +0000 UTC" logSource="pkg/controller/gc_controller.go:145" time="2023-02-01T13:24:08Z" level=info msg="Creating a new deletion request" backup=velero/wld03-full-daily-20221217210048 controller=gc expiration="2023-01-16 21:00:48 +0000 UTC" logSource="pkg/controller/gc_controller.go:203" time="2023-02-01T13:24:08Z" level=warning msg="error pruning repository" error="error running command=restic prune --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=unable to create lock in backend: repository is already locked exclusively by PID 1398984 on velero-68dbc9f6d9-586m8 by nonroot (UID 65532, GID 65532)\nlock was created at 2023-02-01 13:21:51 (2m17.8354312s ago)\nstorage ID 01187448\nthe `unlock` command can be used to remove stale locks\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_manager.go:296" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryManager).exec" logSource="pkg/controller/restic_repository_controller.go:198" resticRepo=velero/tbi-loki-wld03-wvp6pProblem persists after restarting both velero and restic pods.
message: |-
  running Restic backup, stderr=unable to create lock in backend: repository is already locked exclusively by PID 374329 on velero-68dbc9f6d9-ddmcf by nonroot (UID 65532, GID 65532)
  lock was created at 2023-01-08 21:01:10 (45.062813009s ago)
  storage ID 34c11c53
  the `unlock` command can be used to remove stale locks
  : exit status 1
phase: Failed
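As the error messages themselves suggest, the stale lock can be cleared manually with restic's unlock command (by default it only removes locks whose owning process is gone, which is the case after a killed prune). This only clears the symptom, since the lock will reappear the next time prune is killed, but it lets pending backups proceed. A minimal sketch, assuming the restic binary is present in the velero server pod (the server is what runs restic prune, per the logs above) and that the pod environment already exposes the object-store credentials Velero passes to restic (for example via AWS_SHARED_CREDENTIALS_FILE); if it does not, restic fails to open the repository and nothing is changed:

$ kubectl -n velero exec -it deploy/velero -- \
    restic unlock \
    --repo=s3:https://<S3>:9001/wld03/01GHDX9S5SVH2KXARM1JAVP2BZ/restic/tbi-loki \
    --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password \
    --cache-dir=/scratch/.cache/restic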
There is no known resolution for the underlying prune failure yet; the following is a workaround for the issue.
The value of spec.maintenanceFrequency may not matter here - it may already be set to a higher value, and since the prune commands are failing it is not being respected anyway - but you can always confirm the current value with:
$ kubectl -n velero get resticrepositories.velero.io -o yaml
(if no value is set, the default of 7 days applies)
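If you do want maintenance to run less often once prune works again, the field can also be set explicitly. A sketch, using the repository name that appears in the logs above (tbi-loki-wld03-wvp6p) and a 7-day frequency; adjust both to your environment:

$ kubectl -n velero patch resticrepositories.velero.io tbi-loki-wld03-wvp6p \
    --type merge -p '{"spec":{"maintenanceFrequency":"168h"}}'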
Since we suspect the prune command is being killed due to OOM (note the "signal: killed" in the prune error above), we can increase the resources for the Velero deployment:
$ kubectl -n velero edit deployment velero
The current values are probably the recommended defaults:

VeleroCPURequest = "500m"
VeleroMemoryRequest = "128Mi"
VeleroCPULimit = "1000m"
VeleroMemoryLimit = "512Mi"

Increase the limits, for example to the following values:
Containers:
  velero:
    Image:      extensions.aws-euw1.tmc.cloud.vmware.com/extensions/data-protection-images/velero-v1-9-5@sha256:f9930c7448f717a252895c266e9c662ce59f971f7a8849176f7f3ed75964d87d
    Port:       8085/TCP
    Host Port:  0/TCP
    Command:
      /velero
    Args:
      server
    Limits:
      cpu:     4
      memory:  16Gi
    Requests:
      cpu:     500m
      memory:  128Mi
The velero deployment has multiple init containers, so make sure you change the limits on the main velero container - or on both the init and main containers, since the pod's effective limit is a maximum anyway. Also note that this change is made to the velero deployment, not to the restic daemonset.
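If you prefer a one-liner over kubectl edit, kubectl set resources can apply the same change; this is a sketch using the limit values from the example above, with -c velero restricting the change to the main container:

$ kubectl -n velero set resources deployment velero -c velero \
    --limits=cpu=4,memory=16Gi

Either way, the change rolls out a new velero pod; the next maintenance run should then be able to complete prune instead of being OOM-killed, after which the stale-lock errors should stop recurring.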