Restore of a namespace can fail when an Operator is controlling the objects in the namespace

Article ID: 345692

Updated On:

Products

Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:
When a deployment in a namespace is controlled by an Operator, such as the MinIO Operator or a SQL operator, the operator manages the objects in that namespace.

The restore fails to restore one of the PVs with the following error:
message: 'error running restore err=chdir /host_pods/4c3b4055-4815-4e9b-b131-75a94c762748/volumes/kubernetes.io~csi/pvc-xxxx-xxxx-xxxx-xxxx-xxxx/mount:
no such file or directory: chdir /host_pods/4c3b4055-4815-4e9b-b131-75a94c762748/volumes/kubernetes.io~csi/pvc-xxxcccc-xxxx-xxxx-xxxx-xxxxxxxxxx/mount:
no such file or directory'
This error can be observed in the logs of the Restic pod node-agent-xxx.
The objects were changed while the restore was in progress.
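
To confirm the symptom, the node-agent pod logs can be filtered for the error text quoted above. The following is a minimal sketch, assuming the node-agent pods run in the velero namespace; the helper name restore_errors is hypothetical, and the pod name must be replaced with the actual node-agent-xxx pod on the affected node.

```shell
# Hypothetical helper: print only the restore errors from one node-agent pod.
# Assumption: the node-agent pods live in the "velero" namespace.
# Usage: restore_errors <node-agent-pod-name>
restore_errors() {
  pod="$1"
  # grep -F matches the literal error prefix shown in the symptom above.
  kubectl -n velero logs "$pod" | grep -F 'error running restore'
}
```

For example, `restore_errors node-agent-xxx` prints the matching log lines, or nothing if that pod did not report the error.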



Cause

The issue does not appear to lie with Velero or with how Kubernetes handles StatefulSets, but with the MinIO Operator.

This is the process observed during the restore:

  • Velero restores the StatefulSet and its pods. Velero adds labels to the restored pods and StatefulSet that record the names of the backup and the restore operation.
  • The copying of data into the PVCs begins via the init containers in the StatefulSet pods.
  • The MinIO Operator notices the newly created StatefulSet and constructs its own expected StatefulSet. The constructed StatefulSet differs from the one Velero just created: the Velero labels newly added under `.metadata` are also copied to `.spec.template.metadata`.
  • Because the two differ, the MinIO Operator updates the StatefulSet to match its expected version.
  • The update causes the StatefulSet pods to be rolled out again.
  • The volume restores in progress on the old StatefulSet pods are interrupted. Some of them fail and some hang indefinitely.

The pods were terminated because the StatefulSet had been updated after the restore created it (its generation had increased since the restore). In general, an "operator" or "controller" is responsible for watching and updating the resources that it owns, in this case the StatefulSet.
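
The generation increase and the copied labels can be checked directly on the restored StatefulSet. The following is a minimal sketch; the helper name inspect_sts and the namespace and StatefulSet names passed to it are hypothetical. In Kubernetes, `.metadata.generation` starts at 1 and increases when the spec changes, so a value above 1 shortly after a restore suggests that something updated the object.

```shell
# Hypothetical helper: show the generation and both label sets of a StatefulSet.
# Usage: inspect_sts <namespace> <statefulset-name>
inspect_sts() {
  ns="$1"
  sts="$2"
  echo "generation:"
  kubectl -n "$ns" get statefulset "$sts" -o jsonpath='{.metadata.generation}{"\n"}'
  echo "labels under .metadata:"
  kubectl -n "$ns" get statefulset "$sts" -o jsonpath='{.metadata.labels}{"\n"}'
  echo "labels under .spec.template.metadata:"
  kubectl -n "$ns" get statefulset "$sts" -o jsonpath='{.spec.template.metadata.labels}{"\n"}'
}
```

If the Velero backup and restore labels appear in both label sets, the StatefulSet has most likely already been rewritten by the operator.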

Resolution

Disable the reconciliation process, either by scaling the operator down to 0 pods or by pausing active reconciliation.


Workaround:

The validated workaround is to temporarily disable the MinIO Operator before starting the restore and to re-enable it once the restore completes. You can do this by editing the minio-operator deployment and setting replicas to 0 (you may need to change the replica count via Helm if the deployment is actively managed and is reset automatically).
The same applies to other subsystems that actively monitor a deployment or StatefulSet.

# set .spec.replicas to 0 in the editor
$ kubectl -n minio-operator edit deploy minio-operator

# verify that no minio-operator pods are running
$ kubectl -n minio-operator get pod
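
As a non-interactive alternative to editing the deployment, the operator can be scaled with `kubectl scale`. The following is a minimal sketch, assuming the same minio-operator namespace and deployment name as above; the helper names pause_operator and resume_operator and the REPLICA_FILE location are hypothetical.

```shell
# Assumptions: namespace and deployment name match the commands above.
OP_NS="minio-operator"
OP_DEPLOY="minio-operator"
# Hypothetical location used to remember the original replica count.
REPLICA_FILE="${REPLICA_FILE:-/tmp/minio-operator.replicas}"

# Scale the operator to 0 before starting the restore.
pause_operator() {
  current="$(kubectl -n "$OP_NS" get deploy "$OP_DEPLOY" -o jsonpath='{.spec.replicas}')"
  # Save the count only while the operator is running, so that a second
  # call does not overwrite the saved value with 0.
  if [ "$current" != "0" ]; then
    printf '%s\n' "$current" > "$REPLICA_FILE"
  fi
  kubectl -n "$OP_NS" scale deploy "$OP_DEPLOY" --replicas=0
  # List the pods so the operator can be confirmed gone.
  kubectl -n "$OP_NS" get pod
}

# Restore the saved replica count once the restore has completed.
resume_operator() {
  kubectl -n "$OP_NS" scale deploy "$OP_DEPLOY" --replicas="$(cat "$REPLICA_FILE")"
}
```

Run `pause_operator` before starting the restore and `resume_operator` after it completes. If the deployment is actively managed through Helm, the replica count may be reset automatically and should be changed there instead.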

CSI snapshots are an alternative that would work because, unlike FSB/Restic, they do not depend on the pods for the volume restores. However, CSI snapshot support is not yet ready on TKG(S) and is therefore not available for such clusters through TMC today.

If the restore hangs and you do not want to wait for it to fail, restart the Velero pod:

$ kubectl -n velero rollout restart deploy velero
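
To judge whether a restore is hung before restarting Velero, the per-volume restore objects can be listed with their phase. The following is a minimal sketch, assuming Velero's PodVolumeRestore resources are in the velero namespace and carry the `velero.io/restore-name` label; the helper name list_pod_volume_restores and the restore name passed to it are hypothetical.

```shell
# Hypothetical helper: list the pod volume restores that belong to one restore.
# Assumption: PodVolumeRestore objects are labeled with velero.io/restore-name.
# Usage: list_pod_volume_restores <restore-name>
list_pod_volume_restores() {
  restore="$1"
  kubectl -n velero get podvolumerestores \
    -l "velero.io/restore-name=${restore}" \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,POD:.spec.pod.name
}
```

Entries that stay in the same phase for a long time without completing are candidates for the hung volume restores described above.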


Additional Information

The code in the MinIO Operator that is responsible for updating the StatefulSet:

https://github.com/minio/operator/blob/f7e74681516e5893bd9073c38ce56f55a19a75c9/pkg/controller/main-controller.go#L1319


Impact/Risks:
The restore of the pods and PVCs might be inconsistent or might remain stuck for a long period of time.