Velero backups report status "Failed" in Tanzu Mission Control, and several of the TMC agent pods report "failed to reconcile agent" errors similar to the log snippet below:
,"error":"get agent resource from backend: rpc error: code = Unavailable desc = get management cluster from backend: error reading from server: ########## read tcp connection reset by peer","level":"error","msg":"failed to reconcile agent","time"
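One way to confirm which agent pods are affected is to search their recent logs for the error message. The following is a minimal sketch, not a supported tool: it assumes kubectl is already configured for the impacted cluster, and the function name find_reconcile_errors and the 500-line log window are illustrative choices rather than anything taken from the product.

```shell
#!/bin/sh
# Sketch: count "failed to reconcile agent" log lines per pod in a namespace.
# Assumes kubectl is on PATH and points at the impacted cluster.
# find_reconcile_errors is a hypothetical helper name.

find_reconcile_errors() {
    ns="${1:-vmware-system-tmc}"
    # "-o name" prints one "pod/NAME" entry per line.
    for pod in $(kubectl get pods -n "$ns" -o name); do
        # Only the most recent 500 lines of each container are searched.
        count=$(kubectl logs -n "$ns" "$pod" --all-containers --tail=500 2>/dev/null \
            | grep -c "failed to reconcile agent" || true)
        if [ "$count" -gt 0 ]; then
            echo "$pod: $count"
        fi
    done
}

# Usage (run manually):
#   find_reconcile_errors vmware-system-tmc
```

Pods that print a non-zero count are the ones logging the reconcile error. Pods that print nothing either have no matching lines in the sampled window or could not be read.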
Kubernetes clusters attached to Tanzu Mission Control and configured for Velero Backups.
To resolve this issue and return the Tanzu Mission Control agents to "Healthy" status, initiate a rollout restart of the deployments in the vmware-system-tmc namespace and the velero namespace.
1. Connect via SSH to the impacted cluster.
2. Gather the deployments in the vmware-system-tmc namespace:
# kubectl get deploy -n vmware-system-tmc
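3. Restart each deployment in the vmware-system-tmc namespace. Replace DEPLOYMENT-NAME with a name returned in step 2 and repeat the command for every deployment listed: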
# kubectl -n vmware-system-tmc rollout restart deploy DEPLOYMENT-NAME
4. Verify that all pods in the vmware-system-tmc namespace are running after the rollout restart completes:
# kubectl get pods -n vmware-system-tmc
5. Gather the deployment in the velero namespace:
# kubectl get deployment -n velero
6. Restart the velero deployment in the velero namespace:
# kubectl -n velero rollout restart deployment velero
7. Verify that all pods in the velero namespace are running after the rollout restart completes:
# kubectl get pods -n velero
8. Verify that Velero backups are working again by logging in to Tanzu Mission Control and initiating a new Velero backup. The backup should report status "Completed" once it finishes.
Detailed steps for creating a new Velero backup using Tanzu Mission Control can be found in the Tanzu Mission Control documentation.
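Steps 2 through 7 can also be scripted instead of running the restart once per deployment. The following is a minimal sketch under stated assumptions: kubectl is already configured for the impacted cluster, the function name restart_namespace is hypothetical, and the 300-second timeout is an arbitrary choice rather than a documented value.

```shell
#!/bin/sh
# Sketch: restart every deployment in a namespace and wait for each rollout.
# Assumes kubectl is on PATH and points at the impacted cluster.
# restart_namespace is a hypothetical helper name.

restart_namespace() {
    ns="$1"
    # "-o name" prints one "deployment.apps/NAME" entry per line.
    for deploy in $(kubectl get deployments -n "$ns" -o name); do
        kubectl -n "$ns" rollout restart "$deploy" || return 1
        # Wait for the new pods to become ready; stop if the timeout is hit.
        kubectl -n "$ns" rollout status "$deploy" --timeout=300s || return 1
    done
    # Show the resulting pods so their status can be checked by eye.
    kubectl get pods -n "$ns"
}

# Usage (run manually, in the same order as the steps above):
#   restart_namespace vmware-system-tmc
#   restart_namespace velero
```

Waiting on rollout status before moving to the next deployment makes a stuck rollout visible immediately instead of leaving it to the later pod check. The backup verification in step 8 still needs to be done in Tanzu Mission Control.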