How to troubleshoot issues with BBR Restore of a multi-master VMware Tanzu Kubernetes Grid Integrated Edition cluster

Article ID: 298626


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

During a BBR restore of a multi-master VMware Tanzu Kubernetes Grid Integrated Edition cluster that follows the Restore procedures, a failure in the "Redeploy clusters" step can leave the cluster in a non-running state, which blocks the "Restore clusters" step.

Follow the resolution steps below to get the cluster VMs up and running so that you can proceed to the "Restore clusters" step.

Environment

Tanzu Kubernetes Grid Integrated Edition v1.7 and above

Resolution

1. Run `bosh vms` against the specific cluster deployment to confirm the status of the VMs. If the "Redeploy clusters" step failed, expect the instances to be in a state other than "running".

2. If you followed the Restore procedures to redeploy the cluster, find the BOSH task ID of the redeployment. You can get it by running `pks cluster <clustername>` and reading the BOSH task ID from the value of one of the "Last Action" fields.

3. Run `bosh task <taskid> --debug` to get more information about the task and the cause of the failure.

4. If the output of step 3 mentions failed jobs on any VM, `bosh ssh` into that VM and review the logs of the failed job.

5. If the failed job is `etcd` and its logs show that it fails to start because it is trying to reach the other instances, follow the KB article "How to shutdown and startup a multi-master PKS cluster".

6. If the procedure in the KB article from step 5 does not resolve the failure, you might need to reset the etcd data and bootstrap a new etcd cluster. Follow the KB article "How to reset etcd data of a multi-master VMware Tanzu Kubernetes Grid Integrated Edition cluster".

7. Once the cluster VMs are all healthy and in a "running" state, proceed to the "Restore clusters" step.
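
The checks in steps 1 through 4 (and the re-check in step 7) can be sketched as a small shell helper. This is a minimal sketch, not part of the official procedure. The function name `non_running_instances` is hypothetical, and it assumes that `bosh vms` lists the instance name (`<group>/<id>`) in the first column and the process state in the second. The variables `DEPLOYMENT` and `TASK_ID`, the `service-instance_<cluster UUID>` deployment naming, the `master/0` instance name, and the `/var/vcap/sys/log` log location are likewise assumptions about a typical environment and should be verified against yours.

```shell
# Sketch only: reads `bosh vms` output on stdin and prints every instance
# whose process state is not "running", as "<instance><TAB><state>".
# Assumption: column 1 is "<group>/<id>" and column 2 is the process state,
# separated by tabs (piped output) or by two or more spaces (terminal table).
non_running_instances() {
  awk -F'\t|  +' '
    # Only consider rows whose first column looks like an instance name,
    # which skips banner and summary lines.
    $1 ~ /^[A-Za-z0-9_-]+\/[0-9a-f-]+$/ {
      state = $2
      gsub(/^[ \t]+|[ \t]+$/, "", state)   # trim stray whitespace
      if (state != "running") print $1 "\t" state
    }'
}

# Example usage against a real environment (commented out; values are
# placeholders you must supply):
#
#   DEPLOYMENT="service-instance_<cluster UUID>"   # assumed naming, check `bosh deployments`
#   bosh -d "$DEPLOYMENT" vms | non_running_instances    # steps 1 and 7
#
#   TASK_ID="<taskid>"                             # from `pks cluster <clustername>`
#   bosh task "$TASK_ID" --debug                   # step 3
#
#   bosh -d "$DEPLOYMENT" ssh master/0             # step 4; instance name is an assumption
#   # On the VM, job logs are typically under /var/vcap/sys/log/<job>/
```

If the helper prints nothing, every listed instance reported "running", which is the condition step 7 waits for before moving on to "Restore clusters".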