How to restore a single PKS cluster when persistent disks for the cluster are lost.

Article ID: 298631

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

How to restore a single TKG cluster when persistent disks for the cluster are lost.

Note: A few things to look out for before following the instructions:
  • The Resolution section below walks through the whole process (if an equivalent disaster event has already occurred and you have a backup, you can skip straight to the redeploy and restore steps, Steps 4 and 5):
    • creating a test cluster
    • creating its backup
    • creating a disaster scenario
    • redeploying the test cluster
    • restoring stateless workloads of the test cluster
  • This procedure has only been tested with TKG clusters on vSphere running flannel as the CNI plugin. If you are on a different IaaS (for example, AWS, GCP, or vSphere with NSX-T), it is recommended to try this procedure on a test cluster first, before attempting to restore the damaged cluster.
  • This procedure does not restore stateful workloads (workloads using PVs, load balancers, etc.).
  • This procedure assumes that the Ops Manager/BOSH and Enterprise PKS deployments themselves have not experienced any disaster event. Running the following two commands confirms that the record of the problematic cluster is still available to both TKG and the BOSH Director:
pks clusters
bosh -d service-instance_<GUID-Problematic-cluster> vms
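
The <GUID> in the deployment name is the cluster's UUID. Assuming you know the cluster name, you can look it up with:
pks cluster <cluster-name>
The UUID field in the output maps to the service-instance_<GUID> deployment name.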


Environment

Product Version: 1.7

Resolution

Note: These instructions are meant to be a high-level explanation of the procedure. It is assumed that you are familiar with the general workflow and know how to gather the credentials required for running the commands used in this procedure.

Step 1 - Create an Enterprise TKG Cluster

Example (cluster1 is the test cluster):
pks create-cluster cluster1 -e cluster1.api.pks.io -p medium -n 1
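Cluster creation runs asynchronously; you can poll the provisioning status with the PKS CLI until the create action reports success:
pks cluster cluster1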
After the cluster is created, run the following command to display the Instance, Process State, VM CID, and Disk CIDs columns:
bosh -d service-instance_<GUID> is --details --column=Instance --column=Process\ State --column=VM\ CID --column=Disk\ CIDs
Example output:
Instance                                           Process State  VM CID                                   Disk CIDs
apply-addons/12a8a409-c207-4e78-a39d-3e32a6a68e09  -              -                                        -
master/406f0925-06fc-4f29-af72-33d48abd8e1e        running        vm-c3980044-5d44-4f7d-a4ec-490e80b78511  disk-1b9814d8-2060-40b2-a724-334bff5f7614
master/539e2364-047d-4e04-bfbb-37ffd86bf3d4        running        vm-fc80f758-8269-4c0e-a4e1-892545c39a9b  disk-291c8c18-15f7-41a4-bc8b-417f3daa170e
master/5765f3ff-80db-46ef-8d8b-c2d5035bc8cb        running        vm-1b475d16-ddc6-452a-b23e-aa7e07357ba1  disk-acd3807f-0969-4511-989b-171f699ad2b6
worker/c41eac49-6b71-4108-9e3f-c4a9a46b7286        running        vm-a6ca3349-6211-44bc-b35d-259c330c5c0f  disk-c0c6b0d1-bd38-4a7e-82bb-4a69acefb4e7
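
To give the restore in Step 5 something concrete to verify, you can optionally deploy a simple stateless workload before taking the backup. A minimal sketch, assuming you have already fetched kubectl credentials for cluster1:
pks get-credentials cluster1
kubectl create deployment nginx --image=nginx
kubectl scale deployment nginx --replicas=2
kubectl get deployments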

Step 2 - Create backup of the test cluster (cluster1)

Credentials for the following commands can be gathered from here:
Prepare to Backup

pre-backup-check:
BOSH_CLIENT_SECRET='xxxxx' bbr deployment --all-deployments --target xxxx --username xxxxx --ca-cert xxxxx pre-backup-check
backup:

Make sure to include the --with-manifest option in the following command:

BOSH_CLIENT_SECRET=xxxxx nohup bbr deployment --deployment service-instance_xxxxx --target xxxxx --username pivotal-container-service-xxxx --ca-cert xxxxx backup --with-manifest
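
BBR writes the backup into a timestamped directory under your current working directory (the name follows the BBR <deployment>_<timestamp> convention). A quick sanity check:
ls service-instance_xxxxx_*/
Because of --with-manifest, this directory should contain manifest.yml alongside the backup artifacts; that manifest is what you will use to redeploy the cluster in Step 4.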

Step 3 - Create a disaster scenario

This step simulates a disaster by deleting the persistent disks used by the test cluster. An equivalent event might already have happened in your environment, in which case you can skip this step.
  • From your IaaS console, power off the VMs of the PKS cluster (cluster1). (Use the VM CIDs obtained in Step 1.)
  • Make sure that BOSH marks the process state for each VM of the PKS cluster (cluster1) as unresponsive agent
  • For each Disk CID obtained in Step 1 (Disk CIDs column), run the following command (this causes BOSH to mark the persistent disk of each VM as orphaned):
bosh -d service-instance_<GUID> orphan-disk <Disk CID>
  • Run a BOSH clean-up to delete the orphaned persistent disks; from this point on, the persistent disks for cluster1 are lost, which is the disaster scenario. A verification command follows this list.
bosh clean-up --all
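
To confirm the disk state at each point, you can list orphaned disks; the disks should appear in this list after the orphan-disk commands and disappear after the clean-up:
bosh disks --orphaned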

Step 4 - Redeploy the test cluster (cluster1)

From a previous backup or the backup created in Step 2, navigate to the backup directory (service-instance_<GUID>_<TIMESTAMP>) and use its manifest.yml in the following command to redeploy the test cluster (cluster1):
bosh -d service-instance_xxxxx deploy service-instance_xxxxx_20200623T191343Z/manifest.yml --fix

This command recreates the VMs of our test cluster (cluster1) and also makes sure that new persistent disks are provisioned and attached to their corresponding VMs. After a successful deployment, you will be able to access the test cluster using the TKG CLI (you may need to log in and fetch credentials again, as in the example below), but the cluster will not have any of your workloads running in it (check with kubectl get all -A). The next step takes care of restoring the stateless workloads to the test cluster.
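
For example (placeholder API endpoint and credentials, substitute your own):
pks login -a <PKS-API-hostname> -u <username> -p <password> --ca-cert <path-to-ca-cert>
pks get-credentials cluster1
kubectl get all -A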

Step 5 - Restore stateless workloads to the test cluster using a backup

This step only requires running the following BBR restore command to restore the stateless workloads:
BOSH_CLIENT_SECRET=xxxx nohup bbr deployment --deployment service-instance_xxxxx --target xxxxx --username pivotal-container-service-xxxxx --ca-cert xxxxx restore --artifact-path service-instance_xxxx_xxxx/
Here, --artifact-path is the directory that contains the backup artifacts of the test cluster (the timestamped directory created in Step 2).

After this operation completes, use the TKG CLI to log in and fetch credentials again, then do a sanity check to make sure that the workloads are restored (kubectl get all -A).
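
If you deployed the sample nginx workload suggested in Step 1, you can also confirm that it came back with the expected replica count:
kubectl get deployments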