Deleting a TKGi Worker node VM might result in migrated in-tree volumes going missing



Article ID: 344022


Updated On:

Products

VMware vSphere with Tanzu

Issue/Introduction

This Knowledge Base article provides a workaround to prevent data loss when a Worker node VM is recreated under certain conditions.

Symptoms:
  • Pods cannot start because their associated PVCs are not found
  • The associated VMDK is missing from the vSphere datastore
  • VolumeAttachment objects reference PVs that no longer exist


Environment

Tanzu Kubernetes Grid Integrated Edition 1.14.0
VMware Tanzu Kubernetes Grid Integrated Edition 1.x
Tanzu Kubernetes Grid Integrated Edition 1.13.5

Cause

Migrated in-tree vSphere volumes do not have the FCD (First Class Disk) control flag keepAfterDeleteVm set. As a result, if a worker node VM is deleted, all migrated in-tree volumes attached to that node are deleted along with it.
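Before any node deletion takes place, you can see which PVs are attached to a given worker node through its VolumeAttachment objects. A minimal sketch, assuming jq is available; <worker-node-name> is a placeholder for your node name:

$ kubectl get volumeattachments -o json | jq -r '.items[] | select(.spec.nodeName=="<worker-node-name>") | .spec.source.persistentVolumeName'

Any migrated in-tree PV in this list is attached to the node and is therefore affected if that node VM is deleted.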


Resolution

Upgrade TKGi to a fixed version: TKGi 1.13.6 or later, or TKGi 1.14.1 or later.
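After the TKGi control plane itself has been upgraded, individual clusters are typically upgraded with the TKGi CLI. A minimal sketch, assuming a cluster named my-cluster (a placeholder):

$ tkgi upgrade-cluster my-cluster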

Workaround:

Prerequisites:

There are two prerequisites:

  1. The cluster version is older than TKGi 1.13.6 (CSI 2.4.2) or TKGi 1.14.1 (CSI 2.5.2).
  2. There are migrated in-tree VCP volumes.

You can run into the data loss issue if and only if both conditions above are true. Otherwise, there is no issue and no need to follow the workaround steps. One way to check for migrated in-tree volumes is shown below.
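In-tree vSphere (VCP) PersistentVolumes carry a vsphereVolume source in their spec, and volumes that have been migrated to the CSI driver additionally carry the annotation pv.kubernetes.io/migrated-to: csi.vsphere.vmware.com. A minimal check, assuming jq is available:

$ kubectl get pv -o json | jq -r '.items[] | select(.spec.vsphereVolume != null) | .metadata.name'

If the output is empty, the cluster has no in-tree VCP volumes and the workaround below is not needed.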

Notice:

If you have not performed the VCP → CSI migration yet, upgrade to a fixed TKGi version (1.13.6+ or 1.14.1+) before performing the migration. If you have already performed the migration on an affected TKGi version, follow the workaround steps below.


Steps to Follow:

Step 1: Disable BOSH resurrection

Disable BOSH resurrection before performing any other operation.
$ bosh update-resurrection off

Note - You can perform any operation, such as a cluster upgrade, between Step 1 (disable BOSH resurrection) and Step 3 (enable BOSH resurrection).

If any worker node is unresponsive during the operation, the operation (for example, a cluster upgrade) may fail. In that case, follow Step 2 to work around it, then retry the previously failed operation.
 

Step 2: Manually fix any unresponsive worker node

If there is no unresponsive worker node, no action is required in this step.
If you want to test these steps in a test environment, you can simulate an unresponsive node by stopping the BOSH agent on any worker node:
# sv stop agent
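Stopping the BOSH agent makes the VM appear unresponsive to BOSH (the Kubernetes node itself may still report Ready). A hedged check, using a placeholder deployment name:

$ bosh -d service-instance_<cluster-uuid> instances

The affected worker typically shows a process state of "unresponsive agent".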


Follow the steps below to fix the unresponsive node.

1. List all Pods that (1) are running on the unresponsive node and (2) have a PVC in use:

$ export NODE_NAME=32c1090d-cd3c-4d24-93d4-3fdf0153dcff
$ kubectl get pod -A --field-selector spec.nodeName=${NODE_NAME} -o=json | jq -c '.items[] | {name: .metadata.name, namespace: .metadata.namespace, claimName: .spec |  select( has ("volumes") ).volumes[] | select( has ("persistentVolumeClaim") ).persistentVolumeClaim.claimName }'
Replace the node name (32c1090d-cd3c-4d24-93d4-3fdf0153dcff) in the example above with the name of your unresponsive node.
The output looks similar to the following:
{"name":"nginx-deployment-7d7948fb9c-djhl4","namespace":"default","claimName":"pvcsc-vsan"}


2. Cordon the worker node:
$ kubectl cordon 32c1090d-cd3c-4d24-93d4-3fdf0153dcff
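To confirm the cordon took effect (node name as in the example above):

$ kubectl get node 32c1090d-cd3c-4d24-93d4-3fdf0153dcff

The STATUS column should include SchedulingDisabled.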


3. Forcibly delete all Pods returned in the previous step:
$ kubectl delete pod nginx-deployment-7d7948fb9c-djhl4 --force
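The example Pod runs in the default namespace. For Pods in other namespaces, pass the namespace returned by the listing in step 1; a sketch with placeholder names:

$ kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0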

4. Wait for replacements of all Pods deleted above to be running again.

If you run into the known non-graceful node shutdown issue, follow https://kb.vmware.com/s/article/85213 to work around it.
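One way to watch the replacement Pods until they reach Running, using a placeholder namespace:

$ kubectl get pod -n <namespace> -w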

5. Recreate the unresponsive worker node (use your own service instance ID and worker node ID):
$ bosh -d service-instance_7197b487-239d-4305-a63e-432b06376f29 recreate worker/c00197e7-5a29-4b14-98c6-cc10d3209524 --fix
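If you are unsure of the deployment or instance names, they can be looked up with standard BOSH commands. The deployment for a TKGi cluster is typically named service-instance_<cluster UUID>, where the UUID matches the one reported by the tkgi cluster command:

$ bosh deployments
$ bosh -d service-instance_7197b487-239d-4305-a63e-432b06376f29 instances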
 

Step 3: Enable BOSH resurrection

After all clusters are successfully upgraded to a fixed version, enable BOSH resurrection again:
$ bosh update-resurrection on
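One hedged way to review cluster state before re-enabling resurrection is the TKGi CLI: tkgi clusters enumerates the clusters, and tkgi cluster <name> shows details, including the Kubernetes version, for a specific one (my-cluster is a placeholder):

$ tkgi clusters
$ tkgi cluster my-cluster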


Additional Information

Impact/Risks:
Under certain conditions, a Worker node VM recreation event can result in the automatic deletion of the associated PVs and their VMDK files.