Tanzu Kubernetes Grid Integrated Edition cluster deletion shows as in progress after the cluster has been deleted

Article ID: 317450


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Symptoms:
  • You have tried deleting a Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster by running the tkgi delete-cluster <cluster-name> command.
  • You see that the BOSH deployment was deleted, but the cluster deletion in TKGI remains in progress for some time.
  • When you check the cluster status, you see that it shows Instance forced deletion in progress.
  • In the recent BOSH tasks, you see that the drain-cluster and delete-deployment tasks completed successfully, but a drain-cluster-windows task was triggered even though this is not a Windows cluster. The drain-cluster-windows task shows a status of failed.
  • In the pks-api.log on the PKS API VM, you see timeout entries similar to the following:
 
2021-09-23 12:31:32.038 ERROR 609 — [  op-poll-sub-1] i.p.p.c.BrokerLastOperationPollingTask   : Could not update cluster ClusterEntity{name='test-1', uuid='e26a055c-8e3f-471e-86c7-a5bb10a4222c', owner='osb-pks-int-client', brokerOperationId='{"BoshTaskID":1177663,"BoshContextID":"2aa3ea3f-4dc3-4279-9e4d-c10023e1861d","OperationType":"force-delete","PostDeployErrand":{},"PreDeleteErrand":{},"Errands":[{"Name":"drain-cluster","Instances":null},{"Name":"drain-cluster-windows","Instances":null}]}',lastAction='DELETE', lastActionState='in progress', lastActionDescription='Instance forced deletion in progress', lastCompletedAction='DELETE', lastCompletedActionState='in progress', lastCompletedActionDescription='Instance forced deletion in progress', planId='8A0E21A8-8072-4D80-B365-D1F502085560', masterIps='[10.93.83.181]',parameters=io.pivotal.pks.cluster.data.ClusterParametersEntity@2ecaf285',networkProfileUuid=5a7e8ec5-96fa-4ce3-ad93-eac9425a5dd5', computeProfileUuid=null', maintenanceInfo=MaintenanceInfoVO[publicProperties={docker=20.10.7, kubernetes=1.20.6, pks=1.11.2-build.2, stemcell=621.136}, privateProperty='null', version='null']', attributes={custom_ca=false, current_ca_pem=----BEGIN CERTIFICATE----feign.RetryableException: timeout executing GET https://localhost:3000/v2/service_instances/e26a055c-8e3f-471e-86c7-a5bb10a4222c/last_operation?operation=%7B%22BoshTaskID%22:1177663,%22BoshContextID%22:%222aa3ea3f-4dc3-4279-9e4d-c10023e1861d%22,%22OperationType%22:%22force-delete%22,%22PostDeployErrand%22:%7B%7D,%22PreDeleteErrand%22:%7B%7D,%22Errands%22:%5B%7B%22Name%22:%22drain-cluster%22,%22Instances%22:null%7D,%7B%22Name%22:%22drain-cluster-windows%22,%22Instances%22:null%7D%5D%7D
  • On the pks-api VM, you see that the ncp_cleanup script has been spawned multiple times for the same cluster:
 
$ ps -ef | grep "ncp_cleanup"
root     13806 13308  0 14:16 ?        00:00:00 /bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f  true
root     13916 13308  0 14:18 ?        00:00:00 /bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f  true
root     14187 13308  0 14:21 ?        00:00:00 /bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f  true
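
These symptoms can be confirmed from the TKGI CLI, the BOSH CLI, and the PKS API log, for example with commands along the following lines (the cluster name test-1 is taken from the log entry above, and the pks-api.log path is an assumption that may differ in your environment):

$ tkgi cluster test-1        # Last Action State stays "in progress" with description "Instance forced deletion in progress"
$ bosh tasks --recent        # drain-cluster and delete-deployment succeeded; drain-cluster-windows failed
$ grep "Could not update cluster" /var/vcap/sys/log/pks-api/pks-api.log    # timeout entries on the pks-api VM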


Environment

VMware Tanzu Kubernetes Grid Integrated Edition 1.x

Resolution

This is a known issue affecting TKGI. There is currently no resolution.
 


Workaround:

To work around this issue, perform the following steps on the pks-api VM:

  1. Make a copy of the /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup file.
  2. Open the /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup file with a text editor.
  3. Comment out the following section:

$pksnsxcli cleanup \
  --nsx-manager-host='60.0.0.2' \
  -c $nsx_manager_client_cert_file \
  -k $nsx_manager_client_key_file \
   \
  --nsx-ca-cert-path=$nsx_manager_ca_cert_file \
   \
  --insecure='false' \
  --cluster "$k8s_cluster_name" \
  --t0-router-id="$t0_router_id" \
  --pks=false \
  --read-only=false \
  --force=$force_delete
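
For reference, the same section after it has been commented out might look like the following (the rendered values, such as the NSX Manager address, will differ in your environment):

# $pksnsxcli cleanup \
#   --nsx-manager-host='60.0.0.2' \
#   -c $nsx_manager_client_cert_file \
#   -k $nsx_manager_client_key_file \
#    \
#   --nsx-ca-cert-path=$nsx_manager_ca_cert_file \
#    \
#   --insecure='false' \
#   --cluster "$k8s_cluster_name" \
#   --t0-router-id="$t0_router_id" \
#   --pks=false \
#   --read-only=false \
#   --force=$force_delete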

  4. Save and close the file.
  5. Wait for 11 minutes to see whether the cluster has been deleted successfully (example verification commands are shown after this list).
  6. Revert the change made to the /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup file in Step 3.
  7. Search for the deleted cluster's UUID in NSX-T Manager. If any objects associated with the UUID remain, run the following command to forcibly delete them from NSX-T:
/bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup <cluster-id> true
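
A minimal sketch of Steps 5 through 7 is shown below. The .bak file name assumes the backup in Step 1 was saved with that suffix, and the cluster UUID is the one from the ps output above; substitute your own values.

# Step 5: confirm that TKGI no longer lists the cluster and that the BOSH deployment is gone
tkgi clusters
bosh deployments

# Step 6: revert the change from Step 3, for example by restoring the backup made in Step 1
# (the .bak name is only an example of how the copy may have been saved)
cp /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup.bak /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup

# Step 7: if objects associated with the cluster UUID remain in NSX-T, force-delete them,
# for example for the cluster shown in the ps output earlier in this article
/bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f true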