Symptoms:
- You have tried to delete a Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster by running the tkgi delete-cluster <cluster-name> command.
- You see that the BOSH deployment was deleted, but the cluster deletion in TKGI remains in progress for an extended time.
- When you check the cluster status, you see that it is "Instance forced deletion in progress".
- In the recent BOSH tasks, you see that the drain-cluster and delete-deployment tasks completed successfully, and that a drain-cluster-windows task was triggered even though the cluster is not a Windows cluster. The drain-cluster-windows task shows a status of failed.
- In pks-api.log on the PKS API VM, you see timeout entries similar to the following:
2021-09-23 12:31:32.038 ERROR 609 --- [ op-poll-sub-1] i.p.p.c.BrokerLastOperationPollingTask : Could not update cluster ClusterEntity{name='test-1', uuid='e26a055c-8e3f-471e-86c7-a5bb10a4222c', owner='osb-pks-int-client', brokerOperationId='{"BoshTaskID":1177663,"BoshContextID":"2aa3ea3f-4dc3-4279-9e4d-c10023e1861d","OperationType":"force-delete","PostDeployErrand":{},"PreDeleteErrand":{},"Errands":[{"Name":"drain-cluster","Instances":null},{"Name":"drain-cluster-windows","Instances":null}]}',lastAction='DELETE', lastActionState='in progress', lastActionDescription='Instance forced deletion in progress', lastCompletedAction='DELETE', lastCompletedActionState='in progress', lastCompletedActionDescription='Instance forced deletion in progress', planId='8A0E21A8-8072-4D80-B365-D1F502085560', masterIps='[10.93.83.181]',parameters=io.pivotal.pks.cluster.data.ClusterParametersEntity@2ecaf285',networkProfileUuid=5a7e8ec5-96fa-4ce3-ad93-eac9425a5dd5', computeProfileUuid=null', maintenanceInfo=MaintenanceInfoVO[publicProperties={docker=20.10.7, kubernetes=1.20.6, pks=1.11.2-build.2, stemcell=621.136}, privateProperty='null', version='null']', attributes={custom_ca=false, current_ca_pem=----BEGIN CERTIFICATE----feign.RetryableException: timeout executing GET https://localhost:3000/v2/service_instances/e26a055c-8e3f-471e-86c7-a5bb10a4222c/last_operation?operation=%7B%22BoshTaskID%22:1177663,%22BoshContextID%22:%222aa3ea3f-4dc3-4279-9e4d-c10023e1861d%22,%22OperationType%22:%22force-delete%22,%22PostDeployErrand%22:%7B%7D,%22PreDeleteErrand%22:%7B%7D,%22Errands%22:%5B%7B%22Name%22:%22drain-cluster%22,%22Instances%22:null%7D,%7B%22Name%22:%22drain-cluster-windows%22,%22Instances%22:null%7D%5D%7D
- On the PKS API VM, you see that the ncp_cleanup script has been spawned multiple times for the same cluster:
$ ps -ef | grep "ncp_cleanup"
root 13806 13308 0 14:16 ? 00:00:00 /bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f true
root 13916 13308 0 14:18 ? 00:00:00 /bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f true
root 14187 13308 0 14:21 ? 00:00:00 /bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f true
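
To make the last symptom easier to confirm, the following is a minimal sketch that counts ncp_cleanup processes per cluster UUID. It assumes the process listing has the same layout as the ps -ef output above, with the cluster UUID as the argument immediately after the script path. The function name count_ncp_cleanup is hypothetical and is not part of TKGI.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with TKGI): reads `ps -ef` style lines on
# stdin and prints "<count> <cluster-uuid>" for each cluster that has one or
# more ncp_cleanup processes.
# Assumption: the cluster UUID is the argument right after the script path.
count_ncp_cleanup() {
  awk '
    {
      # Find the field that ends in /ncp_cleanup; the next field is the UUID.
      for (i = 1; i < NF; i++) {
        if ($i ~ /\/ncp_cleanup$/) {
          count[$(i + 1)]++
        }
      }
    }
    END {
      for (uuid in count) {
        print count[uuid], uuid
      }
    }
  '
}

# Example usage on the PKS API VM:
#   ps -ef | count_ncp_cleanup
```

A count greater than 1 for a single UUID matches the symptom described above. Because the pattern matches only a full script path ending in /ncp_cleanup, a grep process that merely mentions the script name is not counted.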