Tanzu Kubernetes Grid Integrated Edition cluster deletion shows as in progress after the cluster has been deleted
Article ID: 317450


VMware Tanzu Kubernetes Grid Integrated Edition


  • You tried to delete a Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster by running the tkgi delete-cluster <cluster-name> command.
  • The BOSH deployment has been deleted, but TKGI still reports the cluster deletion as in progress long afterwards.
  • When you check the cluster status, you see that it is Instance forced deletion in progress.
  • In the recent BOSH tasks, you see that the drain-cluster and delete-deployment tasks completed successfully, and that a drain-cluster-windows task was triggered even though the cluster is not a Windows cluster. The drain-cluster-windows task shows a status of failed.
  • In the pks-api.log on the PKS API VM, you see timeout entries similar to the following:
2021-09-23 12:31:32.038 ERROR 609 — [  op-poll-sub-1] i.p.p.c.BrokerLastOperationPollingTask   : Could not update cluster ClusterEntity{name='test-1', uuid='e26a055c-8e3f-471e-86c7-a5bb10a4222c', owner='osb-pks-int-client', brokerOperationId='{"BoshTaskID":1177663,"BoshContextID":"2aa3ea3f-4dc3-4279-9e4d-c10023e1861d","OperationType":"force-delete","PostDeployErrand":{},"PreDeleteErrand":{},"Errands":[{"Name":"drain-cluster","Instances":null},{"Name":"drain-cluster-windows","Instances":null}]}',lastAction='DELETE', lastActionState='in progress', lastActionDescription='Instance forced deletion in progress', lastCompletedAction='DELETE', lastCompletedActionState='in progress', lastCompletedActionDescription='Instance forced deletion in progress', planId='8A0E21A8-8072-4D80-B365-D1F502085560', masterIps='[]',parameters=io.pivotal.pks.cluster.data.ClusterParametersEntity@2ecaf285',networkProfileUuid=5a7e8ec5-96fa-4ce3-ad93-eac9425a5dd5', computeProfileUuid=null', maintenanceInfo=MaintenanceInfoVO[publicProperties={docker=20.10.7, kubernetes=1.20.6, pks=1.11.2-build.2, stemcell=621.136}, privateProperty='null', version='null']', attributes={custom_ca=false, current_ca_pem=----BEGIN CERTIFICATE----feign.RetryableException: timeout executing GET https://localhost:3000/v2/service_instances/e26a055c-8e3f-471e-86c7-a5bb10a4222c/last_operation?operation=%7B%22BoshTaskID%22:1177663,%22BoshContextID%22:%222aa3ea3f-4dc3-4279-9e4d-c10023e1861d%22,%22OperationType%22:%22force-delete%22,%22PostDeployErrand%22:%7B%7D,%22PreDeleteErrand%22:%7B%7D,%22Errands%22:%5B%7B%22Name%22:%22drain-cluster%22,%22Instances%22:null%7D,%7B%22Name%22:%22drain-cluster-windows%22,%22Instances%22:null%7D%5D%7D
  • On the PKS API VM, you see that the ncp_cleanup script has been spawned multiple times for the same cluster:
$ ps -ef | grep "ncp_cleanup"
root     13806 13308  0 14:16 ?        00:00:00 /bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f  true
root     13916 13308  0 14:18 ?        00:00:00 /bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f  true
root     14187 13308  0 14:21 ?        00:00:00 /bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup 504423c9-bed4-4c41-8c9a-2561ece22e0f  true
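
The last symptom can be checked more directly by counting ncp_cleanup processes per cluster UUID. The following is a minimal sketch, not part of the product: the function name count_ncp_cleanup is hypothetical, and it assumes ps -ef output in the same layout as shown above, where the cluster UUID is the argument immediately after the script path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ps -ef`-style lines on stdin and prints
# "<count> <cluster-uuid>" for each cluster that has ncp_cleanup processes.
count_ncp_cleanup() {
  awk '
    {
      # Find the field that ends in /ncp_cleanup; the next field is the UUID.
      for (i = 1; i < NF; i++) {
        if ($i ~ /\/ncp_cleanup$/) { count[$(i + 1)]++ }
      }
    }
    END { for (id in count) print count[id], id }
  '
}

# Usage on the PKS API VM:
#   ps -ef | count_ncp_cleanup
```

A count greater than 1 for the UUID of the cluster being deleted matches the symptom described above.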


VMware Tanzu Kubernetes Grid Integrated Edition 1.x


This is a known issue affecting TKGI. There is currently no resolution.


To work around this issue, perform the following steps on the PKS API VM:

  1. Make a copy of the /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup file.
  2. Open the /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup file with a text editor.
  3. Comment out the following section:

$pksnsxcli cleanup \
  --nsx-manager-host='' \
  -c $nsx_manager_client_cert_file \
  -k $nsx_manager_client_key_file \
  --nsx-ca-cert-path=$nsx_manager_ca_cert_file \
  --insecure='false' \
  --cluster "$k8s_cluster_name" \
  --t0-router-id="$t0_router_id" \
  --pks=false \
  --read-only=false \

  4. Save and close the file.
  5. Wait for 11 minutes, then check whether the cluster has been deleted successfully.
  6. Revert the change made to the /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup file in Step 3.
  7. Search for the deleted cluster's UUID in NSX-T Manager. If any objects referencing it remain, run the following command to forcibly delete them from NSX-T:
/bin/bash /var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup <cluster-id> true
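
The copy and revert steps above can be sketched as two small helpers. This is an illustration under stated assumptions, not a supported procedure: the function names and the .bak suffix are hypothetical, the commands are expected to be run as a user with write access to the file, and the edit itself (commenting out the pksnsxcli cleanup section) is still done by hand in a text editor.

```shell
#!/usr/bin/env bash
# Script path taken from the article. Set NCP_CLEANUP beforehand to try the
# helpers on a scratch file instead of the real script.
NCP_CLEANUP="${NCP_CLEANUP:-/var/vcap/jobs/pks-nsx-t-osb-proxy/bin/ncp_cleanup}"

# Copy the script before editing it; -p preserves mode and timestamps.
# The ".bak" suffix is an assumption, not something the article prescribes.
backup_ncp_cleanup() {
  cp -p "$NCP_CLEANUP" "$NCP_CLEANUP.bak"
}

# Put the original script back once the cluster deletion has completed.
restore_ncp_cleanup() {
  cp -p "$NCP_CLEANUP.bak" "$NCP_CLEANUP"
}

# Usage on the PKS API VM:
#   backup_ncp_cleanup      # before commenting out the cleanup section
#   restore_ncp_cleanup     # after the cluster has been deleted
```

Restoring from the copy avoids having to uncomment the section by hand and guards against accidental edits elsewhere in the script.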