Safe Cleanup of Degraded Tanzu Guest Clusters Using kubectl delete cluster
Article ID: 425920


Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • When attempting to delete a degraded vSphere Kubernetes Service (VKS) guest cluster using the standard kubectl delete cluster <name> command, the operation appears to hang indefinitely.
  • Users observe that Machines, VMs, and related resources are not being cleaned up, and CAPI controller logs show repeated errors during the deletion process.
  • This behavior is expected and by design for Cluster API (CAPI) in vSphere with Tanzu environments, where multiple safety mechanisms and pre-terminate hooks ensure orderly cleanup even from degraded states.
  • Initiating the deletion returns no immediate feedback, and the terminal appears to hang.
  • Inspecting the capi-controller-manager pods typically reveals connectivity errors as the manager attempts to reach the failing workload cluster:

E0120 HH:MM:SS.145826 controller.go:353 "Reconciler error" 
err="error creating watch machine-watchNodes: connection to the workload cluster is down" 

I0120 HH:MM:SS.628072 machine_controller.go:565 "Waiting for pre-terminate hooks to succeed" 
hooks="pre-terminate.delete.hook.machine.cluster.x-k8s.io/tkg.tanzu.vmware.com"

  • Checking the state of child objects will show them stuck in a Deleting or Terminating phase:

kubectl get machines,machinedeployments -n <namespace>

kubectl get machines.vsphere.tkg.tanzu.vmware.com -n <namespace>

Environment

  • VMware vSphere with Kubernetes

Cause

  • The delay is caused by a multi-phase deletion process governed by the following safeguards:
    • Cluster Finalizers: The primary Cluster object contains finalizers that prevent its deletion until every child resource (Machines, MachineSets, and MachineDeployments) has been successfully purged.
    • Pre-terminate Hooks: TKG-specific hooks execute essential cleanup tasks, including:
      • Coordinating with the vSphere ESX Agent Manager (EAM) for VM lifecycle teardown.
      • Removing LoadBalancer services from Avi (NSX ALB) or NSX-T.
      • Detaching persistent storage volumes and cleaning up network policies.
    • Connectivity Timeouts: Controllers attempt to validate the Node state. If the workload cluster is down, the controller must wait for internal timeouts before proceeding to the next phase of "orphaned" resource cleanup.
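  • The safeguards above are visible on the objects themselves. As a sketch (with <cluster-name> and <namespace> as placeholders for the guest cluster and its Supervisor namespace), the following shows the finalizers on the Cluster object and which Machines still carry the pre-terminate hook annotation reported in the controller log:

```shell
# Sketch: confirm the safeguards that gate deletion.
# Hook name as reported in the capi-controller-manager log above.
HOOK="pre-terminate.delete.hook.machine.cluster.x-k8s.io/tkg.tanzu.vmware.com"

# 1. Finalizers on the Cluster object block deletion until children are gone.
kubectl get cluster <cluster-name> -n <namespace> -o yaml | grep -A3 "finalizers:"

# 2. Each Machine carries the pre-terminate hook as an annotation;
#    CAPI will not delete the Machine until the hook is cleared.
kubectl get machines -n <namespace> -o yaml | grep -B2 "$HOOK"
```

A Machine whose annotation is still present is waiting on the hook, not stuck.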

Resolution

  • The primary recommendation is to wait for natural completion. In degraded environments, this process typically takes between 15 and 60 minutes.
  • Open a separate terminal to track the cleanup in real-time:
    • Monitor Finalizers:

watch "kubectl get cluster <cluster-name> -n <namespace> -o yaml | grep -A5 finalizers"

    • Track Machine Deletion:

watch "kubectl get machines,vspheremachines -n <namespace>"

    • Stream Controller Logs:

kubectl logs -f deployment/capi-controller-manager -n capi-system -c manager | grep <cluster-name>

  • If the cluster remains in the inventory after two hours, investigate blocking resources manually.
    • Check for persistent resources that may be preventing the finalizers from clearing: 

kubectl get svc -A | grep LoadBalancer

kubectl get pvc -A

kubectl get machines -n <namespace> --show-labels | grep Terminating

  • If it is confirmed that the underlying vSphere resources are gone but the Kubernetes metadata remains, remove the finalizers by patching the objects:

kubectl patch cluster <cluster-name> -n <namespace> -p '{"metadata":{"finalizers":null}}' --type=merge

kubectl patch machine <machine-name> -n <namespace> -p '{"metadata":{"finalizers":null}}' --type=merge
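
  • If several Machines are stuck, the same patch can be applied in a loop. A last-resort sketch, assuming <namespace> is the guest cluster's Supervisor namespace and the backing VMs have already been confirmed gone in vSphere:

```shell
# Last-resort sketch: clear finalizers on every remaining Machine in the namespace.
# Only run this after confirming the backing VMs no longer exist in vSphere.
PATCH='{"metadata":{"finalizers":null}}'

for m in $(kubectl get machines -n <namespace> -o name); do
  echo "Clearing finalizers on ${m}"
  kubectl patch "${m}" -n <namespace> -p "${PATCH}" --type=merge
done
```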

Additional Information

  • The kubectl delete cluster command for degraded VKS guest clusters does not hang; it executes a deliberate, multi-phase cleanup process with safety hooks that can take 15 to 60+ minutes to complete.
  • Waiting patiently is the correct resolution, allowing CAPI controllers to safely remove Machines, VMs, LoadBalancer services, and clear all finalizers.
  • Manual intervention is rarely needed, and force finalizer removal should only be used as a last resort after confirming no blocking resources remain.