Multiple clusters are stuck in the creating/updating state. Cluster status can be verified with tanzu cluster list --include-management-cluster:
$ tanzu cluster list --include-management-cluster
  NAME      NAMESPACE  STATUS    CONTROLPLANE  WORKERS  KUBERNETES               ROLES   PLAN  TKR
  cluster1  dev        updating  3/3           10/11    v1.27.5+vmware.1-fips.1  <none>  prod  v1.27.5---vmware.1-fips.1-tkg.8
  cluster2  dev        creating  0/3           0/3      v1.26.8+vmware.1-fips.1  <none>  prod  v1.26.8---vmware.1-fips.1-tkg.3
  cluster3  dev        updating  3/3           2/3      v1.26.8+vmware.1-fips.1  <none>  prod  v1.26.8---vmware.1-fips.1-tkg.3
Environment: VMware Tanzu Kubernetes Grid
Clusters can get stuck in the updating/creating state because of infrastructure changes such as an ESXi patch update.
As a result, machines can get stuck in the Deleting or Provisioning phase.
Identify the machine that is stuck in the Deleting or Provisioning phase, then describe it.
$ kubectl get ma -n <namespace>
NAME                        CLUSTER    NODENAME                     PROVIDERID         PHASE     AGE    VERSION
cluster1-md-0-xk68pn-nx58   cluster1   cluster1-md-0-xk68pn-nx58q   vsphere://421f...  Running   139d   v1.26.8+vmware.1-fips.1
cluster1-md-0-k68pn-q9gds   cluster1   cluster1-md-0-k68pn-q9gds    vsphere://4216...  Deleting  139d   v1.26.8+vmware.1-fips.1
$ kubectl describe ma -n <namespace> <machine-name>
Conditions:
  Last Transition Time:  2025-08-25T01:27:42Z
  Message:               Condition Ready on node is reporting status Unknown for more than 5m0s
  Reason:                UnhealthyNode
  Severity:              Warning
  Status:                False
  Type:                  Ready
  Last Transition Time:  2025-04-07T14:46:58Z
  Status:                True
  Type:                  BootstrapReady
  Last Transition Time:  2025-08-25T01:32:55Z
  Status:                True
  Type:                  DrainingSucceeded
  Last Transition Time:  2025-08-25T01:27:13Z
  Message:               Condition Ready on node is reporting status Unknown for more than 5m0s
  Reason:                UnhealthyNode
  Severity:              Warning
  Status:                False
  Type:                  HealthCheckSucceeded
  Last Transition Time:  2025-04-07T14:47:54Z
  Status:                True
  Type:                  InfrastructureReady
  Last Transition Time:  2025-08-25T01:22:15Z
  Message:               Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown.
  Reason:                NodeConditionsFailed
  Status:                Unknown
  Type:                  NodeHealthy
  Last Transition Time:  2025-08-25T01:27:13Z
  Reason:                WaitingForRemediation
  Severity:              Warning
  Status:                False
  Type:                  OwnerRemediated
  Last Transition Time:  2025-08-25T01:27:13Z
  Status:                True
  Type:                  PreDrainDeleteHookSucceeded
  Last Transition Time:  2025-08-25T01:32:55Z
  Message:               Waiting for node volumes to be detached
  Reason:                WaitingForVolumeDetach
  Severity:              Info
  Status:                False
  Type:                  VolumeDetachSucceeded
Infrastructure Ready:    true
Last Updated:            2025-08-25T01:27:42Z
Phase:                   Deleting
Events:
  Type    Reason                  Age                     From                           Message
  ----    ------                  ----                    ----                           -------
  Normal  MachineMarkedUnhealthy  4m14s (x301 over 3h4m)  machinehealthcheck-controller  Machine <machine-name> has been marked as unhealthy
  Normal  SuccessfulDrainNode     3m51s (x59 over 178m)   machine-controller             success draining Machine's node "<machine-name>"
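In this example the VolumeDetachSucceeded condition reports WaitingForVolumeDetach. In that situation it can also help to check whether any VolumeAttachment objects still reference the node (the node name placeholder below is illustrative):
kubectl get volumeattachments | grep <node-name>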
SSH to the machine:
ssh capv@<machine-ip-address>
Verify the status of kubelet and containerd:
systemctl status kubelet
systemctl status containerd
Check whether the machine's IP is already assigned to another node in the environment.
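For example, a quick check against the node list (the IP placeholder is illustrative):
kubectl get nodes -o wide | grep <machine-ip-address>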
Confirm if any pods are still running on the machine marked for deletion.
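For example:
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>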
Once you've validated that the machine can be safely removed, proceed to delete it.
Use the --force option or remove the finalizer to complete the deletion.
kubectl delete ma -n <namespace> <machine-name> --force
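If the Machine object still does not go away after a forced delete, the finalizers can be cleared manually. This is a minimal sketch and it bypasses normal cleanup, so use it only after the checks above:
kubectl patch ma -n <namespace> <machine-name> --type=merge -p '{"metadata":{"finalizers":null}}'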
If only 2 out of 3 control plane nodes are in a running state:
Describe the third machine that's stuck in the provisioning phase.
Review the events and status conditions to identify where it's getting blocked.
If it's stuck at IP allocation, check whether free IPs are available and confirm that etcd is still in quorum.
SSH into one of the healthy control plane nodes and verify etcd quorum status.
If any member is not part of the quorum, remove it to allow recovery.
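As an example, on a healthy control plane node etcd quorum can be inspected and a failed member removed with etcdctl run inside the etcd static pod container. This is a sketch that assumes a containerd runtime and the standard kubeadm certificate paths; adjust to your environment:
# Locate the etcd container and define the etcdctl flags once
ETCD=$(sudo crictl ps --name etcd -q | head -n1)
ETCDCTL="etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"
# List members and check endpoint health across the cluster
sudo crictl exec "$ETCD" $ETCDCTL member list -w table
sudo crictl exec "$ETCD" $ETCDCTL endpoint health --cluster
# Remove a member that has dropped out of the quorum (use the ID from 'member list')
sudo crictl exec "$ETCD" $ETCDCTL member remove <member-id>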
In some cases, certificate mismatches under /etc/containerd/certs can cause issues:
Certificates may be expired or not aligned across nodes.
Confirm the correct CA certs with the customer.
If needed, edit the KubeadmControlPlane (kcp) object of the affected cluster and update it with the correct CA certificate.
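As a sketch, certificate expiry and alignment can be compared on each node with openssl and sha256sum, and the kcp object can then be edited (in TKG the CA certificate is typically carried under spec.kubeadmConfigSpec.files). The file path and object name below are illustrative placeholders:
# On each node: check expiry and compare the hash across nodes
openssl x509 -noout -enddate -subject -in <path-to-ca-cert>
sha256sum <path-to-ca-cert>
# Edit the KubeadmControlPlane object for the affected cluster
kubectl edit kcp -n <namespace> <kcp-name>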