TKGm Multiple nodes stuck in Deleting or Provisioning phase
Article ID: 408413

Updated On:

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Multiple clusters are stuck in the creating or updating state. The cluster state can be verified with tanzu cluster list --include-management-cluster:

$ tanzu cluster list --include-management-cluster
NAME             NAMESPACE       STATUS    CONTROLPLANE  WORKERS  KUBERNETES               ROLES           PLAN  TKR
cluster1         dev             updating  3/3           10/11    v1.27.5+vmware.1-fips.1  <none>          prod  v1.27.5---vmware.1-fips.1-tkg.8
cluster2         dev             creating  0/3           0/3      v1.26.8+vmware.1-fips.1  <none>          prod  v1.26.8---vmware.1-fips.1-tkg.3
cluster3         dev             updating  3/3           2/3      v1.26.8+vmware.1-fips.1  <none>          prod  v1.26.8---vmware.1-fips.1-tkg.3

Environment

VMware Tanzu Kubernetes Grid

Cause

A cluster can become stuck in the updating or creating state due to infrastructure changes such as an ESXi patch update.

As a result, its machines can get stuck in the Deleting or Provisioning phase.

Resolution

Describe the stuck machine that is in the Deleting or Provisioning phase.

$ kubectl get ma -n <namespace>
NAME                            CLUSTER     NODENAME                    PROVIDERID          PHASE      AGE     VERSION
cluster1-md-0-xk68pn-nx58       cluster     cluster1-md-0-xk68pn-nx58q  vsphere://421f...   Running    139d    v1.26.8+vmware.1-fips.1
cluster1-md-0-k68pn-q9gds       cluster1    cluster1-md-0-k68pn-q9gds   vsphere://4216...   Deleting   139d    v1.26.8+vmware.1-fips.1
$ kubectl describe ma -n <namespace> <machine-name>    
Conditions:
  Last Transition Time:  2025-08-25T01:27:42Z
  Message:               Condition Ready on node is reporting status Unknown for more than 5m0s
  Reason:                UnhealthyNode
  Severity:              Warning
  Status:                False
  Type:                  Ready
  Last Transition Time:  2025-04-07T14:46:58Z
  Status:                True
  Type:                  BootstrapReady
  Last Transition Time:  2025-08-25T01:32:55Z
  Status:                True
  Type:                  DrainingSucceeded
  Last Transition Time:  2025-08-25T01:27:13Z
  Message:               Condition Ready on node is reporting status Unknown for more than 5m0s
  Reason:                UnhealthyNode
  Severity:              Warning
  Status:                False
  Type:                  HealthCheckSucceeded
  Last Transition Time:  2025-04-07T14:47:54Z
  Status:                True
  Type:                  InfrastructureReady
  Last Transition Time:  2025-08-25T01:22:15Z
  Message:               Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown.
  Reason:                NodeConditionsFailed
  Status:                Unknown
  Type:                  NodeHealthy
  Last Transition Time:  2025-08-25T01:27:13Z
  Reason:                WaitingForRemediation
  Severity:              Warning
  Status:                False
  Type:                  OwnerRemediated
  Last Transition Time:  2025-08-25T01:27:13Z
  Status:                True
  Type:                  PreDrainDeleteHookSucceeded
  Last Transition Time:  2025-08-25T01:32:55Z
  Message:               Waiting for node volumes to be detached
  Reason:                WaitingForVolumeDetach
  Severity:              Info
  Status:                False
  Type:                  VolumeDetachSucceeded
Infrastructure Ready:    true
Last Updated:            2025-08-25T01:27:42Z
Phase:                   Deleting
Events:
  Type    Reason                  Age                     From                           Message
  ----    ------                  ----                    ----                           -------
  Normal  MachineMarkedUnhealthy  4m14s (x301 over 3h4m)  machinehealthcheck-controller  Machine <machine-name> has been marked as unhealthy
  Normal  SuccessfulDrainNode     3m51s (x59 over 178m)   machine-controller             success draining Machine's node "<machine-name>"

SSH to the machine:

ssh capv@<machine-ip-address>

Verify the status of kubelet and containerd:

systemctl status kubelet
systemctl status containerd

  • Check whether the machine's IP is already assigned to another node in the environment (example commands for these checks are shown after this list).

  • Confirm if any pods are still running on the machine marked for deletion.

  • Once you've validated that the machine can be safely removed, proceed to delete it.

  • Use the --force option or remove the finalizer to complete the deletion (a finalizer-removal example is shown after this list).

    kubectl delete ma -n <namespace> <machine-name> --force
  • If only 2 out of 3 control plane nodes are in a running state:

    • Describe the third machine that's stuck in the provisioning phase.

    • Review the events and status conditions to identify where it's getting blocked.

    • If it's stuck at IP allocation, check whether free IPs are available and confirm that etcd is still in quorum.

    • SSH into one of the healthy control plane nodes and verify the etcd quorum status (see the etcdctl example after this list).

      • If any member is not part of the quorum, remove it to allow recovery.

    In some cases, certificate mismatches under /etc/containerd/certs can cause issues:

    • Certificates may be expired or not aligned across nodes (a certificate-check example is shown after this list).

    • Confirm the correct CA certs with the customer.

    • If needed, edit the kcp (KubeadmControlPlane) object for the affected cluster's control plane and update it with the correct CA certificate.
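
For the IP and pod checks earlier in this list, the commands below are one possible approach; <machine-ip-address> and <node-name> are placeholders for values from your environment:

# List node IPs and check whether the stuck machine's IP is already in use by another node
kubectl get nodes -o wide | grep <machine-ip-address>

# List any pods still scheduled on the node that is marked for deletion
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>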
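
If the forced delete hangs because the finalizer is still present, one way to clear it is a merge patch on the Machine object. This is a sketch that assumes the machine carries no other finalizers you need to preserve:

# Remove all finalizers from the stuck Machine so deletion can complete
kubectl patch ma -n <namespace> <machine-name> --type merge -p '{"metadata":{"finalizers":null}}'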
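
To verify the etcd quorum from a healthy control plane node, etcdctl can be run against the local member. The certificate paths below assume the standard kubeadm layout; on some builds etcdctl is only available inside the etcd static pod container (for example via crictl exec), so adjust as needed:

# Run from a healthy control plane node; certificate paths assume the standard kubeadm layout
export ETCDCTL_API=3
ETCD_FLAGS="--endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key"

# List members and confirm quorum/health
etcdctl $ETCD_FLAGS member list -w table
etcdctl $ETCD_FLAGS endpoint health --cluster

# Remove a member that is no longer part of the quorum (ID taken from 'member list')
etcdctl $ETCD_FLAGS member remove <member-id>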
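
For the certificate checks, openssl can print the expiry, issuer, and fingerprint of the CA files so they can be compared across nodes; the exact file layout under /etc/containerd/certs varies by environment, so the path below is only an example. The kcp commands assume the kcp short name for the KubeadmControlPlane resource and placeholder object names:

# On each node, inspect a registry CA certificate (adjust the path to your layout)
openssl x509 -in /etc/containerd/certs/<registry>/ca.crt -noout -subject -issuer -enddate

# Compare the SHA-256 fingerprint of the same certificate across nodes to spot mismatches
openssl x509 -in /etc/containerd/certs/<registry>/ca.crt -noout -fingerprint -sha256

# Locate and edit the KubeadmControlPlane object for the affected cluster
kubectl get kcp -n <namespace>
kubectl edit kcp -n <namespace> <kcp-name>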