TKGm Multiple nodes stuck in Deleting or Provisioning phase
Article ID: 408413

Updated On:

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Multiple clusters are stuck in the creating or updating state. The cluster state can be verified with tanzu cluster list --include-management-cluster:

$ tanzu cluster list --include-management-cluster
NAME             NAMESPACE       STATUS    CONTROLPLANE  WORKERS  KUBERNETES               ROLES           PLAN  TKR
cluster1         dev             updating  3/3           10/11    v1.27.5+vmware.1-fips.1  <none>          prod  v1.27.5---vmware.1-fips.1-tkg.8
cluster2         dev             creating  0/3           0/3      v1.26.8+vmware.1-fips.1  <none>          prod  v1.26.8---vmware.1-fips.1-tkg.3
cluster3         dev             updating  3/3           2/3      v1.26.8+vmware.1-fips.1  <none>          prod  v1.26.8---vmware.1-fips.1-tkg.3

Environment

VMware Tanzu Kubernetes Grid

Cause

A cluster can become stuck in the updating or creating state due to infrastructure changes such as an ESXi patch update.

As a result, its machines can get stuck in the Deleting or Provisioning phase.

Resolution

Describe the stuck machine that is in the Deleting or Provisioning phase.

$ kubectl get ma -n <namespace>
NAME                            CLUSTER     NODENAME                    PROVIDERID          PHASE      AGE     VERSION
cluster1-md-0-xk68pn-nx58       cluster     cluster1-md-0-xk68pn-nx58q  vsphere://421f...   Running    139d    v1.26.8+vmware.1-fips.1
cluster1-md-0-k68pn-q9gds       cluster1    cluster1-md-0-k68pn-q9gds   vsphere://4216...   Deleting   139d    v1.26.8+vmware.1-fips.1
$ kubectl describe ma -n <namespace> <machine-name>    
Conditions:
  Last Transition Time:  2025-08-25T01:27:42Z
  Message:               Condition Ready on node is reporting status Unknown for more than 5m0s
  Reason:                UnhealthyNode
  Severity:              Warning
  Status:                False
  Type:                  Ready
  Last Transition Time:  2025-04-07T14:46:58Z
  Status:                True
  Type:                  BootstrapReady
  Last Transition Time:  2025-08-25T01:32:55Z
  Status:                True
  Type:                  DrainingSucceeded
  Last Transition Time:  2025-08-25T01:27:13Z
  Message:               Condition Ready on node is reporting status Unknown for more than 5m0s
  Reason:                UnhealthyNode
  Severity:              Warning
  Status:                False
  Type:                  HealthCheckSucceeded
  Last Transition Time:  2025-04-07T14:47:54Z
  Status:                True
  Type:                  InfrastructureReady
  Last Transition Time:  2025-08-25T01:22:15Z
  Message:               Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown.
  Reason:                NodeConditionsFailed
  Status:                Unknown
  Type:                  NodeHealthy
  Last Transition Time:  2025-08-25T01:27:13Z
  Reason:                WaitingForRemediation
  Severity:              Warning
  Status:                False
  Type:                  OwnerRemediated
  Last Transition Time:  2025-08-25T01:27:13Z
  Status:                True
  Type:                  PreDrainDeleteHookSucceeded
  Last Transition Time:  2025-08-25T01:32:55Z
  Message:               Waiting for node volumes to be detached
  Reason:                WaitingForVolumeDetach
  Severity:              Info
  Status:                False
  Type:                  VolumeDetachSucceeded
Infrastructure Ready:    true
Last Updated:            2025-08-25T01:27:42Z
Phase:                   Deleting
Events:
  Type    Reason                  Age                     From                           Message
  ----    ------                  ----                    ----                           -------
  Normal  MachineMarkedUnhealthy  4m14s (x301 over 3h4m)  machinehealthcheck-controller  Machine <machine-name> has been marked as unhealthy
  Normal  SuccessfulDrainNode     3m51s (x59 over 178m)   machine-controller             success draining Machine's node "<machine-name>"

SSH to the machine:

ssh capv@<machine-ip-address>

Verify the status of kubelet and containerd:

systemctl status kubelet
systemctl status containerd

  • Check whether the machine's IP is already assigned to another node in the environment (example commands for these checks are shown after this list).

  • Confirm if any pods are still running on the machine marked for deletion.

  • Once you've validated that the machine can be safely removed, proceed to delete it.

  • Use the --force option or remove the finalizer to complete the deletion (a finalizer-removal example is shown after this list).

    kubectl delete ma -n <namespace> <machine-name> --force
  • If only 2 out of 3 control plane nodes are in a running state:

    • Describe the third machine that's stuck in the provisioning phase.

    • Review the events and status conditions to identify where it's getting blocked.

    • If it's stuck at IP allocation, check whether free IPs are available and confirm that etcd is still in quorum.

    • SSH into one of the healthy control plane nodes and verify the etcd quorum status (see the etcdctl example after this list).

      • If any member is not part of the quorum, remove it to allow recovery.

    In some cases, certificate mismatches under /etc/containerd/certs can cause issues:

    • Certificates may be expired or not aligned across nodes (a certificate-check example is shown after this list).

    • Confirm the correct CA certs with the customer.

    • If needed, edit the kcp (KubeadmControlPlane) object for the affected cluster's control plane and update it with the correct CA certificate.
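
For the IP and pod checks earlier in this list, the commands below are one possible approach; <machine-ip-address> and <node-name> are placeholders for values from your environment:

# List node IPs and check whether the stuck machine's IP is already in use by another node
kubectl get nodes -o wide | grep <machine-ip-address>

# List any pods still scheduled on the node that is marked for deletion
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>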
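
If the forced delete hangs because the finalizer is still present, one way to clear it is a merge patch on the Machine object. This is a sketch that assumes the machine carries no other finalizers you need to preserve:

# Remove all finalizers from the stuck Machine so deletion can complete
kubectl patch ma -n <namespace> <machine-name> --type merge -p '{"metadata":{"finalizers":null}}'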
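
To verify the etcd quorum from a healthy control plane node, etcdctl can be run against the local member. The certificate paths below assume the standard kubeadm layout; on some builds etcdctl is only available inside the etcd static pod container (for example via crictl exec), so adjust as needed:

# Run from a healthy control plane node; certificate paths assume the standard kubeadm layout
export ETCDCTL_API=3
ETCD_FLAGS="--endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key"

# List members and confirm quorum/health
etcdctl $ETCD_FLAGS member list -w table
etcdctl $ETCD_FLAGS endpoint health --cluster

# Remove a member that is no longer part of the quorum (ID taken from 'member list')
etcdctl $ETCD_FLAGS member remove <member-id>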
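
For the certificate checks, openssl can print the expiry, issuer, and fingerprint of the CA files so they can be compared across nodes; the exact file layout under /etc/containerd/certs varies by environment, so the path below is only an example. The kcp commands assume the kcp short name for the KubeadmControlPlane resource and placeholder object names:

# On each node, inspect a registry CA certificate (adjust the path to your layout)
openssl x509 -in /etc/containerd/certs/<registry>/ca.crt -noout -subject -issuer -enddate

# Compare the SHA-256 fingerprint of the same certificate across nodes to spot mismatches
openssl x509 -in /etc/containerd/certs/<registry>/ca.crt -noout -fingerprint -sha256

# Locate and edit the KubeadmControlPlane object for the affected cluster
kubectl get kcp -n <namespace>
kubectl edit kcp -n <namespace> <kcp-name>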