Node in NotReady state and Machine resource failed or stuck in Provisioning state

Article ID: 378405

Products

VMware Telco Cloud Automation

Issue/Introduction

  • Machine resources in TKG may show as failed or remain stuck in a provisioning state.

Environment

VMware Telco Cloud Automation 2.3

Cause

  • Sometimes, after an unexpected reboot of the nodes or the cluster, nodes appear in NotReady state.
  • An application or workload issue can also leave the node(s) in NotReady state.
  • The Machine resource remains stuck in provisioning because a pod on the node is not in Ready status.
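The symptom above can be confirmed with a few kubectl checks against the TKG management cluster. This is a sketch: the namespace and node name are hypothetical placeholders, and DRY_RUN defaults to 1 so the commands are only printed for review rather than executed.

```shell
#!/bin/sh
# Placeholders: substitute your own namespace and node name.
NAMESPACE="${NAMESPACE:-tkg-ns}"
NODENAME="${NODENAME:-workload-node-1}"

# DRY_RUN defaults to 1: print each command instead of running it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

# Affected nodes show STATUS "NotReady".
run kubectl get nodes
# Affected Machines show PHASE "Provisioning" or "Failed".
run kubectl get machines -n "$NAMESPACE"
# Pods that are not Ready on the node can hold the Machine back.
run kubectl get pods -A --field-selector "spec.nodeName=$NODENAME"
```

Set DRY_RUN=0 (with kubectl pointed at the management cluster context) to execute the checks for real.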

Resolution

Remove the Machine, VSphereMachine, and VSphereVM resources and let CAPI/CAPV (Cluster API and its vSphere provider) recreate them, using the following steps:

  1. Identify the Machine resource in the TKG management cluster:
    kubectl get machine -n NAMESPACE | grep MACHINENAME
  2. Delete the Machine resource:

    kubectl delete machine MACHINENAME -n NAMESPACE


    NOTE: If step 2 does not automatically provision a new machine in Running status, proceed with the next steps.

  3. Delete the VSphereMachine resource:

    kubectl delete vspheremachine VSPHEREMACHINENAME -n NAMESPACE
  4. Delete the VSphereVM resource:

    kubectl delete vspherevm VSPHEREVMNAME -n NAMESPACE
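The four steps above can be sketched as a single script. The namespace and machine name are hypothetical placeholders, and the script assumes the VSphereMachine/VSphereVM names match the Machine name (in CAPV they often do, but verify with kubectl get first). DRY_RUN defaults to 1, so the commands are printed for review rather than executed.

```shell
#!/bin/sh
# Placeholders: substitute the values from your own environment.
NAMESPACE="${NAMESPACE:-tkg-ns}"
MACHINENAME="${MACHINENAME:-workload-md-0-abcde}"

# DRY_RUN defaults to 1: print each command instead of running it.
# Set DRY_RUN=0 to actually run against the management cluster.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

# Step 1: identify the stuck Machine resource.
run kubectl get machine -n "$NAMESPACE"

# Step 2: delete the Machine and let CAPI/CAPV recreate it.
run kubectl delete machine "$MACHINENAME" -n "$NAMESPACE"

# Steps 3-4: only needed if a new Machine does not reach Running status.
run kubectl delete vspheremachine "$MACHINENAME" -n "$NAMESPACE"
run kubectl delete vspherevm "$MACHINENAME" -n "$NAMESPACE"
```

Review the printed commands, then re-run with DRY_RUN=0 to apply them.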

Additional Information

After the machines are removed from the CLI, the nodes may be recreated automatically because Machine Health Check is enabled.
To synchronize the replica count for the nodes, edit the cluster configuration from the TCA-M GUI with the correct node replica count and wait for the nodes to be provisioned.