TKG VMs Not Provisioned in vSphere - status.ready not found vSphereVM


Article ID: 342246


Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:
  • When provisioning workload clusters on TKG, the cluster never reaches ready status.
  • No VMs are created in vSphere after initiating a workload cluster deployment or after scaling out the workload cluster.
  • TKG VMs on existing clusters are not deleted or replaced with new VMs.
  • No tasks appear in vCenter for the creation of TKG VMs or template clones.
  • vSphereVM and Machine objects in the management cluster context for the workload cluster never receive a provider ID.
  • Creation of workload cluster VMs is stuck indefinitely.
  • The cluster is stuck in provisioning.
  • CAPV controller logs show messages similar to the following, repeating continuously:

I0924 21:50:02.070872    1 vimmachine.go:147] "capv-controller-manager/vspheremachine-controller/<CLUSTER-NAMESPACE>/<CLUSTER-NAME>-control-plane-gnx88-htrkg: waiting for ready state" 

I0924 21:50:02.071795    1 vimmachine.go:432] "capv-controller-manager/vspheremachine-controller/tkg-system/<MGMT-CLUSTER-NAME>-md-1-infra-g5jrq-txq7c: updated vm" vm="tkg-system/<MGMT-CLUSTER-NAME>-md-1-9fcbn-65kv57b5b-bdbbf" 

I0924 21:50:02.071883    1 vimmachine.go:432] "capv-controller-manager/vspheremachine-controller/<CLUSTER-NAMESPACE>/<CLUSTER-NAME>-md-1-infra-kgl22-5w2wm: updated vm" vm="<CLUSTER-NAMESPACE>/<CLUSTER-NAME>-md-1-g96v4-998dfc6c7f9-mqh9k" 
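To confirm the symptom, the Machine objects in the management cluster can be checked for a missing provider ID. A minimal sketch, assuming the current kubectl context is the TKG management cluster; the `stuck_machines` helper name is illustrative, not part of TKG:

```shell
# stuck_machines: read the custom-columns output shown in the usage below
# and print the rows whose PROVIDERID column is "<none>" (kubectl prints
# "<none>" when .spec.providerID was never set on the Machine).
stuck_machines() {
  awk 'NR > 1 && $3 == "<none>" {print $1, $2}'
}

# Usage against the TKG management cluster context:
#   kubectl get machines -A \
#     -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PROVIDERID:.spec.providerID' \
#     | stuck_machines
```

Machines that stay in this list while the CAPV logs repeat "waiting for ready state" match the symptom described above.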


Environment

VMware Tanzu Kubernetes Grid 1.x
VMware Tanzu Kubernetes Grid 2.2.0
VMware Tanzu Kubernetes Grid 2.1.0

Cause

  • This issue can be caused by intermittent network disconnects or temporary unavailability of vCenter from the TKG management cluster. This is a known issue with the upstream Cluster API Provider vSphere (CAPV).

Resolution

  • This issue is resolved in TKG versions that include a cluster-api-provider-vsphere release containing the fix.
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/1949 - Public upstream issue
The fix is included in cluster-api-provider-vsphere releases v1.5.6, v1.7.1, and v1.8.0: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/releases

Workaround:

Restart the CAPV controller. There should be no impact on existing clusters:

In the TKG management cluster context, run the following to list the CAPV controller deployment and record its name and namespace:

 

kubectl get deployments -A | grep capv
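The namespace differs between TKG versions (vmware-system-capv or capv-system), so it can also be captured programmatically. A small sketch; the `extract_capv_ns` helper name is illustrative, not part of TKG:

```shell
# extract_capv_ns: read `kubectl get deployments -A` output on stdin and
# print the namespace (first column) of the capv-controller-manager row.
extract_capv_ns() {
  awk '/capv-controller-manager/ {print $1; exit}'
}

# Usage:
#   CAPV_NS=$(kubectl get deployments -A --no-headers | extract_capv_ns)
#   echo "$CAPV_NS"   # vmware-system-capv or capv-system
```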

 

Restart the CAPV controller with the following command, using the namespace recorded above (vmware-system-capv or capv-system, depending on the TKG version):

 

kubectl rollout restart deployment -n vmware-system-capv capv-controller-manager 
OR

kubectl rollout restart deployment -n capv-system capv-controller-manager

 

Validate that the CAPV controller pods are back up and in ready status using the following command:

 

kubectl get pods -A | grep capv

 

Validate in the vCenter UI that VMs are now being provisioned, or that existing VMs are being removed and replaced.
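The workaround steps above can be sketched as a single script. This is a hedged sketch, assuming the current kubectl context is the TKG management cluster and the standard deployment name capv-controller-manager; the `restart_capv` function name and the 180s timeout are illustrative choices:

```shell
# restart_capv: locate the CAPV controller deployment in whichever
# namespace it runs in, restart it, and wait for the rollout to finish.
restart_capv() {
  ns=$(kubectl get deployments -A --no-headers | awk '/capv-controller-manager/ {print $1; exit}')
  if [ -z "$ns" ]; then
    echo "capv-controller-manager deployment not found" >&2
    return 1
  fi
  kubectl rollout restart deployment -n "$ns" capv-controller-manager &&
    kubectl rollout status deployment -n "$ns" capv-controller-manager --timeout=180s
}
```

`kubectl rollout status` blocks until the restarted pods report ready, which covers the pod-validation step as well.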


Additional Information

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/1949 - Public upstream issue

Impact/Risks:
  • Unable to provision new workload clusters or new VMs for existing clusters.