TKG VMs Not Provisioned in vSphere - status.ready not found vSphereVM



Article ID: 342246






Symptoms

  • When provisioning workload clusters in TKG, the cluster never reaches a ready status.
  • No VMs are created in vSphere after initiating a workload cluster deployment or after scaling out the workload cluster.
  • TKG VMs in existing clusters are not deleted or replaced with new VMs.
  • No tasks for the creation of TKG VMs or template clones appear in vCenter.
  • vSphereVM and Machine objects in the management cluster context for the workload cluster never receive a provider ID.
  • Creation of workload cluster VMs is stuck indefinitely.
  • The cluster is stuck in Provisioning.
  • CAPV controller logs show messages similar to the following, repeating continuously:

I0924 21:50:02.070872    1 vimmachine.go:147] "capv-controller-manager/vspheremachine-controller/<CLUSTER-NAMESPACE>/<CLUSTER-NAME>-control-plane-gnx88-htrkg: waiting for ready state" 

I0924 21:50:02.071795    1 vimmachine.go:432] "capv-controller-manager/vspheremachine-controller/tkg-system/<MGMT-CLUSTER-NAME>-md-1-infra-g5jrq-txq7c: updated vm" vm="tkg-system/<MGMT-CLUSTER-NAME>-md-1-9fcbn-65kv57b5b-bdbbf" 

I0924 21:50:02.071883    1 vimmachine.go:432] "capv-controller-manager/vspheremachine-controller/<CLUSTER-NAMESPACE>/<CLUSTER-NAME>-md-1-infra-kgl22-5w2wm: updated vm" vm="<CLUSTER-NAMESPACE>/<CLUSTER-NAME>-md-1-g96v4-998dfc6c7f9-mqh9k" 
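To confirm the symptom, the CAPV controller logs can be inspected directly. A minimal sketch, assuming the controller runs as the capv-controller-manager deployment in the vmware-system-capv namespace (on some TKG versions the namespace is capv-system instead):

```shell
# Tail the CAPV controller logs and count the repeating "waiting for ready state"
# messages. Adjust the namespace (-n) to match your TKG version.
kubectl logs -n vmware-system-capv deployment/capv-controller-manager --tail=200 \
  | grep -c "waiting for ready state"
```

A steadily growing count across repeated runs indicates the controller is stuck in the reconcile loop described above.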


Environment

VMware Tanzu Kubernetes Grid 1.x
VMware Tanzu Kubernetes Grid 2.2.0
VMware Tanzu Kubernetes Grid 2.1.0


Cause

  • This issue can be caused by intermittent network disconnects or temporary unavailability of vCenter to the TKG management cluster. This is a known issue with upstream Cluster API Provider vSphere (CAPV).


Resolution

  • This issue is resolved in versions of TKG that include a cluster-api-provider-vsphere release containing the fix - Public upstream issue
  • The fix is included in CAPV v1.5.6, v1.7.1, and v1.8.0.


Workaround

Restart the CAPV controller. There should be no impact on existing clusters.

In the TKG management cluster context, run the following to list the CAPV controller deployment, and record its name and namespace:


kubectl get deployments -A | grep capv


Restart the CAPV controller with the command that matches the namespace recorded above (vmware-system-capv or capv-system, depending on the TKG version):


kubectl rollout restart deployment -n vmware-system-capv capv-controller-manager 

kubectl rollout restart deployment -n capv-system capv-controller-manager
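The discovery and restart steps above can be combined into one sketch that finds the CAPV namespace automatically, assuming kubectl is pointed at the management cluster context:

```shell
# Find the namespace hosting the CAPV controller (it differs between TKG versions),
# then restart the deployment and wait for the rollout to complete.
ns=$(kubectl get deployments -A | awk '/capv-controller-manager/ {print $1; exit}')
kubectl rollout restart deployment -n "$ns" capv-controller-manager
kubectl rollout status deployment -n "$ns" capv-controller-manager
```

`kubectl rollout status` blocks until the new controller pods are available, so a clean exit here also covers the readiness check below.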


Validate that the CAPV controller pods are back up in a ready state using the following command:


kubectl get pods -A | grep capv


Validate in the vCenter UI that VMs are now being provisioned, or that existing VMs are being removed and replaced.
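Progress can also be checked from the management cluster context instead of the vCenter UI. A sketch using kubectl custom columns (kubectl prints `<none>` for Machines whose providerID has not been set yet):

```shell
# List Machines and their provider IDs; rows showing <none> are still waiting
# on CAPV to provision the backing vSphere VM.
kubectl get machines -A \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PROVIDERID:.spec.providerID' \
  | awk 'NR==1 || $3=="<none>"'
```

After a successful restart, the `<none>` rows should shrink as each Machine receives a `vsphere://` provider ID.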

Additional Information - Public upstream issue

  • Unable to provision new workload clusters or new VMs for existing clusters