CaaS cluster operations do not complete in vSphere

Article ID: 314266

Products

VMware Telco Cloud Automation

Issue/Introduction

Symptoms:
After vCenter upgrades, patching, or other vCenter maintenance operations, new Telco Cloud Automation (TCA) cluster operations fail or do not complete.

Other symptoms may include:

  • When provisioning TKG workload clusters, nodes do not change to ready status.
  • No VMs are created in vSphere after initiating a workload cluster deployment or after scaling out the workload cluster.
  • TKG VMs on existing clusters are not deleted or replaced with new VMs during a rolling update.
  • No tasks appear in vCenter for the creation of TKG VMs or template clones.
  • vSphereVM and machine objects in the management cluster context for the workload cluster do not show a provider ID.
  • Creation of workload cluster VMs is stuck indefinitely.
  • The cluster is stuck in the provisioning state.
  • CAPV controller logs repeatedly show a message similar to the following:
    I0924 21:50:02.070872    1 vimmachine.go:147] "capv-controller-manager/vspheremachine-controller/Cluster-NameSpace/Cluster-Name-control-plane-gnx88-htrkg: waiting for ready state"
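
Several of the symptoms above can be checked from the TKG management cluster context. The following is a minimal sketch, assuming the CAPV controller runs as deploy/capv-controller-manager in the capv-system namespace (the same target used by the workaround in this article). The function names and the 10-minute log window are illustrative choices, not part of the product.

```shell
#!/usr/bin/env bash
# Run in the TKG management cluster context.

# List Cluster API Machine objects with their provider IDs.
# Machines affected by this issue show <none> in the PROVIDERID column.
list_machine_provider_ids() {
  kubectl get machines.cluster.x-k8s.io -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PROVIDERID:.spec.providerID,PHASE:.status.phase'
}

# Count how many times "waiting for ready state" appears in the
# CAPV controller log over the last 10 minutes (assumed window).
count_capv_waiting_messages() {
  kubectl logs deploy/capv-controller-manager -n capv-system --since=10m \
    | grep -c 'waiting for ready state'
}
```

Calling `list_machine_provider_ids` followed by `count_capv_waiting_messages` gives a quick indication: machines without a provider ID combined with a steadily growing count of the waiting message are consistent with the symptoms listed above, though they do not prove the cause on their own.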

Environment

VMware Telco Cloud Automation 2.x

Cause

This issue is caused by a disconnect between the Cluster API Provider for vSphere (CAPV) and the vCenter API. In some instances, CAPV is unable to restore connectivity. This is a known issue in CAPV. For additional details, refer to the article "TKG VMs Not Provisioned in vSphere - status.ready not found vSphereVM".

Resolution

Resolved in TCA 3.0+ with TKG 2.3.1+.

Workaround:

Restart the CAPV controller. Restarting it has no impact on existing clusters.

  • In the TKG Management cluster context, run the following command to restart the CAPV controller:
    kubectl rollout restart deploy/capv-controller-manager -n capv-system
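
After the restart, it can help to confirm that the new controller pod has rolled out before retrying the cluster operation. The sketch below wraps the restart command from this article together with `kubectl rollout status`. The function name and the 5-minute timeout are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Run in the TKG management cluster context.

# Restart the CAPV controller, then block until the new pod is rolled out
# or the (assumed) 5-minute timeout expires.
restart_capv_controller() {
  kubectl rollout restart deploy/capv-controller-manager -n capv-system &&
    kubectl rollout status deploy/capv-controller-manager -n capv-system --timeout=5m
}
```

Run `restart_capv_controller` and check that it exits with status 0. Once the rollout completes, pending VM creation or deletion tasks should begin to appear in vCenter if the CAPV disconnect was the cause.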



Additional Information

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/1949 - Public upstream issue