CaaS cluster operations do not complete in vSphere

Article ID: 314266

Products

VMware Telco Cloud Automation
VMware Tanzu Kubernetes Grid

Issue/Introduction

  • After vCenter upgrades, patching, or other vCenter maintenance operations, new Telco Cloud Automation (TCA) cluster operations fail or do not complete.
  • When provisioning TKG workload clusters, nodes do not reach the Ready status.
  • No VMs, or only one VM, are created in vSphere after initiating a workload cluster deployment or after scaling out the workload cluster.
  • TKG VMs on existing clusters are not deleted or replaced with new VMs.
  • No tasks appear in vCenter for the creation of TKG VMs or template clones.
  • VSphereVM and Machine objects in the management cluster context for the workload cluster do not show a provider ID (see the verification commands after this list).
  • Creation of workload cluster VMs is stuck indefinitely.
  • The cluster is stuck in the Provisioning state.
  • CAPV controller logs show a message similar to the following, repeating continuously:
    I0924 21:50:02.070872    1 vimmachine.go:###] "capv-controller-manager/vspheremachine-controller/Cluster-NameSpace/Cluster-Name-control-plane-#####-#####: waiting for ready state"
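
The missing provider ID can be confirmed from the TCA management cluster context. A minimal sketch, where <cluster-namespace> and <machine-name> are placeholders for the workload cluster's actual values:

    # List the CAPI Machine and CAPV VSphereVM objects for the workload cluster;
    # an empty PROVIDERID column matches the symptom described above.
    kubectl get machines -n <cluster-namespace>
    kubectl get vspherevms -n <cluster-namespace>
    # Print a single Machine's provider ID directly; empty output means it was never set.
    kubectl get machine <machine-name> -n <cluster-namespace> -o jsonpath='{.spec.providerID}'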

Environment

TKG 2.x

TCA 2.x, 3.x

Cause

This is due to a disconnect between CAPV and the vCenter API; in some instances, CAPV is unable to restore connectivity on its own. This is a known issue in the Cluster API Provider vSphere (CAPV). Refer to TKG VMs Not Provisioned in vSphere - status.ready not found vSphereVM for additional details.
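To confirm this cause, the CAPV controller logs can be inspected from the TCA management cluster context; a minimal sketch:

    # Check the CAPV controller logs for the repeating reconcile message
    kubectl logs -n capv-system deploy/capv-controller-manager | grep "waiting for ready state"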

Resolution

This issue is fixed in TCA 3.0 and later with TKG 2.3.1 and later.

Workaround:
Restart the CAPV controller. Restarting the controller has no impact on existing clusters.

  • In the TCA Management cluster context, run the following commands to restart the CAPV and CAPI deployments:
    kubectl rollout restart deploy/capi-controller-manager -n capi-system
    kubectl rollout restart deploy/capv-controller-manager -n capv-system
    kubectl rollout restart deploy/capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system
    kubectl rollout restart deploy/capi-kubeadm-bootstrap-controller-manager -n capi-kubeadm-bootstrap-system
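
After the restarts, each rollout can be verified before retrying the cluster operation; a minimal sketch:

    # Wait for each controller deployment to finish rolling out
    kubectl rollout status deploy/capi-controller-manager -n capi-system
    kubectl rollout status deploy/capv-controller-manager -n capv-system
    kubectl rollout status deploy/capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system
    kubectl rollout status deploy/capi-kubeadm-bootstrap-controller-manager -n capi-kubeadm-bootstrap-system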

Additional Information

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/1949 - Public upstream fix (pull request) in the Cluster API Provider vSphere repository