Cluster creation stuck in pending state due to capi-controller-manager in crashloop state
search cancel

Cluster creation stuck in pending state due to capi-controller-manager in crashloop state

book

Article ID: 416442

calendar_today

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • During cluster creation operations, the capi-controller-manager fails to retrieve the clusterclass from the runtime-extension pods. This lack of communication prevents the cluster from completing the provisioning cycle, leaving it stuck in a pending state where virtual machines loop between provisioning and deleting.
  • The capi-controller-manager pods enter a CrashLoopBackOff state.
  • The following error is continuously written to the capi-controller-manager pod logs:

    "Reconciler error" err="failed to discover variables for ClusterClass builtin-generic-v3.#.#: failed to call DiscoverVariables for patch default: failed to call extension handler \"discover-variables.runtime-extension\": failed to get extension handler \"discover-variables.runtime-extension\" from registry: invalid operation: Get cannot be called on a registry not yet ready" controller="clusterclass"

  • The following error is also written to the tkg-controller pod logs:

    failed to list *v1alpha1.ClusterVirtualMachineImage: conversion webhook for vmoperator.vmware.com/v1alpha2, Kind=ClusterVirtualMachineImage failed: Post "https://vmware-system-vmop-webhook-service.vmware-system-vmop.svc:443/convert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority

Environment

  • VMware vSphere 8.0 U3
  • vSphere with Tanzu (Supervisor)

Cause

The runtime-extension-controller-manager fails to securely reach the Supervisor node, typically due to stuck, pending, or invalid internal certificates. This severs the registry connection required by the capi-controller-manager to discover variables and build the clusterclass.

Resolution

Execute the following steps to forcefully refresh the controller certificates and restart the management pods:

  1. Log in to the Supervisor Cluster control plane SSH session.
  2. Reboot the Supervisor control plane using the below command.

    reboot

  3. Once the Supervisor node returns online, identify the specific namespace hosting the runtime-extension-controller-manager deployment:

    kubectl get deployments -A | grep runtime-extension-controller-manager

  4. Issue a rollout restart for the runtime extension deployment using the namespace identified in the previous step.

    kubectl rollout restart deployment runtime-extension-controller-manager -n <INSERT_NAMESPACE>

  5. Verify the pods are initializing and wait for them to show a Running status:

    kubectl get pods -n <INSERT_NAMESPACE> | grep runtime-extension

  6. Monitor the capi-controller-manager logs to confirm the CrashLoopBackOff state has cleared and cluster creation has resumed.