Supervisor Cluster Deployment Fails or Remains Stuck in "Provisioning" or "Configuring" State

Article ID: 430866

Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

When enabling Workload Management on a vSphere Cluster, the Supervisor Cluster deployment fails to reach a "Healthy" status. The deployment remains stuck indefinitely in the "Provisioning" or "Configuring" state in the vCenter Server UI.

In this condition:

  • The Supervisor Cluster does not transition to a Healthy or Ready state.
  • One or more Supervisor Control Plane VMs appear powered on but never transition to a "Ready" state.
  • The "Namespaces" tab in vCenter shows the Supervisor Cluster status as "Configuring."

Diagnostic Verification:

  • SSH into the affected Supervisor Control Plane VM.
  • Execute the following command to check the API server status: crictl ps -a --name kube-apiserver
  • Check the logs for binding errors: crictl logs <container-id>
  • Review the cloud-init output for network failures: cat /var/log/cloud-init-output.log
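
If the second interface was never attached, this is also visible directly from the guest. As a quick sanity check, list the interfaces on the Control Plane VM:

ip -brief link show

On an affected node only lo and eth0 appear, with eth0 down; a healthy node also lists eth1 in the UP state.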

Typical log indicators include:

 1. Cloud-init Network Device Info

This snippet from the cloud-init output confirms that the node was provisioned without the second network interface (eth1) and that the existing interface (eth0) failed to initialize properly.

Cloud-init v. 25.1 running 'init'
ci-info: ++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device |  Up   |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: |  eth0  | False |     .     |     .     |   .   | 00:50:56:xx:xx:xx |
ci-info: |   lo   | True  | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+

 2. API Server Container Logs

This snippet from the kube-apiserver logs identifies the specific failure to locate the required eth1 interface and the resulting connection refusal.

stderr F I0225 10:07:07.698075 1 options.go:221] external host was not specified, using 10.10.xx.xxx
stderr F W0225 10:07:07.701754 1 filtered_certs.go:38] Unable to locate interface specified in filter: eth1
stderr F W0225 10:07:07.831592 1 logging.go:59] [core] [Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"

In addition, the VM hardware settings in vCenter show that only one vNIC is attached instead of the required two.
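
The vNIC count can also be checked from the command line with govc (VMware's govmomi CLI); the VM name below is illustrative and should be replaced with the name of the affected SupervisorControlPlaneVM:

govc device.ls -vm 'SupervisorControlPlaneVM (1)' | grep ethernet

An affected VM lists only ethernet-0, while a healthy Control Plane VM lists both ethernet-0 and ethernet-1.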

Environment

VMware vSphere 7.x / 8.x

vSphere with Tanzu (Supervisor Cluster) and VMware vSphere Kubernetes Service (VKS)

VCF 4.x / 5.x

Cause

The failure is caused by an automated provisioning anomaly where the Control Plane VM is deployed with a single vNIC. A standard, successful Supervisor deployment requires two (2) vNICs:

  1. eth0: Management Network (communication with vCenter/ESXi).
  2. eth1: Primary Workload Network (communication for the Kubernetes API and Services).

Because the automation engine (ESX Agent Manager) fails to attach the second interface, the kube-apiserver service cannot bind to its designated Workload Network IP address. This results in the API server container crashing or entering an "Exited" state during the bootstrap process.
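
This dependency chain can be confirmed from the shell of an affected Control Plane VM, reusing the diagnostic commands above:

ip addr show eth1 || echo "eth1 missing: Workload Network interface was never attached"
crictl ps -a --name kube-apiserver

With eth1 absent, the first command fails, and the kube-apiserver container typically shows an "Exited" STATE with a climbing ATTEMPT count.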

Resolution

Manual hardware modification of Supervisor Control Plane VMs is not supported and will not resolve the internal software configuration mismatch. A clean redeployment is required.

Step 1: Decommission the Failed Cluster

Disable Workload Management on the vSphere cluster to remove the partially deployed Supervisor and ensure all misconfigured components are cleanly removed.

Step 2: Verify Network Prerequisites

Before re-attempting deployment, ensure the underlying infrastructure can support the dual-NIC requirement:

  • IP Availability: Confirm both the Management and Workload IP pools have at least 5 available IP addresses per cluster (3 for the Control Plane VMs, 1 for the VIP, 1 for upgrade headroom).
  • Port Group Configuration: Ensure the Distributed Port Groups used for the Workload Network have correct VLAN tagging and are accessible by all hosts in the cluster (a scripted check is sketched below).
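
As a sketch, govc can display the VLAN configuration of the Workload port group before redeploying; the switch and port group names below are illustrative, and output details vary by govc version:

govc dvs.portgroup.info -pg 'Workload-PG' 'DSwitch-01' | grep -i vlan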

Step 3: Recreate the Supervisor Cluster

  1. Initiate a fresh deployment from vCenter.
  2. The automation engine will provision all three Control Plane VMs with the required dual-vNIC configuration.
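
After the deployment completes, the fix can be verified by repeating the diagnostic commands from the Issue section on one of the new Control Plane VMs:

ip -brief link show                # both eth0 and eth1 should be listed as UP
crictl ps --name kube-apiserver    # the container STATE should be Running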