Supervisor Cluster Deployment Fails or Remains Stuck in "Provisioning" or "Configuring" State

Article ID: 430866

Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

When enabling Workload Management on a vSphere Cluster, the Supervisor Cluster deployment fails to reach a "Healthy" status. The deployment remains stuck indefinitely in the "Provisioning" or "Configuring" state in the vCenter Server UI.

In this condition:

  • The Supervisor Cluster does not transition to a Healthy or Ready state.
  • One or more Supervisor Control Plane VMs appear powered on but never transition to a "Ready" state.
  • The "Namespaces" tab in vCenter shows the Supervisor Cluster status as "Configuring."

Diagnostic Verification:

  • SSH into the affected Supervisor Control Plane VM.
  • Execute the following command to check the API server status: crictl ps -a --name kube-apiserver
  • Check the logs for binding errors: crictl logs <container-id>
  • Review the cloud-init output for network failures: cat /var/log/cloud-init-output.log
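
If the second interface was never attached, this is also visible directly from the guest. As a quick sanity check, list the interfaces on the Control Plane VM:

ip -brief link show

On an affected node only lo and eth0 appear, with eth0 down; a healthy node also lists eth1 in the UP state.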

Typical log indicators include:

 1. Cloud-init Network Device Info

This snippet from the cloud-init output confirms that the node was provisioned without the second network interface (eth1) and that the existing interface (eth0) failed to initialize properly.

Cloud-init v. 25.1 running 'init'
ci-info: ++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device |  Up   |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: |  eth0  | False |     .     |     .     |   .   | 00:50:56:xx:xx:xx |
ci-info: |   lo   | True  | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+

 2. API Server Container Logs

This snippet from the kube-apiserver logs identifies the specific failure to locate the required eth1 interface and the resulting connection refusal.

stderr F I0225 10:07:07.698075 1 options.go:221] external host was not specified, using 10.10.xx.xxx
stderr F W0225 10:07:07.701754 1 filtered_certs.go:38] Unable to locate interface specified in filter: eth1
stderr F W0225 10:07:07.831592 1 logging.go:59] [core] [Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1:2379", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"

In addition, the VM hardware settings in vCenter show that only one vNIC is attached instead of the required two.
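
The vNIC count can also be checked from the command line with govc (VMware's govmomi CLI); the VM name below is illustrative and should be replaced with the name of the affected SupervisorControlPlaneVM:

govc device.ls -vm 'SupervisorControlPlaneVM (1)' | grep ethernet

An affected VM lists only ethernet-0, while a healthy Control Plane VM lists both ethernet-0 and ethernet-1.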

Environment

VMware vSphere 7.x / 8.x

vSphere with Tanzu (Supervisor Cluster) and VMware vSphere Kubernetes Service (VKS)

VCF 4.x / 5.x

Cause

The failure is caused by an automated provisioning anomaly where the Control Plane VM is deployed with a single vNIC. A standard, successful Supervisor deployment requires two (2) vNICs:

  1. eth0: Management Network (communication with vCenter/ESXi).
  2. eth1: Primary Workload Network (communication for the Kubernetes API and Services).

Because the automation engine (ESX Agent Manager) fails to attach the second interface, the kube-apiserver service cannot bind to its designated Workload Network IP address. This results in the API server container crashing or entering an "Exited" state during the bootstrap process.
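
This dependency chain can be confirmed from the shell of an affected Control Plane VM, reusing the diagnostic commands above:

ip addr show eth1 || echo "eth1 missing: Workload Network interface was never attached"
crictl ps -a --name kube-apiserver

With eth1 absent, the first command fails, and the kube-apiserver container typically shows an "Exited" STATE with a climbing ATTEMPT count.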

Resolution

Manual hardware modification of Supervisor Control Plane VMs is not supported and will not resolve the internal software configuration mismatch. A clean redeployment is required.

Step 1: Decommission the Failed Cluster

Disable Workload Management on the vSphere cluster to remove the partially deployed Supervisor and ensure all misconfigured components are cleanly removed.

Step 2: Verify Network Prerequisites

Before re-attempting deployment, ensure the underlying infrastructure can support the dual-NIC requirement:

  • IP Availability: Confirm both the Management and Workload IP pools have at least 5 available IP addresses per cluster (3 for the Control Plane VMs, 1 for the VIP, 1 for upgrade headroom).
  • Port Group Configuration: Ensure the Distributed Port Groups used for the Workload Network have correct VLAN tagging and are accessible by all hosts in the cluster (a scripted check is sketched below).
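
As a sketch, govc can display the VLAN configuration of the Workload port group before redeploying; the switch and port group names below are illustrative, and output details vary by govc version:

govc dvs.portgroup.info -pg 'Workload-PG' 'DSwitch-01' | grep -i vlan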

Step 3: Recreate the Supervisor Cluster

  1. Initiate a fresh deployment from vCenter.
  2. The automation engine will provision all three Control Plane VMs with the required dual-vNIC configuration.
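
After the deployment completes, the fix can be verified by repeating the diagnostic commands from the Issue section on one of the new Control Plane VMs:

ip -brief link show                # both eth0 and eth1 should be listed as UP
crictl ps --name kube-apiserver    # the container STATE should be Running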