Scenario:
Creation of a new Kubernetes guest cluster through TMC (Tanzu Mission Control) does not complete in one specific vSphere Namespace.
When TMC is used, the guest cluster object is created on vSphere, but no VMs are created for the control plane or worker nodes.
Attempts to create guest clusters in other vSphere Namespaces are successful.
Attempts to create a guest cluster manually with the CLI in the problematic vSphere Namespace result in the same failure.
vSphere IaaS Control Plane (vSphere with Tanzu)
NSX-T
This occurred because one of the two deployed NSX nodes was in maintenance mode. It is also worth checking for expired NSX certificates and rotating any that are found.
Errors from: kubectl describe cluster CLUSTER_NAME
Warning ReconcileFailure 17m (x10 over 17m) vmware-system-capw/vmware-system-capw-controller-manager/WCPCluster unexpected error while reconciling control plane endpoint for <VS NAMESPACE>: failed to reconcile loadbalanced endpoint for WCPCluster <VS NAMESPACE>/<VS CLUSTER>: failed to get control plane endpoint for Cluster <VS NAMESPACE>/<VS CLUSTER>: VirtualMachineService LB does not yet have VIP assigned: VirtualMachineService LoadBalancer does not have any Ingresses
Warning ReconcileFailure 59s (x10 over 17m) vmware-system-capw/vmware-system-capw-controller-manager/WCPCluster failed to configure cluster network for WCPCluster <VS NAMESPACE>/<VS CLUSTER>: virtual network ready status is: 'False' in cluster <VS NAMESPACE>/<VS CLUSTER>. reason: NetworkNotRealized, message:
1 wcpmachine_controller.go:315] vmware-system-capw-controller-manager/WCPMachine/<VS NAMESPACE>/<VS CLUSTER>/<VS CLUSTER>-default-nodepool-qzxvj-9jjmf "msg"="Waiting for the control plane to be initialized"
Errors from: vmware-system-capw-controller-manager log
E0114 18:58:45.235498 1 controller.go:317] controller/WCPCluster "msg"="Reconciler error" "error"="failed to configure cluster network for WCPCluster <VS NAMESPACE>/<VS CLUSTER>: virtual network ready status is: 'False' in cluster <VS NAMESPACE>/<VS CLUSTER>. reason: NetworkNotRealized, message: Cannot realize network" "name"="<NETWORK NAME>" "namespace"="<VS NAMESPACE>" "reconciler group"="infrastructure.cluster.vmware.com" "reconciler kind"="WCPCluster"
Errors from: capi-kubeadm-controller log
NOTE: Regarding the missing VMs, this shows that the initial control plane Machine object never comes up:
I0114 18:24:44.682681 1 scale.go:212] "msg"="Waiting for control plane to pass preflight checks" "cluster-name"="CLUSTER_NAME" "name"="CLUSTER_NAME-control-plane" "namespace"="VS_NAMESPACE" "failures"="[machine CLUSTER_NAME-control-plane-5w92q reports ControllerManagerPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine CLUSTER_NAME-control-plane-5w92q reports SchedulerPodHealthy condition is false (Info, Waiting for startup or readiness probes), machine CLUSTER_NAME-control-plane-8cnzp reports APIServerPodHealthy condition is false (Error, Missing node), machine CLUSTER_NAME-control-plane-8cnzp reports ControllerManagerPodHealthy condition is false (Error, Missing node), machine CLUSTER_NAME-control-plane-8cnzp reports SchedulerPodHealthy condition is false (Error, Missing node), machine CLUSTER_NAME-control-plane-8cnzp reports EtcdPodHealthy condition is false (Error, Missing node), machine CLUSTER_NAME-control-plane-8cnzp does not have EtcdMember
Error from: vmware-system-nsx_nsx-ncp log
2025-01-17T16:34:40.883504843Z stderr F [ncp GreenThread-50 W] nsx_ujo.ncp.nsx.policy.firewall_service Missing LB source IP for namespace VSPHERE_NAMESPACE
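The error strings quoted above can be searched for in collected output to confirm that a failing namespace matches this scenario. The following is a minimal sketch, not a supported tool: the function name scan_network_signatures is hypothetical, and the three strings it looks for are taken verbatim from the log excerpts in this article. It assumes the output of kubectl describe or kubectl logs is piped in on stdin.

```shell
#!/bin/sh
# Hypothetical helper: scan text on stdin for the failure signatures
# quoted in this article. Prints one "MATCH: ..." line per signature found.
# Returns 0 if at least one signature matched, 1 if none matched.
scan_network_signatures() {
    input=$(cat)
    found=1
    for sig in \
        "NetworkNotRealized" \
        "VirtualMachineService LoadBalancer does not have any Ingresses" \
        "Missing LB source IP for namespace"
    do
        # -F treats the signature as a fixed string, -q suppresses grep output
        if printf '%s\n' "$input" | grep -qF -- "$sig"; then
            printf 'MATCH: %s\n' "$sig"
            found=0
        fi
    done
    return "$found"
}

# Example usage (assumes kubectl access to the Supervisor; CLUSTER_NAME and
# VS_NAMESPACE are placeholders):
#   kubectl describe cluster CLUSTER_NAME -n VS_NAMESPACE | scan_network_signatures
```

A match only indicates that the namespace network was not realized; the NSX node state still has to be checked on the NSX side to confirm the cause.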
To resolve the issue, bring the NSX node out of maintenance mode.