Guest cluster upgrade is stuck with VM in provisioning state

Article ID: 419449

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • When upgrading the guest cluster, the first replacement control plane VM gets stuck in the Provisioning stage.
  • When checking the summary of the newly created VM in the vSphere UI, it is in a powered-off state with the error 'there is no network adapter assigned'.
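    The power state is also visible from the Supervisor cluster. A minimal check (the exact columns printed for VirtualMachine objects vary by VM Operator version; the placeholders match those used below):
    $ kubectl get vm <VM_name> -n <namespace>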
  • Describing the cluster shows the information below:
    $ kubectl describe cluster <cluster_name> -n <namespace>
        Reason:                MachineDeploymentsUpgradePending
        Severity:              Info
        Status:                False
        Type:                  TopologyReconciled
    Events:
    Type    Reason          Age   From              Message
    ----    ------          ----  ----              -------
    Normal  TopologyCreate  51m   topology/cluster  Created "VSphereMachineTemplate/<cluster_name>-control-plane-####" as a replacement for "<cluster_name>-control-plane-####" (template rotation)
    Normal  TopologyUpdate  51m   topology/cluster  Updated "KubeadmControlPlane/<cluster_name>-####" with version change from v1.##.##+vmware.1-fips.1 to v1.##.##+vmware.1-fips.1

  • Describing the VM stuck in Provisioning shows the below warning:
    $ kubectl describe vm <VM_name> -n <namespace>
     Warning  CreateOrUpdateFailure  4m26s (x27 over 71m)  vmware-system-vmop/vmware-system-vmop-controller-manager-######-####/virtualmachine-controller  timed out waiting for the condition
  • The events in the KubeadmControlPlane (KCP) of the affected cluster show the below warning:
    $ kubectl describe kcp -n <namespace> | grep <cluster_name>
    Events:
      Type     Reason                 Age                    From                              Message
      ----     ------                 ----                   ----                              -------
      Warning  ControlPlaneUnhealthy  106s (x4 over 11m)  kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: [machine <cluster_name>-####-#### does not have APIServerPodHealthy condition, machine <cluster_name>-####-#### does not have ControllerManagerPodHealthy condition, machine <cluster_name>-####-#### does not have SchedulerPodHealthy condition, machine <cluster_name>-####-#### does not have EtcdPodHealthy condition, machine <cluster_name>-####-#### does not have EtcdMemberHealthy condition]
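    As a supplementary check (a sketch; the default printed columns depend on the Cluster API version), the machines of the cluster can be listed to confirm that the new control plane machine is stuck in the Provisioning phase:
    $ kubectl get machines -n <namespace> | grep <cluster_name>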

  • The below errors are present in the vmop-controller pod logs on the Supervisor; the VirtualNetworkInterface object is created successfully, but waiting for it to be realized times out:
    $ kubectl logs <vmop-controller-pod> -n vmware-system-vmop
    #### HH:MM:SS       1 network_provider.go:595] vsphere "msg"="Successfully created VirtualNetworkInterface" "name"={"Namespace":"<namespace>","Name":"<cluster_name>-####-vnet-<cluster_name>-####-####"} "vmName"="<namespace>/<cluster_name>-####-####"
    #### HH:MM:SS       1 network_provider.go:688] vsphere "msg"="Failed to create vnetIf for vif" "error"="timed out waiting for the condition" "vif"={"networkType":"nsx-t","networkName":"<cluster_name>-####-####-vnet"} "vmName"="<namespace>/<cluster_name>-####-####"

  • When checking the VirtualNetworkInterface of the new VM, the status is blank and there is no interface-related information:
    $ kubectl describe virtualnetworkinterfaces.vmware.com <virtualnetworkinterface-name> -n <namespace>
    Status:
    Events:  <none>

  • The NSX-NCP pods on the Supervisor are in a crashloop state and have the below errors in their logs:
    $ kubectl logs <nsx-ncp-pod> -n vmware-system-nsx
    [ncp #### W] vmware_nsxlib.v3.cluster Failed to validate API cluster endpoint '[UP] https://<nsx-manager-vip>:443' due to: Unexpected error from backend manager ([]) for GET https://<nsx-manager-vip>:443/api/v1/reverse-proxy/node/health: {'healthy': False, 'components_health': 'SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP'}
    [ncp MainThread W] vmware_nsxlib.v3.cluster [####] Request failed due to: Some appliance components are not functioning properly. details: Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP
    [ncp MainThread W] vmware_nsxlib.v3.utils Finished retry of vmware_nsxlib.v3.cluster.ClusteredAPI._proxy.<locals>._proxy_internal for the 1st time after 0.049(s) with args: Unknown
    [ncp MainThread W] vmware_nsxlib.v3.utils Retrying call to 'vmware_nsxlib.v3.cluster.ClusteredAPI._proxy.<locals>._proxy_internal' for the 2nd time
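    The failing health check seen in the NCP logs can be reproduced directly against the NSX Manager VIP to confirm that the problem is on the NSX side rather than in NCP itself (a sketch; <admin-user> is a placeholder for an NSX Manager account, and curl prompts for its password):
    $ curl -k -u '<admin-user>' https://<nsx-manager-vip>:443/api/v1/reverse-proxy/node/health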

Environment

vCenter 8.x

vSphere Kubernetes Service

Cause

The newly provisioned VM does not get a network adapter or an IP address because the NSX-NCP pods are unable to reach the NSX Manager, so the VirtualNetworkInterface for the VM is never realized and the VM stays powered off.
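To see which NSX Manager services are degraded (matching the SEARCH:UNKNOWN and MANAGER:UNKNOWN component states reported in the NCP logs), the cluster status can be checked from an NSX Manager appliance console. A minimal sketch using the NSX CLI:

nsx-manager> get cluster status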

Resolution

Verify that the NSX Manager is reachable from the Supervisor and is fully operational. Once connectivity is restored, the NSX-NCP pods should stop crashlooping, the VirtualNetworkInterface for the new VM should be realized, and the stuck control plane rollout should proceed.
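After the NSX Manager is healthy again, a minimal verification pass on the Supervisor could look like the following; all object names reuse the placeholders from the symptoms above:

$ kubectl get pods -n vmware-system-nsx     # NCP pods should be Running and no longer restarting
$ kubectl describe virtualnetworkinterfaces.vmware.com <virtualnetworkinterface-name> -n <namespace>     # Status should now be populated
$ kubectl get vm <VM_name> -n <namespace>     # the new control plane VM should power on, and the upgrade should resume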