Guest cluster upgrade is stuck with VM in provisioning state

Article ID: 419449

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • When upgrading the guest cluster, the first replacement control plane VM gets stuck in the Provisioning stage.
  • When checking the summary of the newly created VM in the vSphere UI, it is in a powered-off state with the error 'there is no network adapter assigned'.
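    The power state is also visible from the Supervisor cluster. A minimal check (the exact columns printed for VirtualMachine objects vary by VM Operator version; the placeholders match those used below):
    $ kubectl get vm <VM_name> -n <namespace>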
  • Describing the cluster shows the information below:
    $ kubectl describe cluster <cluster_name> -n <namespace>
        Reason:                MachineDeploymentsUpgradePending
        Severity:              Info
        Status:                False
        Type:                  TopologyReconciled
    Events:
    Type    Reason          Age   From              Message
    ----    ------          ----  ----              -------
    Normal  TopologyCreate  51m   topology/cluster  Created "VSphereMachineTemplate/<cluster_name>-control-plane-####" as a replacement for "<cluster_name>-control-plane-####" (template rotation)
    Normal  TopologyUpdate  51m   topology/cluster  Updated "KubeadmControlPlane/<cluster_name>-####" with version change from v1.##.##+vmware.1-fips.1 to v1.##.##+vmware.1-fips.1

  • Describing the VM stuck in Provisioning shows the below warning:
    $ kubectl describe vm <VM_name> -n <namespace>
     Warning  CreateOrUpdateFailure  4m26s (x27 over 71m)  vmware-system-vmop/vmware-system-vmop-controller-manager-######-####/virtualmachine-controller  timed out waiting for the condition
  • The events in the KubeadmControlPlane (KCP) of the affected cluster show the below warning:
    $ kubectl describe kcp -n <namespace> | grep <cluster_name>
    Events:
      Type     Reason                 Age                    From                              Message
      ----     ------                 ----                   ----                              -------
      Warning  ControlPlaneUnhealthy  106s (x4 over 11m)  kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: [machine <cluster_name>-####-#### does not have APIServerPodHealthy condition, machine <cluster_name>-####-#### does not have ControllerManagerPodHealthy condition, machine <cluster_name>-####-#### does not have SchedulerPodHealthy condition, machine <cluster_name>-####-#### does not have EtcdPodHealthy condition, machine <cluster_name>-####-#### does not have EtcdMemberHealthy condition]
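    As a supplementary check (a sketch; the default printed columns depend on the Cluster API version), the machines of the cluster can be listed to confirm that the new control plane machine is stuck in the Provisioning phase:
    $ kubectl get machines -n <namespace> | grep <cluster_name>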

  • The below errors are present in the vmop-controller pod logs on the Supervisor; the VirtualNetworkInterface object is created successfully, but waiting for it to be realized times out:
    $ kubectl logs <vmop-controller-pod> -n vmware-system-vmop
    #### HH:MM:SS       1 network_provider.go:595] vsphere "msg"="Successfully created VirtualNetworkInterface" "name"={"Namespace":"<namespace>","Name":"<cluster_name>-####-vnet-<cluster_name>-####-####"} "vmName"="<namespace>/<cluster_name>-####-####"
    #### HH:MM:SS       1 network_provider.go:688] vsphere "msg"="Failed to create vnetIf for vif" "error"="timed out waiting for the condition" "vif"={"networkType":"nsx-t","networkName":"<cluster_name>-####-####-vnet"} "vmName"="<namespace>/<cluster_name>-####-####"

  • When checking the VirtualNetworkInterface of the new VM, the status is blank and there is no interface-related information:
    $ kubectl describe virtualnetworkinterfaces.vmware.com <virtualnetworkinterface-name> -n <namespace>
    Status:
    Events:  <none>

  • The NSX-NCP pods on the Supervisor are in a crashloop state and have the below errors in their logs:
    $ kubectl logs <nsx-ncp-pod> -n vmware-system-nsx
    [ncp #### W] vmware_nsxlib.v3.cluster Failed to validate API cluster endpoint '[UP] https://<nsx-manager-vip>:443' due to: Unexpected error from backend manager ([]) for GET https://<nsx-manager-vip>:443/api/v1/reverse-proxy/node/health: {'healthy': False, 'components_health': 'SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP'}
    [ncp MainThread W] vmware_nsxlib.v3.cluster [####] Request failed due to: Some appliance components are not functioning properly. details: Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP
    [ncp MainThread W] vmware_nsxlib.v3.utils Finished retry of vmware_nsxlib.v3.cluster.ClusteredAPI._proxy.<locals>._proxy_internal for the 1st time after 0.049(s) with args: Unknown
    [ncp MainThread W] vmware_nsxlib.v3.utils Retrying call to 'vmware_nsxlib.v3.cluster.ClusteredAPI._proxy.<locals>._proxy_internal' for the 2nd time
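    The failing health check seen in the NCP logs can be reproduced directly against the NSX Manager VIP to confirm that the problem is on the NSX side rather than in NCP itself (a sketch; <admin-user> is a placeholder for an NSX Manager account, and curl prompts for its password):
    $ curl -k -u '<admin-user>' https://<nsx-manager-vip>:443/api/v1/reverse-proxy/node/health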

Environment

vCenter 8.x

vSphere Kubernetes Service

Cause

The newly provisioned VM does not get a network adapter or an IP address because the NSX-NCP pods are unable to reach the NSX Manager, so the VirtualNetworkInterface for the VM is never realized and the VM stays powered off.
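To see which NSX Manager services are degraded (matching the SEARCH:UNKNOWN and MANAGER:UNKNOWN component states reported in the NCP logs), the cluster status can be checked from an NSX Manager appliance console. A minimal sketch using the NSX CLI:

nsx-manager> get cluster status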

Resolution

Verify that the NSX Manager is reachable from the Supervisor and is fully operational. Once connectivity is restored, the NSX-NCP pods should stop crashlooping, the VirtualNetworkInterface for the new VM should be realized, and the stuck control plane rollout should proceed.
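After the NSX Manager is healthy again, a minimal verification pass on the Supervisor could look like the following; all object names reuse the placeholders from the symptoms above:

$ kubectl get pods -n vmware-system-nsx     # NCP pods should be Running and no longer restarting
$ kubectl describe virtualnetworkinterfaces.vmware.com <virtualnetworkinterface-name> -n <namespace>     # Status should now be populated
$ kubectl get vm <VM_name> -n <namespace>     # the new control plane VM should power on, and the upgrade should resume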