Upgrading a guest cluster from VKr 1.30 to VKr 1.31 does not complete successfully
Symptoms:
Logging in to the guest cluster control plane via SSH shows that only one control-plane node has been rolled out at the target version, while the remaining nodes are still on v1.30.1:
kubectl get nodes
NAME                                       STATUS   ROLES           AGE    VERSION
<guest cluster control-plane>-6zmp9        Ready    control-plane   278d   v1.30.1+vmware.1-fips
<guest cluster control-plane>-nqvbx        Ready    control-plane   278d   v1.30.1+vmware.1-fips
<guest cluster control-plane>-vdw47        Ready    control-plane   4h7m   v1.31.4+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-2lqsr   Ready    <none>          278d   v1.30.1+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-4qmwr   Ready    <none>          278d   v1.30.1+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-5llkh   Ready    <none>          278d   v1.30.1+vmware.1-fips
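To map the nodes to their Cluster API Machine objects, the guest cluster's machines can be listed from the Supervisor cluster (a quick check, assuming kubectl is pointed at the Supervisor context and <namespace> is the guest cluster's vSphere Namespace):
kubectl get machines -n <namespace>
A machine that stays in the Provisioned phase instead of reaching Running indicates that its node never registered with the guest cluster API server.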
etcd shows that the new node was added:
etcdctl --cluster=true endpoint status -w table
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://############:2379 | 654cc55b4b7c20b | 3.5.12+vmware.6-fips | 508 MB | true | false | 2182 | ############ | ############ | |
| https://############:2379 | 118ff3d0ea4b3dba | 3.5.12+vmware.6-fips | 452 MB | false | false | 2182 | ############ | ############ | |
| https://############:2379 | 354280870f24ec98 | 3.5.16+vmware.10-fips | 508 MB | false | false | 2182 | ############ | ############ | |
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
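For reference, the etcdctl command above is run on a guest cluster control-plane node; a complete invocation with explicit TLS flags could look like this (a sketch assuming the default kubeadm certificate paths, which may differ in your environment):
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cluster=true endpoint status -w table
The member running etcd 3.5.16 is the newly added control-plane node from the upgrade rollout.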
Describe the machine that is stuck in the Provisioned status:
kubectl describe machine <stuck machine> -n <namespace>
Message:
* NodeHealthy: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
* Control plane components: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
* EtcdMemberHealthy: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
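The Machine's spec.providerID can be cross-checked against the registered guest cluster nodes to confirm that no Node object with that ID exists:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'
If the providerID from the Machine message is absent from the output, the kubelet on the new VM never joined the cluster, which matches the cloud-init error below.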
SSH into the machine that is stuck in the Provisioned status:
/var/log/cloud-init-output.log shows:
error execution phase control-plane-prepare/certs: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get "https://############:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
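The timeout indicates that kubeadm on the joining node cannot reach the guest cluster's load-balanced API server endpoint. Connectivity can be checked directly from the stuck VM, for example (the endpoint and port come from the error above; availability of curl in the node image is an assumption):
curl -vk --max-time 10 https://<control plane endpoint>:6443/healthz
A hang or timeout here points at the load balancer in front of the control plane rather than at the control-plane nodes themselves.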
NSX Manager
/var/log/proton/nsxapi.log shows:
WARN GmleClientBlockingOpsThread-1 Lease 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="lease" subcomp="manager"] Leadership lease size is 0 for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
WARN GmleClientBlockingOpsThread-3 Lease 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="lease" subcomp="manager"] Leadership lease size is 0 for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
WARN GmleClientBlockingOpsThread-3 GmleClientImpl 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Leadership lease is not present for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
Category: SMALL :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
Category: MEDIUM :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
Category: LARGE :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
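To confirm which NSX Manager is affected, the cluster group state can be inspected with get cluster status, a standard command in the NSX admin CLI, on each manager node:
get cluster status
The affected manager is the one that is UP and not in Maintenance Mode but, per the log excerpts above, holds leadership for no services.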
Environment:
VMware vSphere Kubernetes Service, NSX 4.2.1
Cause:
An NSX Manager has no leadership in the cluster for the MANAGER role, even though it is not in Maintenance Mode.
Resolution:
The issue is fixed in VMware NSX 9.0.
Workaround, per KB 408537 ("Objects in NSX UI are stuck In Progress"):
- Access the admin CLI of the affected NSX Manager (the one without any leadership of service for group MANAGER).
- Stop the proton service: service proton stop
  Note: Because this NSX Manager is not participating in the MANAGER cluster group, stopping the proton service has no impact.
- Wait 1 minute.
- Start the proton service: service proton start (see the condensed sketch below)
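As a condensed sketch, the full sequence run in one session on the affected NSX Manager (commands as given in the steps above; sleep 60 implements the one-minute wait):
service proton stop
sleep 60
service proton start
Afterwards, leadership should be re-acquired and the stuck guest cluster machine should progress out of Provisioned; from the Supervisor cluster this can be watched with kubectl get machines -n <namespace> -w.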