Guest cluster upgrade fails from VKr 1.30 to VKr 1.31

Article ID: 411285


Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • Upgrading a guest cluster from VKr 1.30 to VKr 1.31 does not complete successfully.
  • Logging in to the guest cluster control plane via SSH shows that only one control plane node was updated to the target version:

kubectl get nodes

<guest cluster control-plane>-6zmp9         Ready    control-plane   278d   v1.30.1+vmware.1-fips
<guest cluster control-plane>-nqvbx         Ready    control-plane   278d   v1.30.1+vmware.1-fips
<guest cluster control-plane>-vdw47         Ready    control-plane   4h7m   v1.31.4+vmware.1-fips

<guest cluster worker>-b7cwr-s7mzr-2lqsr   Ready    <none>          278d   v1.30.1+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-4qmwr   Ready    <none>          278d   v1.30.1+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-5llkh   Ready    <none>          278d   v1.30.1+vmware.1-fips

  • The following command shows that the new node was added to the etcd cluster and is healthy and synchronized (see the note after the table for the certificate flags etcdctl typically requires):

etcdctl --cluster=true endpoint status -w table

+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        |        VERSION        | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://############:2379 |  654cc55b4b7c20b |  3.5.12+vmware.6-fips |  508 MB |      true |      false |      2182 | ############ |         ############ |        |
| https://############:2379 | 118ff3d0ea4b3dba |  3.5.12+vmware.6-fips |  452 MB |     false |      false |      2182 | ############ |         ############ |        |
| https://############:2379 | 354280870f24ec98 | 3.5.16+vmware.10-fips |  508 MB |     false |      false |      2182 | ############ |         ############ |        |
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
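If etcdctl is run directly on a guest cluster control plane node rather than inside the etcd pod, it typically needs the etcd client certificates. A minimal sketch, assuming the default kubeadm certificate paths (adjust them if your environment differs):

etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cluster=true endpoint status -w table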

  • Describing the machine that is stuck in the Provisioned status shows the following conditions (a providerID cross-check is sketched after the output):

kubectl describe machine <stuck machine> -n <namespace>

 Message:              

* NodeHealthy: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
* Control plane components: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
* EtcdMemberHealthy: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
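The spec.providerID in the message can be cross-checked against the nodes registered in the guest cluster; if no node reports that providerID, the new machine never joined the cluster. A minimal sketch using kubectl jsonpath (the output format is illustrative):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'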

  • SSHing into the machine that is stuck in the Provisioned status shows the following log lines (a connectivity check is sketched after the excerpt):

/var/log/cloud-init-output.log shows:

error execution phase control-plane-prepare/certs: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get "https://############:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
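The timeout indicates that the stuck node cannot reach the control plane endpoint behind the NSX load balancer. A minimal connectivity check from that node, assuming the endpoint address from the error message (placeholder shown):

curl -vk --max-time 10 https://<control plane endpoint>:6443/healthz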

  • On the NSX Manager, you should see the following log lines (a leadership check is sketched after the excerpt):

/var/log/proton/nsxapi.log shows:

WARN GmleClientBlockingOpsThread-1 Lease 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="lease" subcomp="manager"] Leadership lease size is 0 for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
WARN GmleClientBlockingOpsThread-3 Lease 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="lease" subcomp="manager"] Leadership lease size is 0 for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
WARN GmleClientBlockingOpsThread-3 GmleClientImpl 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Leadership lease is not present for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER

Category: SMALL :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
Category: MEDIUM :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
Category: LARGE :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
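To identify which NSX Manager is missing leadership, the lease warnings can be searched on each manager and the cluster group state reviewed from the admin CLI. A minimal sketch, assuming root shell access for the grep (exact output varies by NSX version):

grep "Leadership lease" /var/log/proton/nsxapi.log
get cluster status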

Environment

VMware vSphere Kubernetes Service, NSX 4.2.1

Cause

An NSX Manager has no leadership in the cluster for the MANAGER role, even though it is not in Maintenance Mode.

Resolution

This issue is fixed in VMware NSX 9.0.


Workaround, per KB 408537 "Objects in NSX UI are stuck 'In Progress'" (a command sketch follows the steps):

-   Access the admin CLI of the affected NSX Manager (the one holding no service leadership for the group MANAGER).

-   Stop the proton service: service proton stop

    Note: Because this NSX Manager is not participating in the MANAGER cluster group, stopping the proton service has no impact.

-   Wait 1 minute.

-   Start the proton service: service proton start
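A minimal sketch of the sequence above, run from the shell of the affected NSX Manager only (sleep stands in for the one-minute wait):

service proton stop
sleep 60
service proton start

Once the manager rejoins the MANAGER group, the Leadership lease warnings in /var/log/proton/nsxapi.log should stop appearing.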

 



Additional Information

Rolling reboots of the NSX Managers also resolve the issue.
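A hedged sketch of that approach: reboot one NSX Manager at a time from its admin CLI and wait for the cluster to stabilize before moving to the next, for example by checking the group state:

reboot
get cluster status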