Guest cluster upgrade fails from VKr 1.30 to VKr 1.31



Article ID: 411285


Products

VMware vSphere Kubernetes Service

Issue/Introduction

Upgrading a guest cluster from VKr 1.30 to VKr 1.31 does not complete successfully.

Symptoms:

Logging in to the guest cluster control plane via SSH shows that only one control plane node was updated to the target version:

 

kubectl get nodes

<guest cluster control-plane>-6zmp9         Ready    control-plane   278d   v1.30.1+vmware.1-fips
<guest cluster control-plane>-nqvbx         Ready    control-plane   278d   v1.30.1+vmware.1-fips
<guest cluster control-plane>-vdw47         Ready    control-plane   4h7m   v1.31.4+vmware.1-fips

<guest cluster worker>-b7cwr-s7mzr-2lqsr   Ready    <none>          278d   v1.30.1+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-4qmwr   Ready    <none>          278d   v1.30.1+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-5llkh   Ready    <none>          278d   v1.30.1+vmware.1-fips

etcd shows that the new node was added:

etcdctl --cluster=true endpoint status -w table

+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        |        VERSION        | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://############:2379 |  654cc55b4b7c20b |  3.5.12+vmware.6-fips |  508 MB |      true |      false |      2182 | ############ |         ############ |        |
| https://############:2379 | 118ff3d0ea4b3dba |  3.5.12+vmware.6-fips |  452 MB |     false |      false |      2182 | ############ |         ############ |        |
| https://############:2379 | 354280870f24ec98 | 3.5.16+vmware.10-fips |  508 MB |     false |      false |      2182 | ############ |         ############ |        |
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
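To confirm that the new 3.5.16 member fully joined as a voting member rather than a learner, the member list can also be checked; a sketch, assuming the same etcdctl endpoint and certificate environment as the command above:

```shell
# List etcd members; the new node should appear as a started,
# non-learner member alongside the two existing 3.5.12 members.
etcdctl member list -w table
```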

Describing the machine that is stuck in the Provisioned status shows:

kubectl describe machine <stuck machine> -n <namespace>

 Message:              

* NodeHealthy: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
* Control plane components: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
* EtcdMemberHealthy: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
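The stuck machine can be identified from the Supervisor cluster before describing it; a sketch, assuming the vSphere Namespace of the guest cluster:

```shell
# Machines that never complete node bootstrap remain in the
# "Provisioned" phase instead of progressing to "Running".
kubectl get machines -n <namespace>
```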

 

SSH into the machine that is stuck in the Provisioned status.

/var/log/cloud-init-output.log shows:

error execution phase control-plane-prepare/certs: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get "https://############:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
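The timeout indicates that the new node cannot reach the kube-apiserver through the load balancer endpoint. From the stuck machine, reachability can be checked directly; a sketch, with the endpoint address written as a placeholder:

```shell
# Probe the kube-apiserver health endpoint through the load balancer
# VIP from the stuck node (-k skips certificate verification).
# <control-plane VIP> is a placeholder for the endpoint seen in the log.
curl -k https://<control-plane VIP>:6443/healthz
```

If this request also times out, the problem lies with the NSX load balancer path rather than with the node itself.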

 

On the NSX Manager:

/var/log/proton/nsxapi.log shows:

WARN GmleClientBlockingOpsThread-1 Lease 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="lease" subcomp="manager"] Leadership lease size is 0 for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
WARN GmleClientBlockingOpsThread-3 Lease 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="lease" subcomp="manager"] Leadership lease size is 0 for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
WARN GmleClientBlockingOpsThread-3 GmleClientImpl 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Leadership lease is not present for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER

Category: SMALL :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
Category: MEDIUM :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
Category: LARGE :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
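The NSX Manager cluster and service state can be checked from the admin CLI; a sketch using the standard NSX CLI status command:

```shell
# On any NSX Manager admin CLI, verify overall cluster health;
# all MANAGER group members should report as UP even though one
# of them holds no service leadership.
get cluster status
```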

Environment

VMware vSphere Kubernetes Service, NSX 4.2.1

Cause

An NSX Manager has no leadership in the cluster for the MANAGER role, even though it is not in Maintenance Mode.

Resolution

This issue is fixed in VMware NSX 9.0.

 

Workaround, per KB 408537, "Objects in NSX UI are stuck 'In Progress'":

-   Access the admin CLI of the affected NSX Manager (the one without any service leadership for the MANAGER group).

-   Stop the proton service: service proton stop

    Note: Because this NSX Manager is not participating in the MANAGER cluster group, stopping the proton service has no impact.

-   Wait 1 minute.

-   Start the proton service: service proton start
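Combined, the steps above amount to the following sequence on the affected NSX Manager's admin CLI:

```shell
# Restart the proton service on the NSX Manager that holds no
# MANAGER-group leadership; the pause lets the service stop cleanly.
service proton stop
sleep 60
service proton start
```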