Upgrading a guest cluster from VKr 1.30 to VKr 1.31 does not complete successfully
Symptoms:
Logging in to the guest cluster control plane via SSH shows that only one control-plane node has been rolled out at the target version, while the remaining nodes are still on v1.30.1:
kubectl get nodes
NAME                                       STATUS   ROLES           AGE    VERSION
<guest cluster control-plane>-6zmp9        Ready    control-plane   278d   v1.30.1+vmware.1-fips
<guest cluster control-plane>-nqvbx        Ready    control-plane   278d   v1.30.1+vmware.1-fips
<guest cluster control-plane>-vdw47        Ready    control-plane   4h7m   v1.31.4+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-2lqsr   Ready    <none>          278d   v1.30.1+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-4qmwr   Ready    <none>          278d   v1.30.1+vmware.1-fips
<guest cluster worker>-b7cwr-s7mzr-5llkh   Ready    <none>          278d   v1.30.1+vmware.1-fips
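To map the nodes to their Cluster API Machine objects, the guest cluster's machines can be listed from the Supervisor cluster (a quick check, assuming kubectl is pointed at the Supervisor context and <namespace> is the guest cluster's vSphere Namespace):
kubectl get machines -n <namespace>
A machine that stays in the Provisioned phase instead of reaching Running indicates that its node never registered with the guest cluster API server.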
etcd shows that the new node was added:
etcdctl --cluster=true endpoint status -w table
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://############:2379 | 654cc55b4b7c20b | 3.5.12+vmware.6-fips | 508 MB | true | false | 2182 | ############ | ############ | |
| https://############:2379 | 118ff3d0ea4b3dba | 3.5.12+vmware.6-fips | 452 MB | false | false | 2182 | ############ | ############ | |
| https://############:2379 | 354280870f24ec98 | 3.5.16+vmware.10-fips | 508 MB | false | false | 2182 | ############ | ############ | |
+---------------------------+------------------+-----------------------+---------+-----------+------------+-----------+------------+--------------------+--------+
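For reference, the etcdctl command above is run on a guest cluster control-plane node; a complete invocation with explicit TLS flags could look like this (a sketch assuming the default kubeadm certificate paths, which may differ in your environment):
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cluster=true endpoint status -w table
The member running etcd 3.5.16 is the newly added control-plane node from the upgrade rollout.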
Describe the machine that is stuck in the Provisioned status:
kubectl describe machine <stuck machine> -n <namespace>
Message:
* NodeHealthy: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
* Control plane components: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
* EtcdMemberHealthy: Waiting for a Node with spec.providerID vsphere://4226caec-131e-d7d9-672d-be38762b9fc1 to exist
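The Machine's spec.providerID can be cross-checked against the registered guest cluster nodes to confirm that no Node object with that ID exists:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'
If the providerID from the Machine message is absent from the output, the kubelet on the new VM never joined the cluster, which matches the cloud-init error below.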
SSH into the machine that is stuck in the Provisioned status:
/var/log/cloud-init-output.log shows:
error execution phase control-plane-prepare/certs: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get "https://############:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
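The timeout indicates that kubeadm on the joining node cannot reach the guest cluster's load-balanced API server endpoint. Connectivity can be checked directly from the stuck VM, for example (the endpoint and port come from the error above; availability of curl in the node image is an assumption):
curl -vk --max-time 10 https://<control plane endpoint>:6443/healthz
A hang or timeout here points at the load balancer in front of the control plane rather than at the control-plane nodes themselves.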
NSX Manager
/var/log/proton/nsxapi.log shows:
WARN GmleClientBlockingOpsThread-1 Lease 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="lease" subcomp="manager"] Leadership lease size is 0 for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
WARN GmleClientBlockingOpsThread-3 Lease 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="lease" subcomp="manager"] Leadership lease size is 0 for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
WARN GmleClientBlockingOpsThread-3 GmleClientImpl 5305 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Leadership lease is not present for group 1f2dfa56-7dcf-3583-bedd-f7aec167fec7 and service POLICY_SVC_LOADBALANCER
Category: SMALL :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
Category: MEDIUM :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
Category: LARGE :: Total weight: 0, Max capacity: 0, Current occupied weight: 0, holds leadership for services: {}
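To confirm which NSX Manager is affected, the cluster group state can be inspected with get cluster status, a standard command in the NSX admin CLI, on each manager node:
get cluster status
The affected manager is the one that is UP and not in Maintenance Mode but, per the log excerpts above, holds leadership for no services.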
Environment:
VMware vSphere Kubernetes Service, NSX 4.2.1
Cause:
An NSX Manager has no leadership in the cluster for the MANAGER role, even though it is not in Maintenance Mode.
Resolution:
The issue is fixed in VMware NSX 9.0.
Workaround, per KB 408537 ("Objects in NSX UI are stuck In Progress"):
- Access the admin CLI of the affected NSX Manager (the one without any leadership of service for group MANAGER).
- Stop the proton service: service proton stop
  Note: Because this NSX Manager is not participating in the MANAGER cluster group, stopping the proton service has no impact.
- Wait 1 minute.
- Start the proton service: service proton start (see the condensed sketch below)
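As a condensed sketch, the full sequence run in one session on the affected NSX Manager (commands as given in the steps above; sleep 60 implements the one-minute wait):
service proton stop
sleep 60
service proton start
Afterwards, leadership should be re-acquired and the stuck guest cluster machine should progress out of Provisioned; from the Supervisor cluster this can be watched with kubectl get machines -n <namespace> -w.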