Worker Nodes of existing CaaS Clusters Stuck in Ready,SchedulingDisabled

Article ID: 378017

Products

VMware Telco Cloud Automation

Issue/Introduction

Worker nodes remain stuck in the Ready,SchedulingDisabled state after any operation on the Cluster or Node Pool, or after a CNF lifecycle management (LCM) operation that triggers DIP.
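
To check whether a cluster currently shows this symptom, the node list can be filtered on its STATUS column. The following is a minimal sketch, not part of the official procedure: the helper name list_cordoned_nodes is hypothetical, and it assumes the default column layout of `kubectl get nodes` (NAME first, STATUS second).

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# contains "SchedulingDisabled".
# Expects the output of `kubectl get nodes --no-headers` on stdin,
# where column 1 is NAME and column 2 is STATUS.
list_cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage, from a host with kubectl access to the workload cluster:
#   kubectl get nodes --no-headers | list_cordoned_nodes
```

An empty result means no node is currently cordoned; any node name printed is in a SchedulingDisabled state.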

Environment

VMware Telco Cloud Automation 3.1.1 (upgraded from TCA 2.3 / 3.0)

Cause

This can occur in TCA 3.1.1 environments that were upgraded from TCA 2.3.
The Workload Clusters are on Kubernetes 1.24 (or higher, with Photon 3) and have not been edited since TCA 2.3. Only the Management Cluster operators have been updated.
- As a result, the nodeconfig-operator in these clusters is still on the version that corresponds to TCA 2.3.
Any operation on the internals of a Workload Cluster is eventually sent to the nodeconfig-operator, which applies it within the Workload Cluster.
For example, a Harbor credential update is sent to the nodeconfig-operator.
As part of applying the change, the operator syncs the tdnf repositories (among other things) to make sure they are up to date.
The 3.1.1 airgap also includes additional Photon 5 repositories, which increases the list of packages that tdnf has to sync.
With this larger list, the memory allocated to the nodeconfig-operator (as deployed by TCA 2.3) is no longer sufficient.
The operator therefore fails, which can lead to other issues.

Resolution

This issue is resolved once the operators within the Clusters are upgraded to the versions that are compliant with TCA 3.1.1.

There are two ways to achieve this:

Update to the latest TBR for K8s 1.24 which comes with TCA 3.1.1

This must be done for each cluster and, where applicable, for each node pool within the cluster.
Edit the Workload Cluster / Node Pool.
Select the same K8s version with the new TBR version (from TCA 3.1.1), and make sure that the selected VM Template is the same as before.
This way, only the TCA operators are updated and nothing else in the Workload Cluster changes. No nodes should be redeployed.
The nodeconfig-operator that comes with the updated TBR (TCA 3.1.1) already has a higher memory limit and should not exhibit the issue described here.

Manually update nodeconfig-operator daemon memory limit

Customers can manually raise the memory limit of the nodeconfig-operator daemon on the K8s Cluster.

The memory limit of the nodeconfig-daemon pod on each worker node needs to be increased to 2G.

SSH into a control plane node of the workload cluster as the capv user.

Increase the memory limit for the nodeconfig-daemon pod by running the following command:

curl -kfsSL 'https://vmwaresaas.jfrog.io/artifactory/cnf-generic-local/kb/20240911/enlarge_nc_ds_memory_limit.sh' | bash

The script output indicates whether the nodeconfig-daemon pods have been refreshed.

Sample Output:

Patch nodeconfig-operator addon successfully
Unpause pkgi/nodeconfig-operator successfully
Will check nodeconfig addon status after patch
Pkgi reconcile succeeded
All nodeconfig pods are running
Enlarge nodeconfig daemon memory successfully
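
After the script completes, the new limit can be cross-checked against the running workload. The sketch below is illustrative only: the helper names mem_to_bytes and limit_is_at_least_2g are hypothetical, the threshold assumes that "2G" means 2,000,000,000 bytes, and only integer Kubernetes quantities with the common suffixes (Ki, Mi, Gi, k, M, G) are handled.

```shell
# Hypothetical helper: convert an integer Kubernetes memory quantity
# (for example 512Mi, 2Gi, 2G) to bytes. Returns non-zero for anything
# it does not recognise (fractions, exponents, other suffixes).
mem_to_bytes() {
  qty=$1
  case "$qty" in
    *Ki) num=${qty%Ki}; mult=1024 ;;
    *Mi) num=${qty%Mi}; mult=$((1024 * 1024)) ;;
    *Gi) num=${qty%Gi}; mult=$((1024 * 1024 * 1024)) ;;
    *k)  num=${qty%k};  mult=1000 ;;
    *M)  num=${qty%M};  mult=1000000 ;;
    *G)  num=${qty%G};  mult=1000000000 ;;
    *)   num=$qty;      mult=1 ;;
  esac
  # Reject empty or non-numeric values before doing arithmetic.
  case "$num" in
    ''|*[!0-9]*) return 1 ;;
  esac
  echo $(( num * mult ))
}

# Hypothetical helper: succeed when the quantity is at least 2G,
# taken here as 2,000,000,000 bytes (an assumption about the KB's "2G").
limit_is_at_least_2g() {
  bytes=$(mem_to_bytes "$1") || return 1
  [ "$bytes" -ge 2000000000 ]
}

# Usage: list the memory limits of every DaemonSet and look for nodeconfig.
# The DaemonSet name and namespace are not given in this article, so the
# grep pattern is an assumption.
#   kubectl get daemonset -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.spec.template.spec.containers[*].resources.limits.memory}{"\n"}{end}' | grep nodeconfig
#   limit_is_at_least_2g 2Gi && echo "limit OK"
```

If the reported limit is still below 2G, the patch has likely not been reconciled yet, and the script output above should be reviewed again.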
