Worker Nodes of existing CaaS Clusters Stuck in Ready,SchedulingDisabled
Article ID: 378017
Products
VMware Telco Cloud Automation
Issue/Introduction
Nodes become stuck in Ready,SchedulingDisabled after any operation is performed on the Cluster / Node Pool, or after a CNF LCM operation that could trigger DIP.
This can happen in TCA 3.1.1 environments that have been upgraded from TCA 2.3.
Workload Clusters are on version 1.24 (or higher, with photon3) and have not been edited after the upgrade to TCA 2.3. Only the Management Cluster operators have been updated.
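To confirm the Ready,SchedulingDisabled symptom on a Workload Cluster, the node list can be inspected programmatically. The following is a minimal sketch, not a TCA tool: it assumes the input is the JSON produced by `kubectl get nodes -o json`, and relies on the fact that kubectl shows SchedulingDisabled for nodes whose `spec.unschedulable` field is true. The node names in the sample are hypothetical.

```python
def stuck_nodes(node_list: dict) -> list[str]:
    """Return names of nodes that are Ready but have scheduling disabled."""
    stuck = []
    for node in node_list.get("items", []):
        # kubectl prints "SchedulingDisabled" when spec.unschedulable is true.
        unschedulable = node.get("spec", {}).get("unschedulable", False)
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if ready and unschedulable:
            stuck.append(node["metadata"]["name"])
    return stuck


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
SAMPLE_NODES = {
    "items": [
        {
            "metadata": {"name": "np1-worker-a"},
            "spec": {"unschedulable": True},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "np1-worker-b"},
            "spec": {},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
    ]
}

print(stuck_nodes(SAMPLE_NODES))  # prints ['np1-worker-a']
```

In practice the dictionary would be loaded with `json.load` from the saved kubectl output rather than from the inline sample.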
This means that the nodeconfig-operator is still on the corresponding TCA 2.3 version.
From here, any operation done on the internals of a Workload Cluster is eventually sent to the nodeconfig-operator, which applies it within the Workload Cluster.
For example, a Harbor credential update may have been performed and sent to the nodeconfig-operator.
As part of this, the operator syncs the tdnf repos, among other tasks, to ensure that everything is up to date.
With airgap 3.1.1, additional photon 5 repositories are also included, which increases the list of libraries available for tdnf to sync.
This larger listing makes the amount of memory allocated to the nodeconfig-operator (as deployed in TCA 2.3) insufficient.
Thus, the operator fails, leading to other potential issues.
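One way to check whether the operator is failing for lack of memory is to look for containers that Kubernetes terminated with the reason `OOMKilled`. The sketch below is illustrative only. It assumes the input is the JSON from `kubectl get pods -A -o json` taken on the Workload Cluster, that the operator pods contain the string `nodeconfig-operator` in their names, and that the failure surfaces as an OOM kill; the article does not state the exact failure mode. The pod and container names in the sample are hypothetical.

```python
def oom_killed_containers(pod_list: dict,
                          name_fragment: str = "nodeconfig-operator"):
    """Return (pod, container, restart_count) for OOM-killed containers."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "")
        if name_fragment not in pod_name:
            continue
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # A container may be terminated right now (state) or may have
            # been terminated before its latest restart (lastState).
            for key in ("state", "lastState"):
                terminated = (cs.get(key) or {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    hits.append(
                        (pod_name, cs.get("name", ""), cs.get("restartCount", 0))
                    )
                    break
    return hits


# Hypothetical sample shaped like `kubectl get pods -A -o json` output.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"name": "nodeconfig-operator-example"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "manager",
                        "restartCount": 7,
                        "state": {"running": {}},
                        "lastState": {"terminated": {"reason": "OOMKilled"}},
                    }
                ]
            },
        },
        {
            "metadata": {"name": "some-other-pod"},
            "status": {"containerStatuses": []},
        },
    ]
}

print(oom_killed_containers(SAMPLE_PODS))
# prints [('nodeconfig-operator-example', 'manager', 7)]
```

An empty result does not rule out the issue; it only means no OOM kill is recorded in the current or last container state.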
Resolution
This issue is resolved when the operators within the Clusters are upgraded to the TCA 3.1.1 compliant versions.
There are 2 possible solutions available:
Update to the latest TBR for K8s 1.24 which comes with TCA 3.1.1
This must be performed on each cluster and, possibly, on each node pool within the cluster.
Edit the Workload Cluster / Node Pool
The user needs to select the same K8s version with the new TBR version (from TCA 3.1.1) and ensure that the selected VM Template is the same as before.
This way, only the TCA operators are updated and nothing else in the Workload Cluster changes. No redeploy should occur.
The nodeconfig-operators that come with the updated TBR (TCA 3.1.1) already have a higher memory limit and should not exhibit the issues seen here.
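After the edit, the memory limit on the operator can be compared with the value seen before the change. The following sketch assumes the same `kubectl get pods -A -o json` input and the same `nodeconfig-operator` name match as an illustrative convention. It converts common Kubernetes memory quantity suffixes to bytes so that two limits can be compared numerically. The quantities in the sample are placeholders, not the actual TCA 2.3 or TCA 3.1.1 limits.

```python
# Common Kubernetes memory quantity suffixes (a subset of those supported).
_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def memory_bytes(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))  # plain byte count


def memory_limits(pod_list: dict, name_fragment: str = "nodeconfig-operator"):
    """Map (pod, container) to its memory limit in bytes, or None if unset."""
    limits = {}
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "")
        if name_fragment not in pod_name:
            continue
        for c in pod.get("spec", {}).get("containers", []):
            raw = c.get("resources", {}).get("limits", {}).get("memory")
            limits[(pod_name, c.get("name", ""))] = (
                memory_bytes(raw) if raw else None
            )
    return limits


# Hypothetical sample; "100Mi" is a placeholder, not a real TCA value.
SAMPLE_PODS_WITH_LIMITS = {
    "items": [
        {
            "metadata": {"name": "nodeconfig-operator-example"},
            "spec": {
                "containers": [
                    {
                        "name": "manager",
                        "resources": {"limits": {"memory": "100Mi"}},
                    }
                ]
            },
        }
    ]
}

print(memory_limits(SAMPLE_PODS_WITH_LIMITS))
# prints {('nodeconfig-operator-example', 'manager'): 104857600}
```

Running this once before and once after the TBR update gives two byte counts that can be compared directly; a higher value after the update is consistent with the behavior described above.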