The nodeconfig-daemon pod fails to start and gets stuck in a pending state due to insufficient CPU resources in Telco Cloud Automation (TCA) 2.1.1.



Article ID: 314253


Products

VMware Telco Cloud Automation

Issue/Introduction

Remove the 100 MHz CPU reservation request for the nodeconfig-daemon pods.


Symptoms:

In Telco Cloud Automation (TCA) 2.1.1, to ensure that fundamental services have sufficient resources, TCA requests a reservation of 100 MHz of CPU for the nodeconfig-daemon pod, compared to 0 MHz in previous releases of TCA.
 

In some environments, most of the vCPUs on the workload nodes are isolated for Network Function pods, leaving too few resources available for reservations. This can result in some pods getting stuck in the Pending state due to “Insufficient CPU.”
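To confirm that this is the symptom you are seeing, you can check for pods stuck in Pending and inspect their scheduling events. The commands below are a general illustration; pod and namespace names will vary per environment:

```shell
# List pods stuck in the Pending phase across all namespaces:
kubectl get pods -A --field-selector=status.phase=Pending

# Inspect the scheduling events for a stuck nodeconfig-daemon pod
# (substitute the actual pod name and namespace from the output above):
kubectl describe pod <nodeconfig-daemon-pod> -n <namespace> | tail -n 20
```

A pod affected by this issue will show a FailedScheduling event citing "Insufficient cpu", as in the example under Cause below.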


Environment

VMware Telco Cloud Automation 2.1.1

Cause

Before placing a pod, the Kubernetes scheduler validates the resources the pod requests. If no node has enough free CPU for the pod, the scheduler reports an event and leaves the pod in the Pending state. Describing the pod will point to insufficient CPU resources as the cause.

Example:

Events:
  Type     Reason            Age  From               Message
  ----     ------            ---  ----               -------
  Warning  FailedScheduling  31s  default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match pod's node affinity/selector.
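To see how much CPU is already requested on a given worker node, and therefore why the scheduler cannot fit the reservation, you can describe the node. This is a general illustration; the node name is environment-specific:

```shell
# Show the node's allocatable capacity and the CPU/memory already
# requested by scheduled pods:
kubectl describe node <worker-node> | grep -A 8 "Allocated resources"
```

On nodes where most vCPUs are isolated for Network Function pods, the CPU requests shown here will be at or near the allocatable limit.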

 

Resolution

This issue will be resolved in TCA 3.0.


Workaround:

Remove the CPU request from the nodeconfig-daemon pods in the workload cluster:

1. SSH into a control plane node of the workload cluster as the capv user.

2. Remove the CPU request for the nodeconfig-daemon pod by running the following command: 

curl -kfsSL 'https://vmwaresaas.jfrog.io/artifactory/cnf-generic-local/kb/20230508/remove_nc_ds_cpu_request.sh' | bash 

3. The script output indicates whether the nodeconfig-daemon pods were refreshed successfully.

Example: 
Patch nodeconfig-operator addon successfully 
Unpause pkgi/nodeconfig-operator successfully 
Will check nodeconfig addon status after patch 
Pkgi reconcile succeeded 
All nodeconfig pods are running 
Nodeconfig daemon Request CPU reduced successfully 
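For reference only, the kind of change the script performs corresponds roughly to the sketch below. This is not the script's actual contents; the namespace and resource names here are assumptions:

```shell
# Illustrative only -- use the VMware-provided script above in practice.
# Namespace and resource names are assumptions for this example.
#
# Conceptually, removing the reservation is a JSON patch that deletes the
# CPU request from the DaemonSet's pod template:
kubectl -n tca-system patch daemonset nodeconfig-daemon --type json \
  -p '[{"op":"remove","path":"/spec/template/spec/containers/0/resources/requests/cpu"}]'
```

Note that because the DaemonSet is managed by the nodeconfig-operator addon, a direct patch like this would be reverted on the next package reconciliation. The script instead patches the addon itself, which is why its output shows the PackageInstall (pkgi) being unpaused and reconciled after the patch.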


Additional Information

Impact/Risks:

Impacts TCA 2.1.1.