Troubleshooting Unbalanced CPU Thread Scheduling in K8s CNFs



Article ID: 434068


Updated On:

Products

VMware Telco Cloud Automation

Issue/Introduction

Cloud Native Network Functions (CNFs) may trigger CPU alarms indicating abnormal or unbalanced CPU usage. The primary symptom is a multi-threaded data plane process scheduling all of its threads (e.g., slow-path threads) onto a single CPU core, even though the Kubernetes pod has been allocated multiple dedicated cores.

Environment

2.x, 3.x

Cause

If the infrastructure is healthy, Tanzu Kubernetes Grid (TKG) uses the static CPU Manager policy to allocate exclusive physical cores to the pod via Linux cgroups. However, if the application's internal configuration (often defined in a Kubernetes ConfigMap) has hardcoded CPU assignments, assumes it has access to CPU 0, or fails to read the dynamically assigned cgroup CPU list, it will fail to distribute its threads across the allocated cores.
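A correctly written CNF discovers its cores at runtime instead of hardcoding them. From inside the container, the kernel-enforced affinity mask can be read directly; a minimal sketch:

```shell
# Inside the container: the CPU list the kernel actually allows this
# process to run on (enforced by the pod's cpuset cgroup).
grep -i Cpus_allowed_list /proc/self/status

# On cgroup v1 systems the pod's cpuset can also be read directly
# (path may vary by distribution and cgroup version):
# cat /sys/fs/cgroup/cpuset/cpuset.cpus
```

An application that derives its thread placement from this list, rather than from a static coremask, follows the cores Kubernetes assigns.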

Resolution

To determine if the issue is an infrastructure failure (VMware/K8s) or an application misconfiguration (Vendor CNF), perform the following checks:

Step 1: Verify VM NUMA and CPU Topology

Ensure the hypervisor is presenting the correct CPU topology to the worker node guest OS.

  1. SSH into the Kubernetes worker node.

  2. Run lscpu | grep -E "NUMA|Socket|Core" and numactl --hardware.

  3. Expected Result: The guest OS should see the expected number of cores and a single NUMA node (e.g., 1 Socket, 22 Cores).
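The checks above can be combined into one pass on the worker node. This is a sketch; `numactl` is optional tooling and may not be installed on every node image, so it is guarded:

```shell
# Report the CPU topology the guest OS sees on this worker node.
show_topology() {
  lscpu | grep -E "NUMA|Socket|Core"
  # numactl is not part of the base OS on all node images; skip if absent.
  if command -v numactl >/dev/null 2>&1; then
    numactl --hardware
  fi
}
show_topology
```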

Step 2: Verify Kubernetes Pod QoS and Resource Allocation

Ensure Kubernetes is configured to grant dedicated/exclusive CPUs to the pod.

  1. Run: kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 15 resources:

  2. Expected Result: The container must have requests.cpu equal to limits.cpu, and both must be integer values (e.g., cpu: "15"). This guarantees the pod is placed in the Guaranteed QoS class, which is required for Kubelet to assign exclusive physical cores.
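The "integer and equal" rule can be checked mechanically. The helper below is illustrative (the function name is ours, not a kubectl feature); it accepts only whole-core values, rejecting millicore and fractional values that would land the pod in the shared pool:

```shell
# A CPU value qualifies for exclusive cores only if it is a whole number
# of cores: "15" qualifies; "1.5" and "1500m" do not.
is_exclusive_cpu_value() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # rejects "1500m", "1.5", empty strings
    *)           return 0 ;;
  esac
}

is_exclusive_cpu_value "15"    && echo "integer: eligible for exclusive cores"
is_exclusive_cpu_value "1500m" || echo "millicores: shared CPU pool only"
```

The pod's actual QoS class can be confirmed with `kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}'`, which must print `Guaranteed`.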

Step 3: Verify OS/cgroup CPU Allowances (The Definitive Check)

Check exactly which physical CPU cores the Linux kernel is allowing the container to use.

  1. Exec into the affected pod: kubectl exec -it <pod-name> -n <namespace> -- bash

  2. Read the cgroup CPU allowed list directly from the proc filesystem: cat /proc/1/status | grep -i cpus_allowed_list (Note: if the main process is not PID 1, cat /proc/self/status works too; the exec'd shell runs in the same container cgroup, so it reports the same Cpus_allowed_list.)

  3. Expected Result: The output should show a range of CPUs matching the requested amount (e.g., Cpus_allowed_list: 2-16, which equals 15 dedicated cores).
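To compare the list against the requested core count without counting by hand, a small helper (the function name is ours) can expand a cpuset list string such as "2-16" or "0,2-4,7":

```shell
# Count the CPUs in a cpuset list string, e.g. "2-16" or "0,2-4,7".
count_allowed_cores() {
  total=0
  for part in $(printf '%s' "$1" | tr ',' ' '); do
    case "$part" in
      *-*) lo=${part%-*}; hi=${part#*-}; total=$(( total + hi - lo + 1 )) ;;
      *)   total=$(( total + 1 )) ;;
    esac
  done
  echo "$total"
}

# Feed it the value read in step 2, for example:
# allowed=$(grep -i cpus_allowed_list /proc/1/status | awk '{print $2}')
count_allowed_cores "2-16"   # prints 15
```

The result should equal the integer `requests.cpu` from Step 2.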

Remediation

If Step 3 returns a single core (e.g., Cpus_allowed_list: 4):

  • Fault: Infrastructure/Kubernetes.

  • Action: Investigate the Kubelet CPU Manager policy on the worker node. Ensure cpuManagerPolicy is set to static and the pod is correctly achieving Guaranteed QoS.
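The Kubelet check can be sketched as follows. The config path shown is the common default and may differ on some node images, hence the guard:

```shell
# On the worker node: confirm the CPU Manager policy in the kubelet config.
check_kubelet_cpu_policy() {
  cfg=/var/lib/kubelet/config.yaml   # default path; may vary by node image
  if [ -f "$cfg" ]; then
    # Expect: cpuManagerPolicy: static
    grep -i cpuManagerPolicy "$cfg" || echo "cpuManagerPolicy not set (defaults to 'none')"
  else
    echo "kubelet config not found at $cfg"
  fi
}
check_kubelet_cpu_policy
```

The Kubelet also records its exclusive assignments in /var/lib/kubelet/cpu_manager_state, which can be inspected to confirm the container received a dedicated CPU set.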

If Step 3 returns the full expected range (e.g., Cpus_allowed_list: 2-16):

The infrastructure is working perfectly. Please contact the application vendor. Provide them with the output of Step 3 as proof that the CPUs are successfully allocated to the container. Request that they fix their application's thread-pinning logic, DPDK coremask configurations, or slow-path thread assignments via their ConfigMaps.
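Supporting evidence for the vendor can include where each application thread actually runs. Field 39 of /proc/\<tid\>/stat is the CPU the thread last ran on; the helper below (the function name is ours) prints it per thread. If every thread reports the same cpu= value while Cpus_allowed_list spans many cores, the pinning logic is inside the application:

```shell
# Print each thread of a process and the CPU it last ran on.
# Field 39 of /proc/<tid>/stat is "processor"; the comm field may contain
# spaces, so strip everything through the closing ")" before counting fields.
thread_cpus() {
  for t in /proc/"$1"/task/*; do
    line=$(cat "$t/stat")
    rest=${line##*)}                 # fields 3..52 of the stat line
    set -- $rest
    echo "tid=${t##*/} cpu=${37}"    # overall field 39 = 37th after comm
  done
}

# Example: inspect the current shell; for the CNF, pass the data-plane PID.
# Healthy output shows cpu= values spread across the allocated cores.
thread_cpus $$
```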