Performance Degradation After Migration from Kubernetes 1.25 to 1.30 Due to Oversized Worker Nodes

Article ID: 410249

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After migrating the application from a Kubernetes 1.25 cluster to a 1.30 cluster, a significant slowdown was observed in the application’s processing queue. Tasks that previously completed within expected timeframes were delayed, causing downstream bottlenecks and reducing overall system responsiveness. The upgrade also changed the node pool architecture, moving from many smaller worker nodes (16 vCPUs × ~25 nodes) in 1.25 to fewer, oversized nodes (32 vCPUs × ~5 nodes) in 1.30 because of NSX load balancer limitations. Although the number of application replicas was unchanged, throughput degraded noticeably.
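
To confirm the node pool layout from the Kubernetes side, the node count and per-node vCPU capacity can be listed with kubectl. The command below uses only standard node fields and assumes no environment-specific labels:

    kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu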

How to Check in esxtop:

  1. SSH into the ESXi host.
  2. Run esxtop.
  3. Press c to enter the CPU view.
  4. Press the uppercase 'R' key while the CPU view is active to sort by %RDY (CPU Ready), which shows the VMs experiencing the most CPU ready time. A high %RDY value indicates the VM is waiting for physical CPU access, a sign of CPU scheduling contention on the host that should be investigated further.
  5. Also check the %CSTP (Co-Stop) column; non-zero values indicate the VM's vCPUs are being held back to keep them co-scheduled, a common symptom of oversized VMs. Inside the guest, the run queue (for example, the r column in vmstat) shows the number of threads waiting for CPU execution.
  6. Compare the oversized VMs to smaller VMs on the same host to verify the impact of the large vCPU count; a batch-mode capture example follows this list.
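
For offline review, esxtop can also capture statistics in batch mode. The example below samples every 5 seconds for 12 iterations; the delay, iteration count, and output filename are illustrative:

    esxtop -b -d 5 -n 12 > /tmp/esxtop-capture.csv

The resulting CSV can be opened in a spreadsheet or in Windows Performance Monitor to review %RDY and %CSTP over time.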

Cause

The performance regression was caused by oversized worker VMs. The 32 vCPU nodes introduced severe CPU scheduling contention at the ESXi layer, evidenced by CPU Ready times of ~54% and non-zero Co-Stop values. Wide VMs require the ESXi scheduler to make many physical cores available at roughly the same time to keep their vCPUs in sync, so on a busy host they spend more time waiting for CPU; this co-scheduling overhead reduced the hypervisor’s ability to allocate CPU cycles efficiently. Smaller nodes (14-19 vCPUs) on the same hosts did not exhibit this behavior, confirming node size as the root cause.
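
Note: esxtop reports CPU Ready as a percentage, while vCenter performance charts report it as a summation in milliseconds. To compare the two, the chart value can be converted with:

    CPU Ready % = (CPU Ready summation in ms / (chart interval in seconds x 1000)) x 100

For example, on a real-time chart with a 20-second interval, a summation of 1,000 ms corresponds to (1000 / 20000) x 100 = 5% CPU Ready.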

Resolution

To resolve the issue, create a new node pool that uses smaller VMs, ideally ≤16 vCPUs per worker node, and scale out horizontally rather than scaling up. After resizing, monitor CPU Ready times in vCenter/ESXi to ensure they drop below 5% (ideally <2%). This configuration alleviates CPU contention, improves pod scheduling, restores throughput, and resolves the observed application slowdown. VMware’s vSphere 8.0 Performance Best Practices guide recommends right-sizing VMs and avoiding oversized vCPU allocations.
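
To choose a smaller VM class for the new node pool, the classes available to the Supervisor namespace can be listed and compared by vCPU count. This is a sketch that assumes the vSphere Kubernetes Service VirtualMachineClass API is available; the field paths may differ by API version and should be verified in the environment:

    kubectl get virtualmachineclass -o custom-columns=NAME:.metadata.name,VCPUS:.spec.hardware.cpus,MEMORY:.spec.hardware.memory

Select a class with 16 or fewer vCPUs for the replacement node pool and increase the replica count to preserve total compute capacity.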

Additional Information