Using Virtual NUMA Controls for Optimizing Performance of Compute-intensive Workloads



Article ID: 323386


Updated On: 03-24-2025

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
  • Virtual machine applications experience poor performance despite having sufficient resources.
  • Guest operating system monitoring tools show unutilized CPU.
  • Performance remains consistently poor regardless of the compute and storage resources used.
  • Vendor-defined application best practices are used in the vSphere environment.
  • The virtual machine follows vNUMA rightsizing recommendations, but the workload is imbalanced across the cores on a NUMA node.


Environment

VMware vSphere ESXi 

Cause

This can happen when the guest application needs more compute capacity than a single NUMA node offers, but fails to distribute the load properly over the additional NUMA nodes.
 

To confirm whether this is the case, follow the steps below:

  1. SSH to the ESXi host.
  2. Get the common hardware info about the current system:
    • Run ~ # sched-stats -t ncpus; the output should be similar to:
      56 PCPUs
      28 cores
      2 packages
      2 NUMA nodes

  3. Calculate the number of cores on each NUMA node.
    1. Cores / NUMA nodes => cores per NUMA node (on this host also cores per socket, since there are 2 packages and 2 NUMA nodes).
    2. As the host has 28 cores across 2 NUMA nodes, each NUMA node has 14 cores.
    3. A VM configured with 14 vCPUs in this setup fits on 1 NUMA node, which should give better performance as long as the guest application doesn't need more resources.
    4. A VM configured with 16 vCPUs exceeds the number of cores on a single NUMA node, so the scheduler splits the vCPUs 8-8 and places them on 2 NUMA nodes; this gives better performance only if the guest application spreads the load over the respective vCPUs properly.
    5. From the guest perspective, this is a virtual topology change (14 vCPUs on one virtual NUMA node versus 16 vCPUs on two virtual NUMA nodes).

  4. Confirm that the load is imbalanced across the vCPUs by comparing the vCPU load statistics of each configuration with sched-stats -t vcpu-load | grep -i vmx-vcpu. The output will be similar to the examples below:


  5. With 14 vCPUs, all vCPUs are busy and heavily utilized:

    5675328 5675328 vmx-vcpu-0:TestVM 255 255 255 255 252 252 699310930 13321
    5675333 5675328 vmx-vcpu-1:TestVM 255 255 255 255 253 253 622497743 11732
    5675334 5675328 vmx-vcpu-2:TestVM 255 255 255 255 253 253 644270074 12987
    5675335 5675328 vmx-vcpu-3:TestVM 255 254 254 255 255 255 599090063 16543
    5675336 5675328 vmx-vcpu-4:TestVM 255 255 255 255 255 253 670941131 10552
    5675337 5675328 vmx-vcpu-5:TestVM 255 255 249 255 255 250 43496715 20556 
    5675338 5675328 vmx-vcpu-6:TestVM 252 255 254 255 253 255 546591310 11800
    5675339 5675328 vmx-vcpu-7:TestVM 255 255 254 255 255 252 576277152 11267
    5675340 5675328 vmx-vcpu-8:TestVM 255 255 254 255 252 255 509256849 17341
    5675341 5675328 vmx-vcpu-9:TestVM 255 255 254 255 255 252 484631494 17721
    5675342 5675328 vmx-vcpu-10:TestVM 255 254 249 255 253 249 34721997 18293 
    5675343 5675328 vmx-vcpu-11:TestVM 255 255 254 255 253 255 494079871 20189
    5675344 5675328 vmx-vcpu-12:TestVM 255 255 247 255 255 249 40676083 20240 
    5675345 5675328 vmx-vcpu-13:TestVM 255 255 255 255 255 255 681254271

    To confirm that the physical NUMA node was highly utilized, check the loadAvgPct column in the output of sched-stats -t numa-pnode:

    nodeID  used   idle   entitled  owed  loadAvgPct  nVcpu  freeMem    totalMem
    0       2770   25230  0         0     0           0      196318932  201161092
    1       27783  218    27009     0     96          14     181838524  201326592

    Node 1 shows a loadAvgPct of 96 with all 14 vCPUs homed on it, while node 0 is essentially idle, confirming that the 14-vCPU configuration fully loads a single physical NUMA node.

 
  6. With 16 vCPUs, the 8 vCPUs on the first NUMA node are quite busy, while the 8 vCPUs on the second NUMA node stay mostly idle (an optional cross-check follows the output below):


2121608 2121608 vmx-vcpu-0:TestVM 250 249 240 253 251 243 5435254 12552
2121613 2121608 vmx-vcpu-1:TestVM 251 249 237 253 251 238 5349609 14108
2121614 2121608 vmx-vcpu-2:TestVM 250 245 247 255 250 248 5879843 11768
2121615 2121608 vmx-vcpu-3:TestVM 249 244 237 250 246 238 3528458 13786
2121616 2121608 vmx-vcpu-4:TestVM 236 238 227 241 236 227 3348875 19971
2121617 2121608 vmx-vcpu-5:TestVM 233 236 235 237 238 237 3254107 28640
2121618 2121608 vmx-vcpu-6:TestVM 232 241 225 238 241 226 3583011 15953
2121619 2121608 vmx-vcpu-7:TestVM 245 247 242 245 245 245 4113962 20198
2121620 2121608 vmx-vcpu-8:TestVM 50 43 40 40 35 42 104658 78247
2121621 2121608 vmx-vcpu-9:TestVM 30 37 42 37 34 40 362357 91945
2121622 2121608 vmx-vcpu-10:TestVM 3 5 21 14 12 23 149275 116822
2121623 2121608 vmx-vcpu-11:TestVM 1 5 6 11 5 6 152050 853567
2121624 2121608 vmx-vcpu-12:TestVM 0 1 6 6 4 5 362251 2245403
2121625 2121608 vmx-vcpu-13:TestVM 5 0 3 11 2 5 422925 3537338
2121626 2121608 vmx-vcpu-14:TestVM 2 0 3 7 2 3 302333 2874573
2121627 2121608 vmx-vcpu-15:TestVM 1 1 2 12 4 5 817400 3382363
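
    Optional cross-check (a sketch; the exact column layout of this output can vary between ESXi builds): list the VM's NUMA clients to confirm how many physical NUMA nodes it spans, filtering on the VM's group ID from the vcpu-load output above (2121608 in this example):

    ~ # sched-stats -t numa-clients | grep 2121608

    With the 16-vCPU configuration, two NUMA clients with different home nodes are expected; with the 14-vCPU configuration, a single NUMA client on one home node is expected.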



Resolution

Engage the application vendor to investigate why the workload isn't being evenly distributed.


Workaround:

To work around this, configure the VM to present one virtual NUMA node with all 16 vCPUs, deliberately creating a virtual topology that does not match the underlying hardware.
 

As the VM isn't aware that its 16 vCPUs are running on 2 different physical NUMA nodes, it can place a workload on vCPUs that are physically far apart.
This is a trade-off that has to be made if the guest OS cannot distribute the workload evenly: the VM is optimized either for compute capacity (with the proposed setting) or for memory locality (without it).
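
If needed, the virtual topology that the guest actually sees can be checked from inside the guest itself, for example on a Linux guest (assuming the numactl package is installed; Windows guests can use a tool such as Sysinternals Coreinfo):

  numactl --hardware
  lscpu | grep -i numa

With the workaround applied, the guest should report a single NUMA node containing all 16 vCPUs.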

 

Another way of placing all vCPUs on the same NUMA node is to use hyper-threading with NUMA. For more information, see Configure virtual machines to use hyper-threading with NUMA in VMware ESXi.
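
As a sketch (see the linked article for the exact procedure and any version-specific caveats), that approach typically relies on the numa.vcpu.preferHT advanced setting:

  numa.vcpu.preferHT = "TRUE"

This lets the NUMA scheduler count logical (hyper-threaded) processors when sizing the NUMA client, so all 16 vCPUs can fit within the 28 logical processors of one node.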

 

Note: In this case, where the data shows that the workload is compute-intensive, sharing hyper-threads may not be optimal.


Steps:

  • Power off the VM.
  • Configure it with 16 vCPUs.
  • Set the following option in the VM's advanced settings: numa.vcpu.maxPerVirtualNode = "16" (a configuration sketch follows after these steps).
  • Power on the VM.
For more information, see Virtual NUMA Controls.
The VM will still run on 2 physical NUMA nodes, but they are presented to it as 1 virtual NUMA node.
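
As a sketch, the relevant entries in the VM's configuration (.vmx) file would then look similar to the following (other entries in the file are omitted here):

  numvcpus = "16"
  numa.vcpu.maxPerVirtualNode = "16"

The advanced setting can typically also be added from the vSphere Client via Edit Settings > VM Options > Advanced > Configuration Parameters.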