Using Virtual NUMA Controls for Optimizing Performance of Compute-intensive Workloads



Article ID: 323386


Updated On: 03-24-2025

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
  • Virtual machine applications experience poor performance despite having sufficient resources.
  • Guest operating system monitoring tools show unutilized CPU.
  • Performance remains consistently poor regardless of the compute and storage resources used.
  • Vendor-defined application best practices are used in the vSphere environment.
  • The virtual machine follows vNUMA rightsizing recommendations, but the workload is imbalanced across the cores on a NUMA node.


Environment

VMware vSphere ESXi 

Cause

This can happen when the guest application needs more compute capacity than a single NUMA node offers, but fails to distribute the load properly over the additional NUMA nodes.
 

To confirm whether this is the case, follow the steps below:

  1. SSH to the ESXi host.
  2. Get the common hardware info about the current system:
    • Run ~ # sched-stats -t ncpus; the output should be similar to:
      56 PCPUs
      28 cores
      2 packages
      2 NUMA nodes

  3. Calculate the number of cores on each NUMA node.
    1. Cores / NUMA nodes => cores per NUMA node (on this host also cores per socket, since there are 2 packages and 2 NUMA nodes).
    2. As the host has 28 cores across 2 NUMA nodes, each NUMA node has 14 cores.
    3. A VM configured with 14 vCPUs in this setup fits on 1 NUMA node, which should give better performance as long as the guest application doesn't need more resources.
    4. A VM configured with 16 vCPUs exceeds the number of cores on a single NUMA node, so the scheduler splits the vCPUs 8-8 and places them on 2 NUMA nodes; this gives better performance only if the guest application spreads the load over the respective vCPUs properly.
    5. From the guest perspective, this is a virtual topology change (14 vCPUs on one virtual NUMA node versus 16 vCPUs on two virtual NUMA nodes).

  4. Confirm that the load is imbalanced across the vCPUs by comparing the vCPU load statistics of each configuration with sched-stats -t vcpu-load | grep -i vmx-vcpu. The output will be similar to the examples below:


  5. With 14 vCPUs, all vCPUs are busy and heavily utilized:

    5675328 5675328 vmx-vcpu-0:TestVM 255 255 255 255 252 252 699310930 13321
    5675333 5675328 vmx-vcpu-1:TestVM 255 255 255 255 253 253 622497743 11732
    5675334 5675328 vmx-vcpu-2:TestVM 255 255 255 255 253 253 644270074 12987
    5675335 5675328 vmx-vcpu-3:TestVM 255 254 254 255 255 255 599090063 16543
    5675336 5675328 vmx-vcpu-4:TestVM 255 255 255 255 255 253 670941131 10552
    5675337 5675328 vmx-vcpu-5:TestVM 255 255 249 255 255 250 43496715 20556 
    5675338 5675328 vmx-vcpu-6:TestVM 252 255 254 255 253 255 546591310 11800
    5675339 5675328 vmx-vcpu-7:TestVM 255 255 254 255 255 252 576277152 11267
    5675340 5675328 vmx-vcpu-8:TestVM 255 255 254 255 252 255 509256849 17341
    5675341 5675328 vmx-vcpu-9:TestVM 255 255 254 255 255 252 484631494 17721
    5675342 5675328 vmx-vcpu-10:TestVM 255 254 249 255 253 249 34721997 18293 
    5675343 5675328 vmx-vcpu-11:TestVM 255 255 254 255 253 255 494079871 20189
    5675344 5675328 vmx-vcpu-12:TestVM 255 255 247 255 255 249 40676083 20240 
    5675345 5675328 vmx-vcpu-13:TestVM 255 255 255 255 255 255 681254271

    To confirm that the physical NUMA node was highly utilized, check the loadAvgPct column in the output of sched-stats -t numa-pnode:

    nodeID  used   idle   entitled  owed  loadAvgPct  nVcpu  freeMem    totalMem
    0       2770   25230  0         0     0           0      196318932  201161092
    1       27783  218    27009     0     96          14     181838524  201326592

    Node 1 shows a loadAvgPct of 96 with all 14 vCPUs homed on it, while node 0 is essentially idle, confirming that the 14-vCPU configuration fully loads a single physical NUMA node.

 
  6. With 16 vCPUs, the 8 vCPUs on the first NUMA node are quite busy, while the 8 vCPUs on the second NUMA node stay mostly idle (an optional cross-check follows the output below):


2121608 2121608 vmx-vcpu-0:TestVM 250 249 240 253 251 243 5435254 12552
2121613 2121608 vmx-vcpu-1:TestVM 251 249 237 253 251 238 5349609 14108
2121614 2121608 vmx-vcpu-2:TestVM 250 245 247 255 250 248 5879843 11768
2121615 2121608 vmx-vcpu-3:TestVM 249 244 237 250 246 238 3528458 13786
2121616 2121608 vmx-vcpu-4:TestVM 236 238 227 241 236 227 3348875 19971
2121617 2121608 vmx-vcpu-5:TestVM 233 236 235 237 238 237 3254107 28640
2121618 2121608 vmx-vcpu-6:TestVM 232 241 225 238 241 226 3583011 15953
2121619 2121608 vmx-vcpu-7:TestVM 245 247 242 245 245 245 4113962 20198
2121620 2121608 vmx-vcpu-8:TestVM 50 43 40 40 35 42 104658 78247
2121621 2121608 vmx-vcpu-9:TestVM 30 37 42 37 34 40 362357 91945
2121622 2121608 vmx-vcpu-10:TestVM 3 5 21 14 12 23 149275 116822
2121623 2121608 vmx-vcpu-11:TestVM 1 5 6 11 5 6 152050 853567
2121624 2121608 vmx-vcpu-12:TestVM 0 1 6 6 4 5 362251 2245403
2121625 2121608 vmx-vcpu-13:TestVM 5 0 3 11 2 5 422925 3537338
2121626 2121608 vmx-vcpu-14:TestVM 2 0 3 7 2 3 302333 2874573
2121627 2121608 vmx-vcpu-15:TestVM 1 1 2 12 4 5 817400 3382363
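
    Optional cross-check (a sketch; the exact column layout of this output can vary between ESXi builds): list the VM's NUMA clients to confirm how many physical NUMA nodes it spans, filtering on the VM's group ID from the vcpu-load output above (2121608 in this example):

    ~ # sched-stats -t numa-clients | grep 2121608

    With the 16-vCPU configuration, two NUMA clients with different home nodes are expected; with the 14-vCPU configuration, a single NUMA client on one home node is expected.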



Resolution

Engage the application vendor to investigate why the workload isn't being evenly distributed.


Workaround:

To work around this, configure the VM to present one virtual NUMA node with all 16 vCPUs, deliberately creating a virtual topology that does not match the underlying hardware.
 

As the VM isn't aware that its 16 vCPUs are running on 2 different physical NUMA nodes, it can place a workload on vCPUs that are physically far apart.
This is a trade-off that has to be made if the guest OS cannot distribute the workload evenly: the VM is optimized either for compute capacity (with the proposed setting) or for memory locality (without it).
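
If needed, the virtual topology that the guest actually sees can be checked from inside the guest itself, for example on a Linux guest (assuming the numactl package is installed; Windows guests can use a tool such as Sysinternals Coreinfo):

  numactl --hardware
  lscpu | grep -i numa

With the workaround applied, the guest should report a single NUMA node containing all 16 vCPUs.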

 

Another way of placing all vCPUs on the same NUMA node is to use hyper-threading with NUMA. For more information, see Configure virtual machines to use hyper-threading with NUMA in VMware ESXi.
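
As a sketch (see the linked article for the exact procedure and any version-specific caveats), that approach typically relies on the numa.vcpu.preferHT advanced setting:

  numa.vcpu.preferHT = "TRUE"

This lets the NUMA scheduler count logical (hyper-threaded) processors when sizing the NUMA client, so all 16 vCPUs can fit within the 28 logical processors of one node.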

 

Note: In this case, where the data shows that the workload is compute-intensive, sharing hyper-threads may not be optimal.


Steps:

  • Power off the VM.
  • Configure it with 16 vCPUs.
  • Set the following option in the VM's advanced settings: numa.vcpu.maxPerVirtualNode = "16" (a configuration sketch follows after these steps).
  • Power on the VM.
For more information, see Virtual NUMA Controls.
The VM will still run on 2 physical NUMA nodes, but they are presented to it as 1 virtual NUMA node.
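
As a sketch, the relevant entries in the VM's configuration (.vmx) file would then look similar to the following (other entries in the file are omitted here):

  numvcpus = "16"
  numa.vcpu.maxPerVirtualNode = "16"

The advanced setting can typically also be added from the vSphere Client via Edit Settings > VM Options > Advanced > Configuration Parameters.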