Symptoms:
You might notice a seemingly random difference in performance when powering on virtual machines with at least one of these options:
- PCI passthrough devices (DirectPath IO or SR-IOV)
- Fault Tolerance (FT) enabled
- Latency Sensitivity set to High
- vGPU enabled
Affected virtual machines will have either 0% or very little NUMA local memory.
In esxtop's memory (m) view and with the fields (f) for NUMA STATS (g) enabled, you can confirm this by monitoring "N%L":
9:43:04am up 5 days 36 min, 538 worlds, 1 VMs, 4 vCPUs; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM /MB: 131026 total: 2160 vmk, 1185 other, 127680 free
VMKMEM/MB: 130640 managed: 1920 minfree, 7048 rsvd, 123592 ursvd, high state
NUMA /MB: 65488 (63845), 65536 (63450)
PSHARE/MB: 27 shared, 27 common: 0 saving
SWAP /MB: 0 curr, 0 rclmtgt: 0.00 r/s, 0.00 w/s
ZIP /MB: 0 zipped, 0 saved
MEMCTL/MB: 0 curr, 0 target, 0 max
GID NAME NHN NMIG NRMEM NLMEM N%L GST_ND0 OVD_ND0 GST_ND1 OVD_ND1
7656943 VMNAME 0 0 1024.00 0.00 0 0.00 12.09 1024.00 1.49
(...)
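To track "N%L" over a longer period instead of watching the interactive view, esxtop's batch mode can log the counters to a CSV file for later review in a spreadsheet or similar tool; the interval, iteration count and output path below are only examples:
# esxtop -b -d 5 -n 12 > /tmp/esxtop-numa.csv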
Alternatively, on the ESXi CLI, use "sched-stats". Here, look for the "currLocal%" and "cummLocal%" columns, the latter being the cumulative percentage of memory locality since power-on:
# sched-stats -t numa-clients
groupName groupID clientID homeNode affinity nWorlds vmmWorlds localMem remoteMem currLocal% cummLocal%
vm.1194167 7656943 0 0 0xf 6 6 0 33554432 0 0
(...)
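To narrow the output down to a single virtual machine, you can filter on its groupID; the ID and the grep pattern below (which also keeps the header line) are only examples:
# sched-stats -t numa-clients | grep -E "groupName|7656943"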
In most cases, you will see either no or very few NUMA migrations for the affected VMs (*Mig columns):
# sched-stats -t numa-migration
groupName groupID clientID balanceMig loadMig localityMig longTermMig monitorMig pageMigRate
vm.1194167 7656943 0 0 0 0 0 0 0
(...)
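If you want to confirm that the migration counters stay at zero over time, a simple shell loop is one possible approach; the groupID and interval are again only examples:
# while true; do sched-stats -t numa-migration | grep 7656943; sleep 10; done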
To map the groupID to the virtual machine's display name, either use esxtop (GID and NAME columns) or "esxcli vm process list" and match "VMX Cartel ID:" to "groupName" without the "vm." prefix.
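For example, the following (just one possible filter) lists the display name and VMX Cartel ID of every running virtual machine, which you can then match against the number after "vm." in the groupName column:
# esxcli vm process list | grep -E "Display Name|VMX Cartel ID"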