This occurs due to the architecture of the VMkernel NUMA scheduler and is dependent on the number of CPU cores per NUMA node. If a virtual machine has more vCPUs than there are cores in a NUMA node, its vCPUs are broken down into clients that are scheduled on multiple nodes.
For example, on a 48-core system (4-socket, 12-core physical CPU configurations are known to show this behavior) with 6 physical cores per NUMA node, an 8-vCPU virtual machine is split into two 4-vCPU clients that are scheduled on two different nodes. This is done so that even virtual machines wider than a single NUMA node gain some memory locality benefit.
The problem is that each 4-vCPU client is treated as a granular unit for NUMA management. When four 8-vCPU virtual machines are powered on, each of the 8 NUMA nodes holds one 4-vCPU client. When the fifth virtual machine is powered on, its 2 clients are scheduled on 2 NUMA nodes that already have 4-vCPU clients running on them. Those 2 NUMA nodes then have 8 vCPUs each on 6 cores, so they become overloaded and high ready time is experienced.
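The arithmetic above can be sketched as follows. This is a simplified model for illustration only: the greedy least-loaded placement, the node counts, and the power-on order are assumptions, not the actual VMkernel algorithm.

```python
# Illustrative model of the example host: 48 cores, 4 sockets x 12 cores,
# 6 cores per NUMA node -> 8 NUMA nodes. Not the real VMkernel logic.
CORES_PER_NODE = 6
NUM_NODES = 8

def split_into_clients(vcpus, cores_per_node):
    """Split a wide VM into equal-sized NUMA clients no larger than a node."""
    clients = 1
    while vcpus / clients > cores_per_node:
        clients += 1
    return [vcpus // clients] * clients

node_load = [0] * NUM_NODES  # vCPUs currently placed on each node

def power_on(vcpus):
    for client in split_into_clients(vcpus, CORES_PER_NODE):
        # Assumed placement policy: put each client on the least-loaded node.
        target = node_load.index(min(node_load))
        node_load[target] += client

for _ in range(4):   # four 8-vCPU VMs: eight 4-vCPU clients, one per node
    power_on(8)
print(node_load)     # [4, 4, 4, 4, 4, 4, 4, 4]

power_on(8)          # the fifth VM's two clients double up on two nodes
print(node_load)     # [8, 8, 4, 4, 4, 4, 4, 4]
overloaded = [n for n, load in enumerate(node_load) if load > CORES_PER_NODE]
print(overloaded)    # two nodes now run 8 vCPUs on 6 cores
```

The model reproduces the symptom: the fifth virtual machine leaves two nodes with more runnable vCPUs than physical cores, which is where the high ready time accrues.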
This is expected behavior based on the current architecture of the scheduler. However, these measures help limit the impact of the issue:
- Lower the number of vCPUs. In the example above, you would lower the number of vCPUs from 8 to 6 or fewer. Sizing virtual machines as a whole multiple or divisor of the NUMA node size increases the number of virtual machines that you can power on without contention. With 6-vCPU virtual machines, you can run up to 8 of those virtual machines (at 100% CPU utilization) without incurring substantial ready times.
Note: In a DRS cluster, the virtual machines need to be sized appropriately for the whole cluster, as the virtual machines can be migrated between the hosts in the cluster. To make the sizing easier, it is a good idea to have systems with the same NUMA characteristics (mainly the number of cores per NUMA node) in the cluster.
- Spread the large-vCPU virtual machines across the environment. The impact is reduced when there are fewer of them per host. The administrator should also monitor, or use DRS to ensure, that the distribution of large-vCPU virtual machines remains balanced.
- Size the virtual machines appropriately. If the application running in the virtual machine does not benefit from multiple vCPUs, do not configure them.
- Disable NUMA. Disabling NUMA resolves the CPU ready time issue, but should be used as a last resort, especially on servers with many NUMA nodes, because inter-node latencies may be significant. NUMA can be disabled by enabling Node Interleaving in the BIOS of the ESX host. High ready times are not seen in this configuration because the scheduler no longer takes NUMA locality into account, so there are no restrictions on where the virtual machines can run. When NUMA is enabled, the high ready times result from trying to schedule the virtual machines on their local node to avoid the performance hit of remote memory access.
Note: In internal performance evaluations, VMware has observed significant performance degradation when NUMA is disabled, particularly for memory-intensive applications running in the virtual machines.
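As a rough aid for the sizing guidance above, a check of whether a vCPU count is a whole divisor or multiple of the NUMA node size might look like this. The 6-cores-per-node, 8-node figures come from the example host, and `fits_cleanly` is a hypothetical helper for illustration, not a VMware tool:

```python
# Sizing sketch for the example host (adapt the figures to your hardware).
CORES_PER_NODE = 6
NUM_NODES = 8

def fits_cleanly(vcpus, cores_per_node=CORES_PER_NODE):
    """True if VMs of this size pack into NUMA nodes without a remainder."""
    if vcpus <= cores_per_node:
        return cores_per_node % vcpus == 0   # whole divisor of the node size
    return vcpus % cores_per_node == 0       # whole multiple of the node size

for vcpus in (2, 3, 4, 6, 8, 12):
    print(vcpus, fits_cleanly(vcpus))
# 2, 3, 6, and 12 pack cleanly; 4 and 8 strand cores on each node.
```

With 6-vCPU virtual machines, one client fits per node, so up to NUM_NODES (8) of them can run at full CPU utilization without vCPUs contending for cores.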