PSOD - PF Exception 14 in world NetWorld-VM- IP 0xXXXXXXXXXXX addr 0xXXX


Article ID: 417157


Products

VMware vSphere ESXi

Issue/Introduction

  • Host crashes with a PSOD and a backtrace similar to the following:

Panic Message: @BlueScreen: #PF Exception 14 in world 19435391:NetWorld-VM- IP 0x4200188bcaed addr 0x270
Backtrace:
0x453ab139bee0:[0x4200188bcaed]Vmxnet3VMKDev_AsyncTx@vmkernel#nover+0x1d stack: 0x430215a1b800, 0x42001873b1ee, 0x431582248148, 0x21218872707, 0x1
0x453ab139bf50:[0x420018926ba5]NetWorldPerVMCB@vmkernel#nover+0x1aa stack: 0x430215985588, 0x700000000, 0x0, 0x0, 0x0
0x453ab139bfe0:[0x420018cd67b2]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0, 0x420018744cf0, 0x0, 0x0, 0x0
0x453ab139c000:[0x420018744cef]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0

  • The host may be in a cluster where many of the VMs have full memory reservations. In addition, complex anti-affinity rules configured on the cluster may restrict how DRS can balance the workloads.

  • There are many "Admission failure" messages referencing vmxnet3Fastslab in /var/run/log/vmkernel.log on the host, which indicate that the host is out of reserved memory:

    In(182) vmkernel: cpu85:19435559)Admission failure in path: host:system:net:vmxnet3Fastslab-0x430340904fc0
    In(182) vmkernel: cpu85:19435559)vmxnet3Fastslab-0x430340904fc0 (162) requires 2048 KB, asked 2048 KB from host (0) which has 1609940220 KB occupied and 0 KB available.
    In(182) vmkernel: cpu85:19435559)Admission failure in path: host:system:net:vmxnet3Fastslab-0x430340904fc0
    In(182) vmkernel: cpu85:19435559)vmxnet3Fastslab-0x430340904fc0 (162) requires 2048 KB, asked 2048 KB from host (0) which has 1609940220 KB occupied and 0 KB available.
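As a quick check, the admission failures can be tallied directly from the vmkernel log. The following is a minimal sketch for the ESXi shell; it assumes the live log at /var/run/log/vmkernel.log (for a host that has already rebooted, check the rotated logs or the copy in a support bundle instead):

```shell
# Count "Admission failure" events per memory-scheduler path in the vmkernel log.
LOG=/var/run/log/vmkernel.log
grep -o 'Admission failure in path: [^ ]*' "$LOG" | sort | uniq -c | sort -rn
```

A large, repeating count for a single path points at the reserved-memory exhaustion described in this article.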

Environment

VMware vSphere ESXi

Cause

  • At the time of the PSOD there is not enough free memory on the host for vmkernel processes to reserve the memory they need to complete their tasks.
  • The host may be in this state because it is running many VMs with full memory reservations. In addition, complex anti-affinity rules configured on the cluster may restrict how DRS can balance the workloads. DRS is therefore limited in how it can distribute VMs throughout the cluster, which leads to skewed VM loads per host.
  • Because of this low-memory state, a vMotion of a VM fails at the point of migrating the VM's DVFilter, which leaves stale objects behind in memory.
  • After the vMotion fails, the VM is unstunned on its current host and its NICs are activated again. When the kernel activates the NICs, it references the stale entries in memory, which triggers the PSOD.
  • The sequence of events as seen in /var/run/log/vmkernel.log is similar to the following:
    • The VM is suspended and stunned successfully during the vMotion:

      In(05) vcpu-0 - MigrateSetState: Transitioning from state MIGRATE_TO_VMX_PRECOPY (3) to MIGRATE_TO_VMX_CHECKPT (4).
      In(05) vcpu-0 - Migrate: Preparing to suspend.
      In(05) vcpu-0 - Migrate: VM starting stun, waiting 100 seconds for go/no-go message.

      ...
      In(05) vcpu-0 - Migrate: VM successfully stunned.

    • As part of quiescing the VM state the port is disabled, but this fails due to the low-memory state on the host.
    • The vMotion fails because the DVFilter cannot be migrated; the filter migration fails because memory cannot be allocated for the task:

      In(05) vmx - Migrate: Caching migration error message list:
      In(05) vmx - [msg.migrate.fail.source.afterPrecopy] Migration failed after VM memory precopy. Please check vmkernel log for true error.
      In(05) vmx - [vob.heap.grow.failed] Heap dvfilterVMotion could not be grown by 1757184 bytes for allocation of 1757184 bytes
      In(05) vmx - Migrate: Attempting to continue running on the source.
      In(05) vmx - Checkpoint_Unstun: vm stopped for 172309 us

    • The VM is unstunned again due to the failure to migrate the DVFilter. As part of this, the vNIC is activated again, but the activation fails:

      In(05) vcpu-0 - VMXNET3 user: failed to activate 'EthernetX', status: 0xbad0005

    • The vNIC fails to activate because of stale entries in memory that were not cleaned up. Referencing these stale entries triggers the PSOD on the host:

      In(182) vmkernel: cpu85:19435559)Vmxnet3: 15255: Using default queue delivery for vmxnet3 for port <PORT ID>
      In(182) vmkernel: cpu85:19435559)NetEvent: 1264: failed to subscribe callback 0x4200188aa110 to netEvt port.vswitch.port.quiesce on chain port-4000147 

    • The stale entries would normally have been cleaned up when the port was disabled during the vMotion, but the port was not disabled correctly due to the low-memory state on the host.
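On a host that has already crashed, this sequence can be confirmed by searching the log quoted above for the three failure messages. A minimal sketch for the ESXi shell, assuming the default log location:

```shell
# Look for the signature of this issue: the DVFilter heap grow failure,
# the vNIC activation failure, and the failed netEvt subscription.
LOG=/var/run/log/vmkernel.log
grep -E 'Heap dvfilterVMotion could not be grown|failed to activate .Ethernet|failed to subscribe callback' "$LOG" \
  || echo "No matching signature found in $LOG"
```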

Resolution

  • One of the following steps, or a combination of them, should make it far less likely that the reserved-memory exhaustion condition is hit again:
    • Add one or more additional hosts to the cluster.
    • Review the configured affinity/anti-affinity rules to see if they could be relaxed.
    • Reduce the number of VMs in the cluster or reduce their memory reservations where possible.
    • Consider using the parameter MaxMemMBHeadroomPerHost to leave headroom in reserved memory for ESXi kernel processes that may need to allocate memory to complete a task. For details on this parameter, see KB 320796.
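If the headroom parameter is used, it can be viewed and set from the ESXi shell with esxcli. The sketch below is illustrative only: the advanced-option path /Mem/MaxMemMBHeadroomPerHost and the example value are assumptions based on the parameter name, so confirm the exact option name and a suitable value in KB 320796 before changing anything:

```shell
# Show the current value of the headroom option.
# NOTE: the option path and the value below are assumptions taken from the
# parameter name in this article; verify both against KB 320796.
esxcli system settings advanced list -o /Mem/MaxMemMBHeadroomPerHost

# Set an illustrative 2048 MB of reserved-memory headroom.
esxcli system settings advanced set -o /Mem/MaxMemMBHeadroomPerHost -i 2048
```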