ESXi 5.x and 6.0 disconnects from vCenter Server with the error: WorkHeap already at its maximum size

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

An ESXi 5.0, 5.1, 5.5, or 6.0 host disconnects from the vCenter Server.
The issue persists until you reboot the host.
Multiple Citrix virtual machines are PXE booting.
This issue occurs when multiple Citrix virtual machines have networking connected to a vSphere Distributed Switch (VDS) while booting.
When the Citrix virtual machines are booting, the ESXi 5.x or 6.0 host may fail with a purple diagnostic screen.
The vmkwarning.log file (located in /var/log/) contains entries similar to:

cpu4:1698394)WARNING: Heap: 2638: Heap WorkHeap already at its maximum size. Cannot expand.
cpu4:1698394)WARNING: Heap: 3019: Heap_Align(WorkHeap, 696/696 bytes, 64 align) failed. caller: 0x418027e46931
The vmkernel.log file (located in /var/log/) contains multiple disable/enable port statements:

cpu4:1698404)NetPort: 2747: resuming traffic on DV port 1444
cpu4:1698404)NetPort: 1380: enabled port 0x2000146 with mac xx:xx:xx:xx:xx:xx
cpu4:1698404)NetPort: 1574: disabled port 0x2000146
cpu13:16540)VmkEvent: 88: Msg to hostd failed with timeout, dropping function 2081 len 56
The vmkernel.log file (located in /var/log/) contains WorkHeap exhaustion messages similar to:

2014-01-20T06:04:37.670Z cpu5:345858)WARNING: Heap: 2638: Heap WorkHeap already at its maximum size. Cannot expand.
2014-01-20T06:04:37.670Z cpu5:345858)WARNING: Heap: 3019: Heap_Align(WorkHeap, 160/160 bytes, 8 align) failed. caller: 0x418000ef8570
2014-01-20T06:04:37.670Z cpu5:345858)WARNING: Heap: 2638: Heap WorkHeap already at its maximum size. Cannot expand.
2014-01-20T06:04:37.670Z cpu5:345858)WARNING: Heap: 3019: Heap_Align(WorkHeap, 160/160 bytes, 8 align) failed. caller: 0x418000ef8570
2014-01-20T06:04:37.670Z cpu5:345858)WARNING: Heap: 2638: Heap WorkHeap already at its maximum size. Cannot expand.
The hostd.log file (located in /var/log/) contains entries similar to:

2014-01-20T06:01:58.661Z [58365B90 info 'Vmsvc.vm:/vmfs/volumes/c690cfe2-f9d585b6/DATASTORE/TESTVM-10.vmx' opID=d24fadba-50] State Transition (VM_STATE_OFF -> VM_STATE_POWERING_ON)
2014-01-20T06:01:58.661Z [58C80B90 info 'Vmsvc.vm:/vmfs/volumes/c690cfe2-f9d585b6/DATASTORE/TESTVM-13.vmx' opID=163db60a-89] State Transition (VM_STATE_OFF -> VM_STATE_POWERING_ON)
2014-01-20T06:01:59.345Z [583A6B90 info 'Vmsvc.vm:/vmfs/volumes/c690cfe2-f9d585b6/DATASTORE/TESTVM-3.vmx' opID=5baa06c-82] State Transition (VM_STATE_OFF -> VM_STATE_POWERING_ON)
…
2014-01-20T06:02:13.975Z [58365B90 info 'Vmsvc.vm:/vmfs/volumes/c690cfe2-f9d585b6/DATASTORE/TESTVM-28.vmx' opID=86084156-7] State Transition (VM_STATE_OFF -> VM_STATE_POWERING_ON)
2014-01-20T06:02:16.615Z [583E7B90 info 'Vmsvc.vm:/vmfs/volumes/c690cfe2-f9d585b6/DATASTORE/TESTVM-12.vmx' opID=e3c574ea-8b] State Transition (VM_STATE_OFF -> VM_STATE_POWERING_ON)
2014-01-20T06:02:20.718Z [58324B90 info 'Vmsvc.vm:/vmfs/volumes/c690cfe2-f9d585b6/DATASTORE/TESTVM-42.vmx' opID=83d57bc3-b3] State Transition (VM_STATE_OFF -> VM_STATE_POWERING_ON)
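
The log signatures listed above can be checked with a couple of small shell helpers. This is a minimal sketch, not a VMware-supplied tool: the function names are hypothetical, and it assumes the log lines follow the formats shown in the excerpts above.

```shell
# Hypothetical helpers; names and patterns are assumptions based on the
# log excerpts in this article.

# Print the number of WorkHeap exhaustion warnings in a log file.
count_workheap_warnings() {
    grep -c 'Heap WorkHeap already at its maximum size' "$1"
}

# Print "count port-id" pairs for NetPort "disabled port" events,
# most frequently disabled port first.
count_port_disables() {
    grep 'NetPort: .*disabled port' "$1" \
        | sed 's/.*disabled port \(0x[0-9a-fA-F]*\).*/\1/' \
        | sort | uniq -c | sort -rn
}
```

For example, run count_workheap_warnings /var/log/vmkwarning.log and count_port_disables /var/log/vmkernel.log on the affected host. A high disable count concentrated on a few port IDs is consistent with the flapping described in this article, but is not proof of it on its own.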

Environment

VMware vSphere ESXi 6.0
VMware vSphere ESXi 5.5
VMware vSphere ESXi 5.1
VMware vSphere ESXi 5.0

Cause

This issue is caused by excessive heap allocation from virtual machines due to the virtual NIC flapping between enabled and disabled.

This occurs when an ESXi host is connected to a vSphere Distributed Switch (VDS), virtual machines are connected to its dvPorts, and the virtual NIC (vNIC) of a virtual machine generates several LINKUP/LINKDOWN events in very quick succession.

Note: This is abnormal, or "attacking", behavior from the vNIC of the virtual machine.

The frequent LINKUP/LINKDOWN events cause the VDS to send out an enormous number of VmkEventMsg alerts about the LINKUP/LINKDOWN events on the dvPort. These messages can consume all available free memory in the WorkHeap, and the ESXi host can fail with a purple diagnostic screen.

When a virtual machine that is configured to PXE boot and uses Citrix Provisioning Server is attached to a VDS, it causes many link up/link down messages to be generated in the logs. If multiple virtual machines do this at the same time, log messages can be generated faster than they can be written. This leads to heap exhaustion, deadlocks, and ultimately host failure.
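
To get a rough sense of how fast port enable/disable events arrive, the events can be grouped by the second in which they were logged. This is a sketch under assumptions: the function name is hypothetical, and it expects each line to start with an ISO timestamp as in the timestamped vmkernel.log excerpt above. Lines without a timestamp would be grouped on whatever their first field is.

```shell
# Hypothetical helper: list the busiest seconds for NetPort
# enable/disable events. $1 is the log file; $2 is the number of rows
# to show (defaults to 10).
port_events_per_second() {
    grep -E 'NetPort: .*(enabled|disabled) port' "$1" \
        | awk '{ print substr($1, 1, 19) }' \
        | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For example, port_events_per_second /var/log/vmkernel.log 5 prints the five seconds with the most events. This article does not define a threshold for what rate is abnormal, so treat the output as a relative indicator only.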

Resolution

This issue is resolved in ESXi 5.1 Update 2. You can download the latest version from the VMware downloads page. For more information, see the VMware ESXi 5.1 Update 2 Release Notes.

This issue is resolved in ESXi 5.5 Update 1. You can download the latest version from the VMware downloads page. For more information, see the VMware ESXi 5.5 Update 1 Release Notes.

Notes:

The fix only alleviates the cause of the issue (virtual NIC flapping between enabled and disabled).
DVS, hostd, and vmx events may fail to be posted to the vCenter Server through hostd due to the fix.
This issue can occur if you are using Citrix Provisioning Services software. For more information, see Citrix Knowledge Center article CTX135769, VMware 5.0 Host Machines might Experience a Purple Diagnostic Screen when running Provisioning Services 6.1 Targets.

To work around this issue:

Ensure the host NIC drivers are up-to-date. For more information, see the VMware Compatibility Guide.
Suspend the use of vSphere High Availability (HA) and vSphere Distributed Resource Scheduler (DRS) on the affected vSphere cluster.
Move all the virtual machines to a vSphere Standard Switch (VSS). For more information on increasing the number of virtual machines that can be provisioned, see Citrix Knowledge Center article CTX131993, vSphere 5 Support for Provisioning Server 5.6.x and 6.0.

Additional Information

To obtain the MAC address of the virtual machines causing this issue, run the command:

grep 'enabled port' vmkernel.log | awk '{print $9}' | sort | uniq -c
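
The command above relies on the MAC address being the ninth whitespace-separated field, which holds for timestamped lines in the format shown in this article. If the log lines might not carry a timestamp, a variant that matches on the text "with mac" avoids depending on field position. This is a sketch with a hypothetical function name, not a replacement for the command above:

```shell
# Hypothetical helper: count "enabled port" events per MAC address,
# most frequent first, regardless of how many fields precede the message.
macs_by_enable_count() {
    sed -n 's/.*enabled port [^ ]* with mac \([^ ]*\).*/\1/p' "$1" \
        | sort | uniq -c | sort -rn
}
```

For example: macs_by_enable_count /var/log/vmkernel.log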

Note: The links in this article were correct as of March 4, 2014. If you find a link is broken, provide feedback and a VMware employee will update the link.