North-South Connectivity Loss During Leaf Switch Upgrade due to icen Driver Heap Exhaustion

Article ID: 435656

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Tenant Virtual Machines (VMs) may experience a loss of North-South network connectivity during physical leaf switch maintenance or upgrades. While East-West connectivity typically remains functional, traffic destined for external networks is dropped.

This issue is specifically observed on ESXi hosts that use the icen driver for Intel NICs. During a switch upgrade, the driver fails to update the link status of the physical NIC (vmnic), so the host continues to send traffic out of a vmnic whose upstream leaf switch is unavailable. The issue has been seen with icen driver 1.14.2.0 on firmware 4.50, but may not be limited to these versions.

Symptoms:

  • Loss of connectivity to VMs from outside the NSX/VCF environment.
  • Connectivity to those VMs is restored once the leaf switch upgrade has completed.
  • Affected hosts do not report "link down" or "link up" notifications in the vmkernel logs at the expected times.
  • The following error messages are present in /var/log/vmkernel.log:

    WARNING: Heap: 3645: Heap pfHeap-icen already at its maximum size. Cannot expand
    WARNING: icen: icen_GetLinkStatus:1268: XXXX:XX:00.0: Failed to get link state - status: ICE_ERR_NO_MEMORY
    WARNING: icen: icen_CleanControlQ:6817: XXXX:XX:00.0: Failed to allocate heap for the Admin queue event
    WARNING: icen: icen_LinkEvent:2626: XXXX:XX:00.0: Failed to get LLDP status from firmware, Status: Out of memory
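
To check a host for these messages, the log can be copied off the host and scanned for the four signatures listed above. The following is a minimal Python sketch under that assumption; the signature names (heap_at_max and so on) and the helper functions are hypothetical labels chosen for this example, and the patterns are derived only from the sample messages in this article.

```python
import re
from collections import Counter

# Patterns derived from the vmkernel.log messages quoted in this article.
# The dictionary keys are hypothetical labels, not names used by ESXi.
SIGNATURES = {
    "heap_at_max": re.compile(r"Heap pfHeap-icen already at its maximum size"),
    "link_state_failed": re.compile(r"icen_GetLinkStatus:.*ICE_ERR_NO_MEMORY"),
    "adminq_alloc_failed": re.compile(r"icen_CleanControlQ:.*Failed to allocate heap"),
    "lldp_status_failed": re.compile(r"icen_LinkEvent:.*Out of memory"),
}


def count_signatures(lines):
    """Count how many lines match each heap-exhaustion signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log(path):
    """Scan a local copy of vmkernel.log (the path is supplied by the caller)."""
    with open(path, "r", errors="replace") as handle:
        return count_signatures(handle)
```

For example, scan_log("vmkernel.log") on a copied log returns a counter keyed by signature name. Any non-zero count suggests the host matches the symptoms described here, but it is not a substitute for reviewing the log itself.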

Environment

VMware ESXi

Cause

The root cause is a memory leak within the icen driver's private heap (pfHeap-icen).

When the driver reaches its maximum heap size, it can no longer allocate memory for the Admin Queue or handle Link Events. Consequently, if a physical switch goes down (e.g., for an upgrade), the driver cannot process the interrupt or status change. Because the driver never "sees" the link go down, it does not notify the ESXi stack. The host keeps VM traffic pinned to the vmnic connected to the unavailable leaf switch and does not fail over to the secondary physical adapter.
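
The failure chain can be illustrated with a toy model. The Python sketch below is not how ESXi teaming or the icen driver is implemented; the Uplink class, the function names, and the adapter names vmnic0 and vmnic1 are all assumptions made for illustration. It shows only the dependency described above: the host selects an uplink based on the state the driver last reported, so a driver that cannot refresh that state leaves traffic on a dead link.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str
    physically_up: bool        # real state of the link to the leaf switch
    reported_up: bool = True   # state the driver last reported to the host


def driver_poll(uplink, heap_exhausted):
    """Refresh the reported state; with an exhausted heap nothing is updated."""
    if not heap_exhausted:
        uplink.reported_up = uplink.physically_up


def select_uplink(uplinks):
    """Pick the first uplink the host believes is up (simplified policy)."""
    for uplink in uplinks:
        if uplink.reported_up:
            return uplink
    return None


def simulate(heap_exhausted):
    """Take the active uplink's switch down and return (chosen name, really up?)."""
    active = Uplink("vmnic0", physically_up=True)
    standby = Uplink("vmnic1", physically_up=True)
    active.physically_up = False  # leaf switch goes down for its upgrade
    for uplink in (active, standby):
        driver_poll(uplink, heap_exhausted)
    chosen = select_uplink([active, standby])
    return chosen.name, chosen.physically_up
```

In this model, simulate(False) fails over to vmnic1, while simulate(True) stays on vmnic0 even though its link is physically down, which corresponds to the North-South traffic loss described in this article.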

Resolution

To resolve this issue, you must address the driver-level memory management failure:

  1. Update the icen Driver: Check the VMware Compatibility Guide and your hardware vendor's support portal for the latest version of the icen driver that includes fixes for heap memory management and leaks.
  2. Hardware Vendor Engagement: If you are already on the latest recommended version, engage your hardware vendor (e.g., Dell, HPE, Intel) to provide a driver or firmware combination that resolves the ICE_ERR_NO_MEMORY condition.
  3. Reboot Affected Hosts: As a temporary workaround to clear the heap exhaustion, reboot the ESXi host. Note that the heap may eventually exhaust again if the underlying leak is not patched.
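
Before and after a driver update, it can help to confirm which driver and firmware a vmnic is running. The command esxcli network nic get -n <vmnic> prints this under "Driver Info". The Python sketch below parses such output once it has been saved as text; the helper names are hypothetical, and the parser assumes the output is a series of "Key: value" lines. Because the issue may not be limited to the versions named in this article, a non-matching version does not prove a host is unaffected.

```python
def parse_nic_info(text):
    """Collect 'Key: value' pairs from esxcli-style output into a dict."""
    info = {}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Turn a dotted version such as '1.14.2.0' into a tuple of integers."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


# Driver version on which this article reports the issue was seen.
OBSERVED_DRIVER = version_tuple("1.14.2.0")


def is_observed_version(info):
    """True when the NIC uses icen at the driver version named in this article."""
    return (
        info.get("Driver") == "icen"
        and version_tuple(info.get("Version", "")) == OBSERVED_DRIVER
    )
```

For example, pass the saved command output to parse_nic_info() and then call is_observed_version() on the result. Hosts that match are candidates for the driver update in step 1.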

Additional Information

Subscribe to this knowledge article to get updates on this issue.