VMs intermittently lose network connectivity on hosts with the ICEN driver and EDP standard mode enabled

Products

VMware vSphere ESXi
VMware NSX

Issue/Introduction

A VM randomly becomes unreachable on a particular host. Pings to the VM fail at both layer 2 and layer 3.
The VM is reachable again after migration to a different ESXi host.
Power cycling the VM (turning it off and on) does not change the outcome.
Bringing the host's uplinks down and back up also does not change the outcome.
You have EDP standard mode enabled on the host.
The physical NIC driver in use is icen version 1.15.2.0 or 1.14.x (see Determining Network/Storage firmware and driver version in ESXi).
If you capture packets on the physical NICs of the host running the source VM, filtering on the source VM's IP, the packets appear to leave the NIC. A typical packet capture command looks like the following:
- pktcap-uw --uplink vmnicX --capture UplinkRcvKernel,UplinkSndKernel --rcf "geneve and host <source VM IP>" -o - | tcpdump-uw -ner -
When the same packet capture command is run on the physical NIC of the destination ESXi host, the expected packets from the source VM do not arrive.
If you temporarily disable EDP on the destination host using the command 'esxcfg-vswitch -Y <vswitch name>', then the connectivity to the VM is restored.
Connectivity remains restored after EDP mode is enabled again on the host using the command 'esxcfg-vswitch -y <vswitch name>'.
Rebooting the destination host temporarily stops VMs on that host from losing connectivity, but the issue returns after a period of time.
Checking the physical NIC packet counters on the host with the following command shows that the rx drop or error counters are not incrementing:
- esxcli network nic stats get -n vmnicX
On the destination host, /var/run/log/vmkernel.log shows a failure to set up pNIC queues:
icen: indrv_EnsSetupRxQueue:1264: 0000:af:00.0: Failed to set up Rx queue 16. Shared queue data already exists, Status: VMK_FAILURE
On the destination host, /var/run/log/vmkernel.log shows a heap allocation failure:
WARNING: Heap: 3645: Heap pfHeap-icen already at its maximum size. Cannot expand
WARNING: Heap: 4105: Heap_Align(pfHeap-icen, 81920/81920 bytes, 64 align) failed. caller: 0x42002a38d1e5
The following message might be logged repeatedly in /var/run/log/vmkernel.log on the host:
Wa(180) vmkwarning: cpu1:2097581)WARNING: icen: indrv_UplinkPrivStatsGet:3632: rxQueue->vsi is a NULL pointer!
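
The log signatures above can be checked in bulk rather than by eye. The following is a minimal sketch, not a supported tool: it counts how many lines of a vmkernel.log match each of the messages quoted above. The function names and the signature labels are hypothetical, and the patterns assume that the numeric line markers inside the messages (for example 1264 or 3632) can differ between driver builds.

```python
import re

# Patterns derived from the log lines quoted in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
SIGNATURES = {
    "rx_queue_setup_failed": re.compile(
        r"icen: indrv_EnsSetupRxQueue:\d+: .*Failed to set up Rx queue \d+"
    ),
    "heap_at_max_size": re.compile(
        r"Heap pfHeap-icen already at its maximum size"
    ),
    "heap_align_failed": re.compile(
        r"Heap_Align\(pfHeap-icen, .*\) failed"
    ),
    "null_vsi_pointer": re.compile(
        r"icen: indrv_UplinkPrivStatsGet:\d+: rxQueue->vsi is a NULL pointer"
    ),
}


def count_signatures(lines):
    """Return a dict mapping each signature label to its number of matching lines."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log(path="/var/run/log/vmkernel.log"):
    """Count signature matches in a log file; undecodable bytes are replaced."""
    with open(path, errors="replace") as handle:
        return count_signatures(handle)
```

Calling scan_log() on the host, or scan_log("<path to a copy of vmkernel.log>") on another machine, returns the per-signature counts. Non-zero counts for the heap messages together with the Rx queue failure are consistent with the symptoms described here, but they do not replace checking the driver version.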

Environment

VMware vSphere ESXi 8.0
VMware vSphere ESXi 9.0
VMware NSX 4.2.X

Cause

When EDP is enabled on a host using the icen driver, the driver does not release heap memory properly, which leads to heap exhaustion. This causes NIC operation failures and connectivity problems.
The icen driver allocates small chunks from the heap when it requests the firmware to apply or remove a filter for Geneve/VXLAN. These allocations are not freed after the apply command, so they accumulate until the heap is exhausted and receive filters can no longer be set up.

Resolution

The issue is fixed in icen driver 2.3.3.0 for ESXi 8.x and ESXi 9.x. Reach out to your hardware vendor for more details on the fixed versions.
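
When checking many hosts, the driver version reported for each NIC can be compared against the versions named in this article. The sketch below is illustrative only and the function names are hypothetical. It treats 1.15.2.0 and any 1.14.x as affected, as listed under Issue/Introduction. Treating every version at or above 2.3.3.0 as fixed is an assumption, and versions not mentioned in this article are reported as unlisted rather than guessed at.

```python
import re

FIXED_VERSION = (2, 3, 3, 0)  # fixed driver version named in this article


def parse_version(text):
    """Extract the leading dotted numeric version, e.g. '1.14.2.0-1OEM...' -> (1, 14, 2, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def classify_icen_version(text):
    """Classify an icen driver version string against the versions in this article."""
    version = parse_version(text)
    # Affected versions as listed in the article: 1.15.2.0 or 1.14.x.
    if version == (1, 15, 2, 0) or version[:2] == (1, 14):
        return "affected"
    # Assumption: versions at or above the fixed release also contain the fix.
    if version >= FIXED_VERSION:
        return "fixed-or-later"
    # The article says nothing about other versions; confirm with the hardware vendor.
    return "unlisted"
```

For example, classify_icen_version("1.14.2.0") returns "affected". An "unlisted" result means this article does not cover that version.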
The temporary workaround is to disable EDP-Standard on the cluster until a fixed version of the driver is available. The steps for disabling EDP are as follows:

1. Create a new Transport Node Profile (TNP) with standard mode.
2. Select a host in the cluster and place it into maintenance mode.
3. Update the host transport node under System -> Fabric -> Hosts in the NSX UI to use standard mode, and wait for its state to become success.
4. Exit maintenance mode on the host.
5. Repeat steps 2 to 4 for all the hosts in the cluster.
6. After all transport nodes (TNs) in the cluster are updated successfully to standard mode, select the cluster under System -> Fabric -> Hosts in the NSX UI and change its TNP to the new standard TNP created in step 1.

These are the same as the manual steps for enabling EDP-Standard described in this Tech Doc, except that you enable standard mode instead. Consult the referenced technical documentation for more details.

Additional Information

For more information on EDP standard refer to the technical documentation:

Download and install async drivers in VMware ESXi
Determining Network/Storage firmware and driver version in ESXi