All Storage paths down after switch reboot, NVMe RDMA with Intel NICs

Article ID: 405488


Products

VMware vSphere ESXi

Issue/Introduction

  • NVMe/RDMA is enabled
  • Intel E810-XXVDA2 or E810-XXVDA4 NICs are in use with driver/firmware at or below icen-2.1.1.0 + irdman-1.5.1.0 + NVM 4.7
  • After switch maintenance that requires a reboot, all storage paths are down across all hosts in the cluster
  • Rebooting the hosts brings the paths back online
  • The driver/firmware combination icen 1.15.4.0 with firmware 4.6 may cause hosts to fail with a purple diagnostic screen (PSOD) showing the backtrace below:

2025-06-27T02:28:43.524Z cpu19:9098185)Backtrace for current CPU #19, worldID=9098185, fp=0x43204a6f4990
2025-06-27T02:28:43.524Z cpu19:9098185)0x453a2a81bdf0:[0x42000863f679]irdma_irq_spinlock_acquire@(irdman)#<None>+0x1 stack: 0x431da4456480, 0x404a817760, 0x4321d0562c40, 0x4321d0562dd0, 0x4321d0562c48
2025-06-27T02:28:43.524Z cpu19:9098185)0x453a2a81be00:[0x4200086431d8]irndrv_RDMAOpPollComplQueue@(irdman)#<None>+0x49 stack: 0x4321d0562c40, 0x4321d0562dd0, 0x4321d0562c48, 0x369, 0x43204a817130
2025-06-27T02:28:43.524Z cpu19:9098185)0x453a2a81bed0:[0x420008347626]vmk_RDMAPollComplQueue@com.vmware.rdma#1+0x43 stack: 0x42000834760c, 0x453a2a81bfa0, 0x0, 0x0, 0x17
2025-06-27T02:28:43.524Z cpu19:9098185)0x453a2a81bf10:[0x4200085fe5c6]nr_CompletionWorld@(nvmerdma)#<None>+0xeb stack: 0x43204a6d1080, 0x43204a8291d0, 0x0, 0x43200bad0069, 0x43204a8ae850

  • In vmkernel.log, messages similar to the following are seen:

2025-05-21T00:10:04.792Z In(182) vmkernel: cpu0:2097582)NVMFEVT:330 Received event 0 (0x4313e9dc4900) for vmhba## event queue.
2025-05-21T00:13:11.792Z In(182) vmkernel: cpu0:2097582)NVMFEVT:330 Received event 1 (0x4313e9dc4900) for vmhba## event queue.
2025-05-21T00:36:10.554Z Wa(180) vmkwarning: cpu35:2099251)WARNING: irdman: irndrv_RDMAOpAllocFastRegPageList:5813: PF Reset ongoing. Operation cannot be executed.
2025-05-21T00:36:10.554Z In(182) vmkernel: cpu35:2099251)nvmerdma:1602 [ctlr 266, queue 0] failed to allocate fast reg page list: Failure
2025-05-21T00:36:10.554Z In(182) vmkernel: cpu35:2099251)nvmerdma:411 [ctlr 266, queue 0] failed to allocate FRMR: Failure
2025-05-21T00:36:10.554Z In(182) vmkernel: cpu35:2099251)nvmerdma:886 [ctlr 266, queue 0] Failed to reset: Failure
2025-05-21T00:36:10.554Z In(182) vmkernel: cpu35:2099251)nvmerdma:1928 [ctlr 266, queue 0] reset failed: Failure
2025-05-21T00:36:10.554Z In(182) vmkernel: cpu35:2099251)NVMEDEV:7939 Controller 266, queue 0 reset complete. Status Failure
2025-05-21T00:36:10.554Z Wa(180) vmkwarning: cpu35:2099251)WARNING: NVMEDEV:8276 Failed to restart admin queue for controller 266, status: Failure
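Whether a host carries an affected icen build can be determined by comparing the installed driver version (as reported by `esxcli software vib list` on the ESXi host) against the last affected release. The following is a minimal sketch of that comparison; the `installed` value is a placeholder to be replaced with the version reported on the host:

```shell
#!/bin/sh
# Hypothetical check: compare an installed icen driver version against the
# last affected release (2.1.1.0, per this article). Substitute the version
# string reported by 'esxcli software vib list' on the host.
installed="2.1.1.0"
affected_max="2.1.1.0"

# 'sort -V' orders dotted version strings numerically; if the installed
# version sorts at or before the affected maximum, the host is exposed.
lowest=$(printf '%s\n%s\n' "$installed" "$affected_max" | sort -V | head -n1)
if [ "$installed" = "$lowest" ]; then
    echo "affected: icen $installed is <= $affected_max"
else
    echo "not affected: icen $installed is newer than $affected_max"
fi
```

Note that `sort -V` is a GNU coreutils extension; on the ESXi busybox shell the version comparison may need to be done by inspection instead.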

Environment

VMware vSphere ESXi

NVMe RDMA

Cause

There are two contributing issues:

  1. An unexpected queue pair (QP) event generated by the switch reboot causes the Intel driver to trigger a PF (physical function) reset.
  2. The nvmerdma driver leaves some resources uncleaned during the Intel driver's PF reset, so the storage paths do not recover until the host is rebooted.

Resolution

Both VMware and Intel engineering are aware of this issue. VMware is working on a code fix, and Intel is working on a new driver/firmware release to resolve it.

Workaround
Do not enable NVMe/RDMA until a fix is available.
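To verify that no HBAs are still bound to the RDMA transport, the NVMe adapters on each host can be reviewed with `esxcli nvme adapter list`. The sketch below filters such a listing for RDMA-bound adapters; the sample output (adapter name, qualified name, column layout) is hypothetical and varies by ESXi release:

```shell
#!/bin/sh
# Hypothetical sample of 'esxcli nvme adapter list' output; the adapter
# name and qualified name below are illustrative only.
adapters='Adapter  Adapter Qualified Name     Transport Type  Driver
-------  -------------------------  --------------  --------
vmhba65  aqn:nvmerdma:sample-host   RDMA            nvmerdma'

# Print any adapter whose transport type (column 3) is RDMA,
# skipping the two header lines.
echo "$adapters" | awk 'NR > 2 && $3 == "RDMA" { print "NVMe/RDMA enabled on " $1 }'
```

Any adapter reported this way is still using the NVMe/RDMA transport and would need to be reconfigured until the fixed driver/firmware is available.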