ESXi host accessing NVMe over RDMA storage becomes unresponsive due to high latency

Products

VMware vSphere ESXi

Issue/Introduction

Some ESXi hosts accessing NVMe over RDMA storage become unresponsive
The affected hosts have to be rebooted to recover
The issue recurs after some time
Aborts and latency are observed on the ESXi hosts

Environment

VMware vSphere ESXi 8.0.x

Cause

This issue occurs when extreme, sustained latency on the NVMe devices leads to I/O failures and I/O aborts.

Logging similar to the following may be observed:
/var/log/vmkernel.log first reports increasing I/O latency:

vmkwarning: cpu2:2098293)WARNING: StorageDeviceIO: 201: Device eui.################################# performance has deteriorated. I/O latency increased from average value of 2675 microseconds to 57947 microseconds.
vmkwarning: cpu2:2097265)WARNING: StorageDeviceIO: 201: Device eui.################################# performance has deteriorated. I/O latency increased from average value of 2676 microseconds to 56284 microseconds.
vmkwarning: cpu32:2098294)WARNING: StorageDeviceIO: 201: Device eui.################################# performance has deteriorated. I/O latency increased from average value of 808 microseconds to 22666 microseconds.
vmkwarning: cpu32:2098294)WARNING: StorageDeviceIO: 201: Device eui.################################# performance has deteriorated. I/O latency increased from average value of 764 microseconds to 23598 microseconds
....
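
To gauge how far latency has climbed, the StorageDeviceIO warnings above can be summarized with a short script. The following is a minimal sketch, not a supported tool: the function name `parse_latency_warnings` is hypothetical, and the pattern assumes the message wording matches the excerpts shown in this article.

```python
import re

# Pattern for the StorageDeviceIO latency warning, based on the
# message wording in the excerpts above (assumed to be stable).
LATENCY_RE = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<cur>\d+) microseconds"
)


def parse_latency_warnings(lines):
    """Return (device, average_us, current_us, ratio) for each matching line."""
    results = []
    for line in lines:
        match = LATENCY_RE.search(line)
        if not match:
            continue  # not a latency warning
        avg = int(match.group("avg"))
        cur = int(match.group("cur"))
        # Guard against a zero average to avoid division by zero.
        ratio = cur / avg if avg else float("inf")
        results.append((match.group("device"), avg, cur, ratio))
    return results
```

Passing the lines of a copy of /var/log/vmkernel.log to the function yields one tuple per warning. For the first excerpt above, the ratio is roughly 21.7 (57947 / 2675).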
Later, as latency continues to build, there are compare failures (NVMe status 0x285, relating to ATS commands):

vmkwarning: cpu23:2100488)WARNING: NVMEIO:2645 command 0x45dab4c14480 failed: ctlr 262, queue 3, psaCmd 0x45dab53f14c0, status 0x285, opc 0x5, cid 2, nsid 39
vmkwarning: cpu23:2100488)WARNING: NVMEPSA:217 Complete vmkNvmeCmd: 0x45dab4c14480, vmkPsaCmd: 0x45dab53f14c0, cmdId.initiator=0x430b0855b4c0, CmdSN: 0x85fd581, status: 0x285
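
The status value can be split into its NVMe Status Code Type (SCT) and Status Code (SC) fields to confirm what kind of failure was logged. This is a minimal sketch that assumes the logged value packs SC in bits 7:0 and SCT in bits 10:8, as in the NVMe completion status layout; the helper name `decode_nvme_status` and the small lookup tables are illustrative and far from exhaustive.

```python
# Assumption: the logged status is (SCT << 8) | SC.
SCT_NAMES = {
    0x0: "Generic Command Status",
    0x1: "Command Specific Status",
    0x2: "Media and Data Integrity Errors",
}

# Only the code relevant to this article is listed here.
KNOWN_STATUS = {
    (0x2, 0x85): "Compare Failure",
}


def decode_nvme_status(status):
    """Split a logged status value into (status code type, status code)."""
    sct = (status >> 8) & 0x7  # Status Code Type: 3 bits
    sc = status & 0xFF         # Status Code: 8 bits
    return sct, sc


def describe_nvme_status(status):
    """Return a readable description of the status value."""
    sct, sc = decode_nvme_status(status)
    sct_name = SCT_NAMES.get(sct, "Unknown SCT")
    sc_name = KNOWN_STATUS.get((sct, sc), "not in this table")
    return f"SCT 0x{sct:x} ({sct_name}), SC 0x{sc:02x} ({sc_name})"
```

Under that assumption, `decode_nvme_status(0x285)` gives SCT 0x2 and SC 0x85, which is the Compare Failure status. This is consistent with opc 0x5 (Compare) in the log line above.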

I/O aborts are also logged:

vmkernel: cpu4:2097851)nvmerdma:4222 [ctlr 262, queue 0] abortCmd 0x45baeac29540, sqid 5, cid 79, new cid 23.
vmkernel: cpu18:2098309)NVMEIO:3974 Ctlr 262, ns 39, tmReq 0x431d310407e0, type 2, initiator 0x430fbe0599c0, sn 0x0, world id 4833892.
vmkernel: cpu3:2097854)NVMEIO:4654 ctlr 262, queue 7, cid 21, cap 0x1, count 0, found cmd 0x45dad262df40 (initiator 0x430fbe0599c0, serialNumber 0x800e0015, worldID 4833892)
Additionally, the NVMe controller admin queue may become inaccessible because it is full, e.g.:

vmkwarning: cpu14:2097853)WARNING: NVMEIO:4814 Failed to get reference to admin queue for controller 264.
vmkernel: cpu3:2098309)NVMEIO:3974 Ctlr 268, ns 22, tmReq 0x431d316b8d20, type 2, initiator 0x430b08519200, sn 0x0, world id 2097224.
vmkernel: cpu5:2097850)NVMEIO:4776 cmd2Abort 0x45baeaced740, opcode 0x2, nsid 22, lba 8057569296, lbc 511
vmkwarning: cpu33:2097849)WARNING: NVMEIO:3815 Ctlr 264, nvmeCmd 0x45dad272cb40, a:29, r:1, admin queue is full
vmkernel: cpu15:2097848)NVMEIO:4776 cmd2Abort 0x45bae7f48c00, opcode 0x2, nsid 22, lba 8057995296, lbc 111
vmkwarning: cpu8:2097851)WARNING: NVMEIO:3815 Ctlr 264, nvmeCmd 0x45bae7ef3e00, a:29, r:1, admin queue is full
vmkwarning: cpu6:2097855)WARNING: NVMEIO:3815 Ctlr 264, nvmeCmd 0x45baeacdcd40, a:29, r:1, admin queue is full
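
To see which controllers are affected and how often, the "admin queue is full" warnings can be tallied per controller. The sketch below assumes the message format shown in the excerpts above; `count_admin_queue_full` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVMEIO:3815 Ctlr 264, nvmeCmd 0x45dad272cb40, a:29, r:1, admin queue is full
ADMIN_FULL_RE = re.compile(
    r"Ctlr (\d+), nvmeCmd 0x[0-9a-fA-F]+, .*admin queue is full"
)


def count_admin_queue_full(lines):
    """Count 'admin queue is full' warnings per controller number."""
    counts = Counter()
    for line in lines:
        match = ADMIN_FULL_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

A controller whose count keeps growing over time is a reasonable starting point when correlating with latency on the fabric or the array.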

Resolution

Confirm that the NICs carrying RDMA traffic are running the correct driver and firmware versions as per the Broadcom Compatibility Guide.
Confirm that Priority Flow Control (PFC) is enabled on these interfaces.

Run the following for each NIC carrying RDMA traffic, replacing vmnic# with the NIC name:
esxcli network nic dcb status get -n vmnic# | grep "PFC Enabled"
Investigate the causes of latency at the physical network and storage levels.