ESXi host becomes unresponsive on high latency in NVMe over TCP storage

Products

VMware vSphere ESXi

Issue/Introduction

In the vSphere Client, ESXi hosts configured with an NVMe over TCP adapter intermittently show as "Not Responding".
A subset of VMs may experience task failures or hang at random.
Running esxcli or df -h on the host may hang or take several minutes to respond.
In the vSphere Client, tasks become stuck in the "Running" state, often at 0% or 20% (e.g. 'Refresh Storage System', 'Update option values').

Environment

VMware vSphere ESXi 8.0

Cause

Packet drops on the NVMe/TCP uplinks cause the storage to disconnect and reconnect repeatedly, leading to high storage latency and an unresponsive hostd.

hostd.log shows the daemon stuck, or taking an excessive amount of time, while trying to access VM files on the storage:

YYYY-MM-DDTHH:MM:SS.###Z warning hostd[<pid>] [Originator@6876 sub=IoTracker] In thread <thread id>, stat("/vmfs/volumes/########-########-####-############/path") took over 2714 sec.
YYYY-MM-DDTHH:MM:SS.###Z warning hostd[<pid>] [Originator@6876 sub=IoTracker] In thread <thread id>, open("/vmfs/volumes/########-########-####-############/path") took over 2814 sec.
YYYY-MM-DDTHH:MM:SS.###Z warning hostd[<pid>] [Originator@6876 sub=IoTracker] In thread <thread id>, access("/vmfs/volumes/########-########-####-############/path") took over 4049 sec.
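
To check how widespread these delays are, the IoTracker warnings can be extracted from hostd.log with a short script. The following is a minimal sketch, not a supported tool: the function name slow_io_events and the min_secs parameter are hypothetical, and the pattern assumes the warning keeps exactly the shape shown in the excerpt above.

```python
import re

# Matches hostd IoTracker warnings shaped like the excerpt above, e.g.
#   ... sub=IoTracker] In thread 123, stat("/vmfs/volumes/...") took over 2714 sec.
# Assumption: the message format matches the excerpt; other builds may differ.
IOTRACKER_RE = re.compile(
    r'sub=IoTracker\] In thread (?P<thread>[^,]+), '
    r'(?P<op>\w+)\("(?P<path>[^"]*)"\) took over (?P<secs>\d+) sec'
)


def slow_io_events(lines, min_secs=0):
    """Yield (operation, path, seconds) for each IoTracker warning
    whose reported duration is at least min_secs."""
    for line in lines:
        match = IOTRACKER_RE.search(line)
        if match is None:
            continue
        secs = int(match.group("secs"))
        if secs >= min_secs:
            yield match.group("op"), match.group("path"), secs
```

The function accepts any iterable of lines, so it can be fed an open file handle for a copy of hostd.log taken from the host or from a support bundle. Sorting the results by the seconds field shows which datastore paths were blocked the longest.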

I/O errors specific to NVMe may also appear in the logs:

YYYY-MM-DDTHH:MM:SS.###Z Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 1 (expect 65535) not available, nvmeStatus 80e
YYYY-MM-DDTHH:MM:SS.###Z Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 2 (expect 65535) not available, nvmeStatus 80e
YYYY-MM-DDTHH:MM:SS.###Z Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 3 (expect 65535) not available, nvmeStatus 80e
YYYY-MM-DDTHH:MM:SS.###Z Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 4 (expect 65535) not available, nvmeStatus 80e
YYYY-MM-DDTHH:MM:SS.###Z Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 5 (expect 65535) not available, nvmeStatus 80e

Note: The excerpts above are examples; the opcodes and the error sequence may vary.
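
To see which controllers are affected and how often, the NVMEIO warnings can be tallied per controller, opcode and status. This is a rough sketch under stated assumptions: count_nvme_warnings is a hypothetical helper, and the pattern only matches the "queue ... not available" message shape from the excerpt above, so other NVMe warnings are ignored.

```python
import re
from collections import Counter

# Matches NVMEIO warnings shaped like the excerpt above, e.g.
#   WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x... (opc 02), queue 1 (...) not available, nvmeStatus 80e
# Assumption: only this message shape is of interest; others are skipped.
NVMEIO_RE = re.compile(
    r'NVMEIO:\d+ Ctlr (?P<ctlr>\d+), nvmeCmd 0x[0-9a-fA-F]+ '
    r'\(opc (?P<opc>[0-9a-fA-F]+)\), queue (?P<queue>\d+) .*?'
    r'not available, nvmeStatus (?P<status>[0-9a-fA-F]+)'
)


def count_nvme_warnings(lines):
    """Count matching warnings, keyed by (controller, opcode, nvmeStatus)."""
    counts = Counter()
    for line in lines:
        match = NVMEIO_RE.search(line)
        if match:
            key = (match.group("ctlr"), match.group("opc"), match.group("status"))
            counts[key] += 1
    return counts
```

Calling count_nvme_warnings on the lines of a vmkernel warning log and then using most_common() on the result lists the noisiest controller and status combinations first.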

Running esxcli network nic stats get -n vmnic# against the vmnic used for NVMe/TCP shows non-zero packet drop and error counts:

NIC statistics for vmnic#:
Packets received: 158585149637
Packets sent: 48571680247
Bytes received: 234866890345430
Bytes sent: 49571177363751
Receive packets dropped: 22671
Transmit packets dropped: 0
Multicast packets received: 2745626351
Broadcast packets received: 9117019440
Multicast packets sent: 8172746
Broadcast packets sent: 2139484
Total receive errors: 5909

Note: In a healthy environment, the drop and error counters should be zero or statistically negligible relative to the total packet count.
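
To put the counters in proportion, the output can be parsed and the receive drops and errors expressed as a fraction of the packets received. The sketch below is illustrative only: the helper names are hypothetical, the article does not define a numeric threshold (so none is applied here), and counter_delta assumes the counters are cumulative, which makes the difference between two snapshots more telling than a single reading.

```python
def parse_nic_stats(text):
    """Parse 'Name: value' lines of the NIC statistics output
    into a dict of integer counters. Non-numeric lines are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def rx_problem_rates(stats):
    """Return receive drops and receive errors as fractions of packets received."""
    received = stats["Packets received"]
    if received == 0:
        return {"drop_rate": 0.0, "error_rate": 0.0}
    return {
        "drop_rate": stats["Receive packets dropped"] / received,
        "error_rate": stats["Total receive errors"] / received,
    }


def counter_delta(before, after):
    """Per-counter increase between two parsed snapshots.
    Assumption: counters are cumulative and did not reset in between."""
    return {name: after[name] - before[name] for name in after if name in before}
```

Taking one snapshot, waiting while the symptoms are present, taking a second snapshot and passing both to counter_delta shows whether the drop and error counters are still increasing, which is useful evidence to hand to the hardware vendor.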

Resolution

Engage the hardware vendor to investigate the physical network adapter (NIC) errors. Because these errors occur at the hardware level and are merely passed up to the ESXi host, vendor assistance is required to determine the root cause.