ESXi host becomes unresponsive on high latency in NVMe over TCP storage

Article ID: 432595

Products

VMware vSphere ESXi

Issue/Introduction

  • In the vSphere Client, ESXi hosts configured with an NVMe over TCP adapter intermittently show as "Not Responding".
  • A subset of VMs may experience task failures or hang at random.
  • Running esxcli commands or df -h on the host may hang or take several minutes to respond.
  • In the vSphere Client, tasks become stuck in the "Running" state, often at 0% or 20% (for example, 'Refresh Storage System' or 'Update option values').

Environment

VMware vSphere ESXi 8.0 

Cause

Packet drops on the NVMe/TCP uplinks cause the storage connection to repeatedly disconnect and reconnect. This leads to high storage latency, which in turn causes hostd to stall on storage operations.

  • hostd.log indicates that hostd threads are stuck or taking excessive time on file operations (stat, open, access) against the storage:

YYYY-MM-DDTHH:MM:SS.###Z warning hostd[<pid>] [Originator@6876 sub=IoTracker] In thread <thread id>, stat("/vmfs/volumes/########-########-####-############/path") took over 2714 sec.
YYYY-MM-DDTHH:MM:SS.###Z warning hostd[<pid>] [Originator@6876 sub=IoTracker] In thread <thread id>, open("/vmfs/volumes/########-########-####-############/path") took over 2814 sec.
YYYY-MM-DDTHH:MM:SS.###Z warning hostd[<pid>] [Originator@6876 sub=IoTracker] In thread <thread id>, access("/vmfs/volumes/########-########-####-############/path") took over 4049 sec.
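To find these entries quickly in a large hostd.log, the IoTracker warnings can be filtered and sorted by duration. The following is a minimal sketch in Python, assuming the log lines follow the format shown in the excerpt above. The function name slow_calls and the min_secs parameter are hypothetical and used only for illustration.

```python
import re

# Matches the IoTracker warning format shown in the hostd.log excerpt.
# Assumption: real log lines follow the same layout as the excerpt.
IOTRACKER_RE = re.compile(
    r'sub=IoTracker\] In thread (?P<thread>[^,]+), '
    r'(?P<call>\w+)\("(?P<path>[^"]*)"\) took over (?P<secs>\d+) sec'
)

# Sample line copied from the excerpt (placeholders left as-is).
SAMPLE_LINE = (
    'YYYY-MM-DDTHH:MM:SS.###Z warning hostd[<pid>] '
    '[Originator@6876 sub=IoTracker] In thread <thread id>, '
    'stat("/vmfs/volumes/########-########-####-############/path") '
    'took over 2714 sec.'
)


def slow_calls(lines, min_secs=0):
    """Yield (call, path, seconds) for IoTracker warnings at or above min_secs."""
    for line in lines:
        match = IOTRACKER_RE.search(line)
        if match and int(match.group("secs")) >= min_secs:
            yield match.group("call"), match.group("path"), int(match.group("secs"))


# Print the slowest calls first.
for call, path, secs in sorted(slow_calls([SAMPLE_LINE]), key=lambda row: -row[2]):
    print(f"{secs:>6}s  {call:<8} {path}")
```

To use this against a real log, pass an open file object for a copy of hostd.log to slow_calls instead of the sample list.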

  • I/O errors specific to NVMe may also appear in the logs: 

YYYY-MM-DDTHH:MM:SS.###Z  Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 1 (expect 65535) not available, nvmeStatus 80e
YYYY-MM-DDTHH:MM:SS.###Z  Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 2 (expect 65535) not available, nvmeStatus 80e
YYYY-MM-DDTHH:MM:SS.###Z  Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 3 (expect 65535) not available, nvmeStatus 80e
YYYY-MM-DDTHH:MM:SS.###Z  Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 4 (expect 65535) not available, nvmeStatus 80e
YYYY-MM-DDTHH:MM:SS.###Z  Wa(180) vmkwarning: cpu21:2098710)WARNING: NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 5 (expect 65535) not available, nvmeStatus 80e

Note: The excerpts above are examples; the opcodes and error sequence may vary.
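When many of these warnings are logged, it can help to summarize them per controller. The following is a minimal sketch in Python, assuming the warnings follow the format shown in the excerpt above. The function name summarize is hypothetical, and the sketch only counts and groups the fields; it does not interpret the opcode or status values.

```python
import re
from collections import defaultdict

# Matches the "queue ... not available" NVMEIO warning shown in the excerpt.
# Assumption: real log lines follow the same layout as the excerpt.
NVME_RE = re.compile(
    r'NVMEIO:\d+ Ctlr (?P<ctlr>\d+), nvmeCmd \S+ \(opc (?P<opc>[0-9a-fA-F]+)\), '
    r'queue (?P<queue>\d+) .*?not available, nvmeStatus (?P<status>[0-9a-fA-F]+)'
)

# Two sample lines copied from the excerpt (timestamps are placeholders).
SAMPLE_LINES = [
    'YYYY-MM-DDTHH:MM:SS.###Z  Wa(180) vmkwarning: cpu21:2098710)WARNING: '
    'NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 1 '
    '(expect 65535) not available, nvmeStatus 80e',
    'YYYY-MM-DDTHH:MM:SS.###Z  Wa(180) vmkwarning: cpu21:2098710)WARNING: '
    'NVMEIO:3649 Ctlr 263, nvmeCmd 0x45bade2e3800 (opc 02), queue 2 '
    '(expect 65535) not available, nvmeStatus 80e',
]


def summarize(lines):
    """Group matching warnings by controller ID.

    Returns {controller_id: {"count": int, "queues": set, "statuses": set}}.
    """
    summary = defaultdict(lambda: {"count": 0, "queues": set(), "statuses": set()})
    for line in lines:
        match = NVME_RE.search(line)
        if not match:
            continue
        entry = summary[int(match.group("ctlr"))]
        entry["count"] += 1
        entry["queues"].add(int(match.group("queue")))
        entry["statuses"].add(match.group("status"))
    return dict(summary)


for ctlr, info in summarize(SAMPLE_LINES).items():
    print(f"Ctlr {ctlr}: {info['count']} warnings, "
          f"queues {sorted(info['queues'])}, statuses {sorted(info['statuses'])}")
```

A controller that reports the warning across many queues in a short time window is consistent with the repeated disconnect and reconnect behavior described above.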

  • Running esxcli network nic stats get -n vmnic# against the vmnic used for NVMe/TCP shows non-zero packet drop and error counts:
NIC statistics for vmnic#:
      Packets received: 158585149637
      Packets sent: 48571680247
      Bytes received: 234866890345430
      Bytes sent: 49571177363751
      Receive packets dropped: 22671
      Transmit packets dropped: 0
      Multicast packets received: 2745626351
      Broadcast packets received: 9117019440
      Multicast packets sent: 8172746
      Broadcast packets sent: 2139484
      Total receive errors: 5909

Note: In a healthy environment, the drop and error counters should be zero or statistically negligible relative to the total packet count.
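To put the counters in proportion, the output can be parsed and the drops and errors expressed as a fraction of packets received. The following is a minimal sketch in Python, assuming the output format shown above. The function names are hypothetical. Because these counters are cumulative, the sketch also includes a helper that compares two samples taken some time apart, on the assumption that counters still growing between samples point to an ongoing problem.

```python
# Sample output copied from the example above.
SAMPLE_OUTPUT = """NIC statistics for vmnic#:
      Packets received: 158585149637
      Packets sent: 48571680247
      Bytes received: 234866890345430
      Bytes sent: 49571177363751
      Receive packets dropped: 22671
      Transmit packets dropped: 0
      Multicast packets received: 2745626351
      Broadcast packets received: 9117019440
      Multicast packets sent: 8172746
      Broadcast packets sent: 2139484
      Total receive errors: 5909
"""


def parse_nic_stats(text):
    """Parse 'Name: value' lines into a dict; lines without a numeric value are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def rx_ratios(stats):
    """Return receive drops and receive errors as fractions of packets received."""
    received = stats.get("Packets received", 0)
    if received == 0:
        return {"drop_ratio": 0.0, "error_ratio": 0.0}
    return {
        "drop_ratio": stats.get("Receive packets dropped", 0) / received,
        "error_ratio": stats.get("Total receive errors", 0) / received,
    }


def counter_deltas(before, after):
    """Return the per-counter difference between two parsed samples."""
    return {name: after[name] - before[name] for name in after if name in before}


stats = parse_nic_stats(SAMPLE_OUTPUT)
ratios = rx_ratios(stats)
print(f"Receive drop ratio:  {ratios['drop_ratio']:.3e}")
print(f"Receive error ratio: {ratios['error_ratio']:.3e}")
```

To check whether the counters are still increasing, capture the command output twice, parse both captures with parse_nic_stats, and pass the results to counter_deltas.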

Resolution

Engage the hardware vendor to investigate the physical network adapter (NIC) errors. Because these errors occur at the hardware level and are merely passed up to the ESXi host, vendor assistance is required to determine the root cause.

For more information, see Troubleshooting and understanding physical NIC receive or transmit dropped, missed and error counters in ESXi.