NVMe over RoCE “I/O stall: ESXi 8.0u3e host NVMEIO Transport driver failed to submit cmd”

Article ID: 408046

Products

VMware vSphere ESXi

Issue/Introduction

  • When using NVMe over RoCE on Mellanox 100 Gb/s and 200 Gb/s NICs, VMs hang and cannot be rebooted until they are migrated to another host.
  • The host where the problem occurred must be rebooted to re-establish the connection and allow VMs to boot.
  • The host logs show warnings such as the following when the issue occurs:

2025-06-03T17:27:13.249Z Wa(180) vmkwarning: cpu19:1049527)WARNING: NVMEIO:2023 Transport driver failed to submit cmd 0x45994efc4e00, C: nqn.1992-08.com.######:####.600#############################vmhba###192.#.##.###:####, Q: 2 <0xbad0014>.
2025-06-03T17:27:13.249Z Wa(180) vmkwarning: cpu19:1049527)WARNING: NVMEPSA:217 Complete vmkNvmeCmd: 0x45994efc4e00, vmkPsaCmd: 0x459a13816980, cmdId.initiator=0x430957e18580, CmdSN: 0x8000001b, status: 0x801

Environment

ESXi 8.x or 9.x
NVMe over RoCE

Cause

Due to a limitation in the current nvme/rdma driver, an IO queue size over 256 leads to a bounce buffer out-of-memory condition.
This can be seen in the vmkernel log:

2025-08-01T10:31:30.585Z In(182) vmkernel: cpu19:1048643)nvmerdma:3767 [ctlr 261, queue 1] cmd 0x457938583640, failed to allocate bounce buffer: Out of memory
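
To confirm that a host is hitting this condition, the log can be searched for the bounce buffer message shown above. The following is a minimal sketch, not an official diagnostic: the function name `count_bounce_failures` is hypothetical, and the default path `/var/log/vmkernel.log` is an assumption that may need adjusting for rotated or redirected logs.

```shell
# Count bounce-buffer allocation failures in a vmkernel log file.
# Usage: count_bounce_failures [logfile]
# The default log path below is an assumption, not taken from the article.
count_bounce_failures() {
    logfile="${1:-/var/log/vmkernel.log}"
    # grep -c prints the number of matching lines. It exits non-zero when the
    # count is 0, so "|| true" keeps the function usable under "set -e".
    grep -c 'failed to allocate bounce buffer' "$logfile" || true
}
```

A non-zero count suggests the host has run into the out-of-memory condition described in this section.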

Resolution

The driver will be enhanced in future releases of 8.0 U3 and 9.0 to better handle large IO queue sizes.

As a workaround, use the following commands to set the queue size on the host to 256 or less, or set the queue size at the storage target using vendor-specific commands. Once the IO queue size is set to 256 or less, the stalls stop occurring.

  • Check the queue size. By default the Value column is blank:
    esxcli system module parameters list -m vmknvme
    Name                                   Type  Value  Description
    -------------------------------------  ----  -----  -----------
    vmknvme_io_queue_size                  uint         IO queue size: [8, 1024]
    vmknvme_total_io_queue_size            uint         Aggregated IO queue size of a controller, MIN: 64, MAX: 4096
  • Set the queue size: 
    esxcli system module parameters set -m vmknvme -p vmknvme_io_queue_size=256

  • Check the queue size again:
    esxcli system module parameters list -m vmknvme
    Name                                   Type  Value  Description
    -------------------------------------  ----  -----  -----------
    vmknvme_io_queue_size                  uint  256    IO queue size: [8, 1024]
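
The final check can also be scripted, for example when verifying several hosts. The sketch below reads the output of the list command on standard input and reports whether `vmknvme_io_queue_size` is explicitly set to 256 or less. The function name `check_io_queue_size` and its status words (`OK`, `TOO_LARGE`, `UNSET`, `MISSING`) are hypothetical, and it assumes the column layout shown above. Module parameters are generally read when a module loads, so the new value may not take effect until the host is rebooted; the article does not state this, so treat it as an assumption.

```shell
# Read "esxcli system module parameters list -m vmknvme" output on stdin and
# report whether vmknvme_io_queue_size is explicitly set to 256 or less.
# Exit status is 0 only when a value of 256 or less is found.
check_io_queue_size() {
    awk '
        $1 == "vmknvme_io_queue_size" {
            found = 1
            # When a value is set, field 3 is the number. When it is blank,
            # field 3 is the first word of the description ("IO").
            if ($3 ~ /^[0-9]+$/ && $3 + 0 <= 256) { print "OK " $3; ok = 1 }
            else if ($3 ~ /^[0-9]+$/)             { print "TOO_LARGE " $3 }
            else                                  { print "UNSET" }
        }
        END { if (!found) print "MISSING"; exit (ok ? 0 : 1) }
    '
}
```

On a host, it could be used as `esxcli system module parameters list -m vmknvme | check_io_queue_size`.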