ESXi 8.x host fails to recover NVMe over TCP controller upon a reboot of the array controller.

Products

VMware vSphere ESXi 8.0

Issue/Introduction

When an NVMe over TCP controller on the array is rebooted for any reason, such as a firmware upgrade, an ESXi 8.x host fails to recover the dead paths, unlike an ESXi 7.x host.

On an ESXi 7.x host, /var/run/log/vmkernel.log shows the paths being marked dead when a controller on the array goes down during a firmware upgrade:

YYYY-MM-DDT19:18:46.215Z cpu0:2097464)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L21" state changed from "active" to "dead"
YYYY-MM-DDT19:18:46.216Z cpu17:2097464)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L1" state changed from "active" to "dead"

The paths recover once the array controller is back after being down for a short period:

YYYY-MM-DDT19:21:43.774Z cpu0:2102437)WARNING: NVMEDEV:9569 Controller 263 recovery already active.
YYYY-MM-DDT19:21:55.526Z cpu0:2102332)WARNING: NVMEDEV:9569 Controller 257 recovery already active.

YYYY-MM-DDT19:22:04.753Z cpu8:2102223)NvmeDiscover: 671: controller = nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx#vmhba66#10.10.xxx.xxx:xxxx
YYYY-MM-DDT19:22:04.753Z cpu1:2097312)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L1" state changed from "dead" to "active"
YYYY-MM-DDT19:22:04.755Z cpu8:2102223)NvmeDiscover: 671: controller = nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx#vmhba66#10.10.xxx.xxx:xxxx
YYYY-MM-DDT19:22:04.755Z cpu10:2097312)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L21" state changed from "dead" to "active"
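
When reviewing a long vmkernel.log, it can help to reduce the path transitions to the last state reported for each path. The following is a minimal Python sketch, not a VMware tool. The names final_path_states and unrecovered_paths are hypothetical, and it assumes the log contains HppPathGroupMovePath lines in the format shown above. It would typically be run against a copy of the log, off the host.

```python
import re

# Matches HPP path state transitions in the format shown in the excerpt, e.g.
#   HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L1" state changed from "active" to "dead"
PATH_STATE_RE = re.compile(
    r'Path "(?P<path>[^"]+)" state changed from "(?P<old>\w+)" to "(?P<new>\w+)"'
)


def final_path_states(lines):
    """Return {path: last reported state} for every path seen in the log lines."""
    states = {}
    for line in lines:
        match = PATH_STATE_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last transition wins.
            states[match.group("path")] = match.group("new")
    return states


def unrecovered_paths(lines):
    """Return the sorted list of paths whose last reported state is "dead"."""
    return sorted(
        path for path, state in final_path_states(lines).items() if state == "dead"
    )
```

Passing an open file handle for a copy of vmkernel.log to unrecovered_paths should return an empty list when every path that went dead later returned to active. Paths that are marked dead through other messages are not covered by this pattern.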

On ESXi 8.0.3, the host fails to reconnect to the array controller after it was down for a short period:

YYYY-MM-DDT19:18:44.685Z Wa(180) vmkwarning: cpu2:2122519)WARNING: NvmeDiscover: 5489: Mark path vmhba66:C0:T0:L1 as NO_CONNECT
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEPSA:1631 adpater: vmhba66, action: 1
YYYY-MM-DDT19:18:44.685Z Wa(180) vmkwarning: cpu2:2122519)WARNING: NvmeDiscover: 5489: Mark path vmhba66:C0:T0:L21 as NO_CONNECT
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEPSA:1631 adpater: vmhba66, action: 1

YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5534 Controller 263, destroy namespace 2
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5570 Destroyed namespace 2, controller 263
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5534 Controller 263, destroy namespace 3
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5570 Destroyed namespace 3, controller 263

YYYY-MM-DDT19:18:45.144Z In(182) vmkernel: cpu4:2097504)HPP: HppNvmeUpdateNamespaces:535: Marking paths dead - controller:nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx#vmhba66#10.10.xxx.xxx:xxxx

YYYY-MM-DDT19:22:01.945Z Wa(180) vmkwarning: cpu21:2122651)WARNING: NVMFDEV:887 Controller 263 found while it's being deleted.
YYYY-MM-DDT19:22:01.945Z In(182) vmkernel: cpu8:2122650)NVMFDEV:172 target type: NVMe
YYYY-MM-DDT19:22:01.945Z In(182) vmkernel: cpu8:2122650)NVMFDEV:180 vmkParams.asqsize: 31
YYYY-MM-DDT19:22:01.945Z Wa(180) vmkwarning: cpu21:2122651)WARNING: NVMFEVT:1070 Failed to connect controller nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx, status: Failure
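
To check whether a host's log contains the same messages as the ESXi 8.0.3 excerpt above, the lines can be scanned for the distinctive message fragments. The sketch below is illustrative only. The function names and the choice of which fragments to treat as the signature are assumptions, and the patterns are taken from this excerpt, so the wording may differ on other builds.

```python
import re

# Message fragments copied from the ESXi 8.0.3 log excerpt above.
SIGNATURES = {
    "no_connect": re.compile(r"Mark path (\S+) as NO_CONNECT"),
    "stale_controller": re.compile(r"Controller (\d+) found while it's being deleted"),
    "connect_failed": re.compile(r"Failed to connect controller (\S+), status: Failure"),
}


def find_signature_events(lines):
    """Return (timestamp, kind, detail) tuples for lines matching a known fragment."""
    events = []
    for line in lines:
        for kind, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                # The timestamp is the first whitespace-delimited field of the line.
                timestamp = line.split(" ", 1)[0]
                events.append((timestamp, kind, match.group(1)))
                break
    return events


def looks_like_stale_controller_issue(lines):
    """True if both the stale-controller warning and the connect failure appear."""
    kinds = {kind for _, kind, _ in find_signature_events(lines)}
    return {"stale_controller", "connect_failed"} <= kinds
```

A True result only indicates that the log resembles the excerpt. It does not by itself confirm the cause.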

Environment

VMware ESXi 8.0 U3

Cause

ESXi 8.0 provides persistent discovery controller support. With this support in place, the host hits an unhandled scenario in the driver that leaves a stale controller in the system. ESXi 7.0.3 does not have persistent discovery controller support and therefore does not run into this issue.

The existence of a persistent discovery controller can be verified with the command: esxcli nvme controller list
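
If the output of that command has been saved to a text file, the discovery controller entries can be picked out with a simple filter. This is a rough sketch under stated assumptions: discovery_controller_lines is a hypothetical helper, and it assumes the controller name contains the well-known discovery NQN from the NVMe over Fabrics specification. Arrays that report a unique discovery NQN would need that value passed in instead.

```python
# Well-known discovery service NQN defined by the NVMe over Fabrics specification.
WELL_KNOWN_DISCOVERY_NQN = "nqn.2014-08.org.nvmexpress.discovery"


def discovery_controller_lines(esxcli_output, discovery_nqn=WELL_KNOWN_DISCOVERY_NQN):
    """Return the output lines that mention the given discovery NQN.

    esxcli_output is the captured text of `esxcli nvme controller list`.
    No assumption is made about column layout; this is a substring match only.
    """
    return [
        line.strip()
        for line in esxcli_output.splitlines()
        if discovery_nqn in line
    ]
```

An empty result means no line matched the NQN that was searched for. It does not prove that no discovery controller exists.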

Resolution

The issue is fixed in ESXi 8.0 Update 3e.

Additional Information

Release Notes for ESXi 8.0 Update 3e - VMware ESXi 8.0 Update 3e Release Notes