When an NVMe over TCP controller on the array reboots for any reason, such as a firmware upgrade, an ESXi 8.x host fails to recover the dead paths, unlike an ESXi 7.x host.
/var/run/log/vmkernel.log
YYYY-MM-DDT19:18:46.215Z cpu0:2097464)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L21" state changed from "active" to "dead"
YYYY-MM-DDT19:18:46.216Z cpu17:2097464)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L1" state changed from "active" to "dead"
YYYY-MM-DDT19:21:43.774Z cpu0:2102437)WARNING: NVMEDEV:9569 Controller 263 recovery already active.
YYYY-MM-DDT19:21:55.526Z cpu0:2102332)WARNING: NVMEDEV:9569 Controller 257 recovery already active.
YYYY-MM-DDT19:22:04.753Z cpu8:2102223)NvmeDiscover: 671: controller = nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx#vmhba66#10.10.xxx.xxx:xxxx
YYYY-MM-DDT19:22:04.753Z cpu1:2097312)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L1" state changed from "dead" to "active"
YYYY-MM-DDT19:22:04.755Z cpu8:2102223)NvmeDiscover: 671: controller = nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx#vmhba66#10.10.xxx.xxx:xxxx
YYYY-MM-DDT19:22:04.755Z cpu10:2097312)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L21" state changed from "dead" to "active"
YYYY-MM-DDT19:18:44.685Z Wa(180) vmkwarning: cpu2:2122519)WARNING: NvmeDiscover: 5489: Mark path vmhba66:C0:T0:L1 as NO_CONNECT
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEPSA:1631 adpater: vmhba66, action: 1
YYYY-MM-DDT19:18:44.685Z Wa(180) vmkwarning: cpu2:2122519)WARNING: NvmeDiscover: 5489: Mark path vmhba66:C0:T0:L21 as NO_CONNECT
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEPSA:1631 adpater: vmhba66, action: 1
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5534 Controller 263, destroy namespace 2
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5570 Destroyed namespace 2, controller 263
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5534 Controller 263, destroy namespace 3
YYYY-MM-DDT19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5570 Destroyed namespace 3, controller 263
YYYY-MM-DDT19:18:45.144Z In(182) vmkernel: cpu4:2097504)HPP: HppNvmeUpdateNamespaces:535: Marking paths dead - controller:nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx#vmhba66#10.10.xxx.xxx:xxxx
YYYY-MM-DDT19:22:01.945Z Wa(180) vmkwarning: cpu21:2122651)WARNING: NVMFDEV:887 Controller 263 found while it's being deleted.
YYYY-MM-DDT19:22:01.945Z In(182) vmkernel: cpu8:2122650)NVMFDEV:172 target type: NVMe
YYYY-MM-DDT19:22:01.945Z In(182) vmkernel: cpu8:2122650)NVMFDEV:180 vmkParams.asqsize: 31
YYYY-MM-DDT19:22:01.945Z Wa(180) vmkwarning: cpu21:2122651)WARNING: NVMFEVT:1070 Failed to connect controller nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx, status: Failure
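The log signatures above can be checked programmatically. The following is a minimal sketch, not a supported tool: it assumes the log lines keep the wording shown in the excerpts, and the function name `summarize` and the marker list are hypothetical names chosen for illustration. It records the last reported HPP state of each path and collects the warnings that accompany a failed reconnect.

```python
import re

# Matches HPP path transitions as they appear in the excerpts above, e.g.
#   Path "vmhba66:C0:T0:L1" state changed from "active" to "dead"
PATH_RE = re.compile(r'Path "([^"]+)" state changed from "(\w+)" to "(\w+)"')

# Warning fragments copied from the excerpts above (assumed to be stable text).
STALE_MARKERS = (
    "found while it's being deleted",
    "Failed to connect controller",
)


def summarize(lines):
    """Return (paths whose last reported state is "dead", matching warning lines)."""
    last_state = {}
    warnings = []
    for line in lines:
        match = PATH_RE.search(line)
        if match:
            path, _old_state, new_state = match.groups()
            # Later lines overwrite earlier ones, so only the final state is kept.
            last_state[path] = new_state
        elif any(marker in line for marker in STALE_MARKERS):
            warnings.append(line.strip())
    dead_paths = sorted(p for p, state in last_state.items() if state == "dead")
    return dead_paths, warnings
```

To use it, pass any iterable of log lines, for example an open file handle on a copy of /var/run/log/vmkernel.log. Paths that remain in the first returned list after the array controller is back online, together with entries in the second list, are consistent with the symptom described in this article.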
VMware ESXi 8.0 U3
ESXi 8.0 supports persistent discovery controllers. An unhandled scenario in the driver leaves a stale controller on the host, which causes this issue. ESXi 7.0.3 is not affected because it does not support persistent discovery controllers.
To check whether a persistent discovery controller exists, run the command: esxcli nvme controller list
The issue is fixed in ESXi 8.0 Update 3e.
Release notes: VMware ESXi 8.0 Update 3e Release Notes