ESXi 8.x host fails to recover NVMe over TCP controller upon a reboot of the array controller.

Article ID: 385701

Updated On:

Products

VMware vSphere ESXi 8.0

Issue/Introduction

When an NVMe over TCP controller on the array is rebooted for any reason, such as a firmware upgrade, an ESXi 8.x host fails to recover the dead paths, unlike a 7.x host.

  • The VMkernel log from a 7.x host shows paths being marked dead when a controller on the array goes down during a firmware upgrade:
    2025-01-02T19:18:46.215Z cpu0:2097464)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L21" state changed from "active" to "dead"
    2025-01-02T19:18:46.216Z cpu17:2097464)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L1" state changed from "active" to "dead"

The paths recover once the array controller is back after being down for a short period:
2025-01-02T19:21:43.774Z cpu0:2102437)WARNING: NVMEDEV:9569 Controller 263 recovery already active.
2025-01-02T19:21:55.526Z cpu0:2102332)WARNING: NVMEDEV:9569 Controller 257 recovery already active.

2025-01-02T19:22:04.753Z cpu8:2102223)NvmeDiscover: 671: controller = nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx#vmhba66#10.10.xxx.xxx:xxxx
2025-01-02T19:22:04.753Z cpu1:2097312)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L1" state changed from "dead" to "active"
2025-01-02T19:22:04.755Z cpu8:2102223)NvmeDiscover: 671: controller = nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx#vmhba66#10.10.xxx.xxx:xxxx
2025-01-02T19:22:04.755Z cpu10:2097312)HPP: HppPathGroupMovePath:644: Path "vmhba66:C0:T0:L21" state changed from "dead" to "active"
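To check whether paths returned to "active" after a controller reboot, the HPP state-change messages shown above can be tallied per path. The following is a minimal Python sketch, assuming the log lines keep the `HppPathGroupMovePath` format quoted in this article. The function names `last_path_states` and `dead_paths` are hypothetical helpers, not part of any VMware tooling.

```python
import re

# Matches HPP path state-change messages in the format quoted in this article.
PATH_STATE_RE = re.compile(
    r'^(?P<ts>\S+).*HppPathGroupMovePath:\d+: Path "(?P<path>[^"]+)" '
    r'state changed from "(?P<old>\w+)" to "(?P<new>\w+)"'
)


def last_path_states(lines):
    """Return {path: (timestamp, state)} for the latest transition seen per path."""
    states = {}
    for line in lines:
        match = PATH_STATE_RE.search(line.strip())
        if match:
            states[match.group("path")] = (match.group("ts"), match.group("new"))
    return states


def dead_paths(lines):
    """Return the sorted list of paths whose most recent logged state is "dead"."""
    return sorted(
        path for path, (_, state) in last_path_states(lines).items()
        if state == "dead"
    )
```

Passing the lines of a VMkernel log file to `dead_paths` returns an empty list when every path that went dead was later reported active again, which is the 7.x behavior shown above.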

 

  • However, on 8.0.3, the host fails to reconnect to the array controller that was down for a short period:
    2025-01-02T19:18:44.685Z Wa(180) vmkwarning: cpu2:2122519)WARNING: NvmeDiscover: 5489: Mark path vmhba66:C0:T0:L1 as NO_CONNECT
    2025-01-02T19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEPSA:1631 adpater: vmhba66, action: 1
    2025-01-02T19:18:44.685Z Wa(180) vmkwarning: cpu2:2122519)WARNING: NvmeDiscover: 5489: Mark path vmhba66:C0:T0:L21 as NO_CONNECT
    2025-01-02T19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEPSA:1631 adpater: vmhba66, action: 1

2025-01-02T19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5534 Controller 263, destroy namespace 2
2025-01-02T19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5570 Destroyed namespace 2, controller 263
2025-01-02T19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5534 Controller 263, destroy namespace 3
2025-01-02T19:18:44.685Z In(182) vmkernel: cpu2:2122519)NVMEDEV:5570 Destroyed namespace 3, controller 263

2025-01-02T19:18:45.144Z In(182) vmkernel: cpu4:2097504)HPP: HppNvmeUpdateNamespaces:535: Marking paths dead - controller:nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx#vmhba66#10.10.xxx.xxx:xxxx

2025-01-02T19:22:01.945Z Wa(180) vmkwarning: cpu21:2122651)WARNING: NVMFDEV:887 Controller 263 found while it's being deleted.
2025-01-02T19:22:01.945Z In(182) vmkernel: cpu8:2122650)NVMFDEV:172 target type:    NVMe
2025-01-02T19:22:01.945Z In(182) vmkernel: cpu8:2122650)NVMFDEV:180 vmkParams.asqsize:   31
2025-01-02T19:22:01.945Z Wa(180) vmkwarning: cpu21:2122651)WARNING: NVMFEVT:1070 Failed to connect controller nqn.2010-06.com.xxxxxxxxxxx:flasharray.xxxxxxxxxxxxxxx, status: Failure
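A quick way to screen a VMkernel log for this symptom is to search for the message fragments quoted above. The sketch below is a minimal Python example. The fragment strings are copied from the excerpts in this article, while the names `SIGNATURES`, `find_signatures`, and `matches_issue` are hypothetical. Treating the stale-controller warning together with the connect failure as the decisive combination is an assumption, not a documented diagnostic rule.

```python
# Message fragments taken from the 8.0.3 log excerpt in this article.
SIGNATURES = {
    "no_connect": "as NO_CONNECT",
    "stale_controller": "found while it's being deleted",
    "connect_failed": "Failed to connect controller",
}


def find_signatures(lines):
    """Group the log lines that contain each known message fragment."""
    hits = {name: [] for name in SIGNATURES}
    for line in lines:
        for name, fragment in SIGNATURES.items():
            if fragment in line:
                hits[name].append(line.strip())
    return hits


def matches_issue(lines):
    """Assumption: a stale-controller warning plus a connect failure indicates the issue."""
    hits = find_signatures(lines)
    return bool(hits["stale_controller"]) and bool(hits["connect_failed"])
```

A match only suggests that the log resembles the excerpt above; the surrounding timestamps and controller numbers should still be reviewed by hand.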

Environment

VMware vSphere 8.0 U3

Cause

ESXi 8.0 provides persistent discovery controller support. The host runs into this issue because of an unhandled scenario in the driver, which leaves a stale controller in the system.
ESXi 7.0.3 does not run into this issue because it does not have persistent discovery controller support.

The existence of a persistent discovery controller can be verified with the command: esxcli nvme controller list
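If the output of that command has been saved to a file, the discovery controller entries can be picked out with a short filter. This Python sketch rests on two assumptions that are not confirmed by this article: that each controller's name in the output embeds its NQN (as the `controller = nqn...#vmhba66#...` log lines above suggest), and that the target uses the well-known discovery NQN defined by the NVMe over Fabrics specification. Some arrays present a unique discovery NQN instead, in which case `DISCOVERY_NQN` would need to be changed. The function name is hypothetical.

```python
# Well-known discovery service NQN from the NVMe over Fabrics specification.
# Assumption: the target does not use a unique discovery NQN.
DISCOVERY_NQN = "nqn.2014-08.org.nvmexpress.discovery"


def discovery_controller_lines(output):
    """Return the lines of saved command output that mention the discovery NQN."""
    return [
        line.strip()
        for line in output.splitlines()
        if DISCOVERY_NQN in line
    ]
```

An empty result means no entry in the saved output mentions the well-known discovery NQN; it does not by itself prove that no persistent discovery controller exists.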

Resolution

VMware is aware of the issue. A fix will be provided in a future release.