ESXi hosts lose connectivity to NVMe-over-TCP datastores after NetApp ONTAP upgrade

Products

VMware vSphere ESXi 8.0

VMware vSphere ESXi

Issue/Introduction

After upgrading NetApp ONTAP storage to version 9.15.1P15, ESXi hosts may lose all connectivity to NVMe-over-TCP datastores.

You might see the following symptoms:

The storage devices appear as "Dead" or "Error" in the vSphere Client.
The NVMe controllers report as offline.

Environment

VMware vSphere ESXi 7.x

VMware vSphere ESXi 8.x

Cause

This issue is caused by an incomplete NVMe-oF session teardown during the storage upgrade, a known bug in the ONTAP NVMe/TCP stack documented in NetApp KB CONTAP-503989. (A NetApp account is required to view this KB.)
This timing issue causes the ESXi host to hang on a stale path and fail to automatically re-establish a connection.

Resolution

To restore connectivity without rebooting the ESXi host, manually clear the hung sessions and re-establish the NVMe fabric connections using the commands below:

1. Connect to the ESXi host using SSH.

2. Identify the affected controllers:

Example:

[root@hostname:~] esxcli nvme controller list

The output is similar to the following. Affected controllers show false in the Is Online column:

Name                                                                                                                          Controller Number  Adapter  Transport Type  Is Online  Controller Type  Is VVOL  Keep Alive Timeout  IO Queue Number  IO Queue Size
----------------------------------------------------------------------------------------------------------------------------  -----------------  -------  --------------  ---------  ---------------  -------  ------------------  ---------------  -------------
nqn.1992-08.com.netapp:sn.399xxxxx:subsystem#vmhba64#<target_IP>:4420                256  vmhba64  TCP                 false  I/O                false                  10                4             32
nqn.1992-08.com.netapp:sn.399xxxxxxx:subsystem#vmhba65#<target_IP>:4420               257  vmhba65  TCP                 false  I/O                false                  10                4             32
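
On hosts with many controllers, the offline entries can be filtered out of this listing. The following is a minimal sketch, not a supported tool: it assumes awk is available in the ESXi shell and that the columns appear in the order shown above (Is Online as the fifth whitespace-separated field). The function name list_offline_controllers is hypothetical.

```shell
# Print "<adapter> <controller name>" for every controller whose
# "Is Online" column reads false.
# Assumption: input is the text of `esxcli nvme controller list`,
# with the column order shown in the example output above.
list_offline_controllers() {
    # Field 1 = Name, field 3 = Adapter, field 5 = Is Online.
    # Header and separator rows never have "false" in field 5,
    # so they are skipped without special handling.
    awk '$5 == "false" { print $3, $1 }'
}

# Usage on the host:
#   esxcli nvme controller list | list_offline_controllers
```

Each output line gives the adapter and the full controller name needed for the next steps.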

3. Disconnect the stale fabric connection:

[root@hostname:~] esxcli nvme fabrics disconnect -a <Adapter_Name> -s <Subsystem_NQN>

Example: esxcli nvme fabrics disconnect -a vmhba64 -s nqn.1992-08.com.netapp:sn.399xxxxx:subsystem

4. Re-establish the connection using the specific storage IP and port (default 4420):

esxcli nvme fabrics connect -a <vmhba_adapter> -i <target_IP> -p 4420 -s <subsystem_nqn>

Example: esxcli nvme fabrics connect -a vmhba64 -i <target_IP> -p 4420 -s nqn.1992-08.com.netapp:sn.399xxxxx:subsystem
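
The arguments for steps 3 and 4 can be read directly from the controller name in step 2, which appears to follow the pattern <subsystem_nqn>#<adapter>#<target_IP>:<port>. The sketch below assumes that pattern holds and only prints the two commands for review; it does not run them. The function name print_reconnect_commands is hypothetical.

```shell
# Print (do not execute) the disconnect/connect pair for one controller.
# Assumption: the controller name has the form
#   <subsystem_nqn>#<adapter>#<target_IP>:<port>
print_reconnect_commands() {
    name=$1
    nqn=${name%%#*}        # everything before the first '#'
    rest=${name#*#}        # <adapter>#<target_IP>:<port>
    adapter=${rest%%#*}    # text between the two '#' separators
    target=${rest#*#}      # <target_IP>:<port>
    ip=${target%:*}        # strip the trailing ":<port>"
    port=${target##*:}     # keep only the text after the last ':'
    echo "esxcli nvme fabrics disconnect -a $adapter -s $nqn"
    echo "esxcli nvme fabrics connect -a $adapter -i $ip -p $port -s $nqn"
}

# Usage:
#   print_reconnect_commands '<controller name from step 2>'
```

Printing the commands first makes it possible to confirm the adapter, NQN, IP and port before changing any live connection.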

5. Rescan the storage adapters:

esxcli storage core adapter rescan --all

6. Verify that the storage controllers appear online:

[root@hostname:~] esxcli nvme controller list

Name                                                                                                                          Controller Number  Adapter  Transport Type  Is Online  Controller Type  Is VVOL  Keep Alive Timeout  IO Queue Number  IO Queue Size
----------------------------------------------------------------------------------------------------------------------------  -----------------  -------  --------------  ---------  ---------------  -------  ------------------  ---------------  -------------
nqn.1992-08.com.netapp:sn.399xxxxx:subsystem#vmhba64#<target_IP>:4420                    258  vmhba64  TCP                  true  I/O                false                  10                2            128
nqn.1992-08.com.netapp:sn.399xxxxx:subsystem#vmhba65#<target_IP>:4420                    259  vmhba65  TCP                  true  I/O                false                  10                2            128
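
This final check can also be reduced to a pass/fail result. The sketch below makes the same assumptions as before (awk present in the ESXi shell, Is Online as the fifth column); the function name all_controllers_online is hypothetical.

```shell
# Exit with status 0 when no controller reports "false" in the
# "Is Online" column, and with status 1 otherwise.
# Assumption: input is the text of `esxcli nvme controller list`.
all_controllers_online() {
    awk '$5 == "false" { bad = 1 } END { exit bad + 0 }'
}

# Usage on the host:
#   esxcli nvme controller list | all_controllers_online && echo "all online"
```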