ESXi hosts lose connectivity to NVMe-over-TCP datastores after NetApp ONTAP upgrade

Article ID: 435300

Updated On:

Products

VMware vSphere ESXi 8.0

VMware vSphere ESXi

Issue/Introduction

After upgrading NetApp ONTAP storage to version 9.15.1P15, ESXi hosts may lose all connectivity to NVMe-over-TCP datastores. 

You might see the following symptoms:

  • The storage devices appear as "Dead" or "Error" in the vSphere Client.
  • The NVMe controllers are reported as offline.

Environment

VMware vSphere ESXi 7.x

VMware vSphere ESXi 8.x

Cause

  • This is caused by an incomplete NVMe-oF session teardown during the storage upgrade, a known bug in the ONTAP NVMe/TCP stack documented in NetApp KB CONTAP-503989. (A NetApp account is required to view the linked KB.)
  • This timing issue causes the ESXi host to hang on a stale path and fail to re-establish the connection automatically.

 

Resolution

To restore connectivity without rebooting the ESXi host, manually clear the hung sessions and re-establish the NVMe fabric connections using the following steps:
 
1. Connect to the ESXi host using SSH.
2. Identify the affected controllers (those with "Is Online" set to false):
Example:

 

[root@hostname:~] esxcli nvme controller list

The output would be similar to the one below:

Name                                                                                                                          Controller Number  Adapter  Transport Type  Is Online  Controller Type  Is VVOL  Keep Alive Timeout  IO Queue Number  IO Queue Size
----------------------------------------------------------------------------------------------------------------------------  -----------------  -------  --------------  ---------  ---------------  -------  ------------------  ---------------  -------------
nqn.1992-08.com.netapp:sn.399xxxxx:subsystem#vmhba64#<target_IP>:4420                256  vmhba64  TCP                 false  I/O                false                  10                4             32
nqn.1992-08.com.netapp:sn.399xxxxxxx:subsystem#vmhba65#<target_IP>:4420               257  vmhba65  TCP                 false  I/O                false                  10                4             32

3. Disconnect the stale fabric connection for each affected controller:
[root@hostname:~] esxcli nvme fabrics disconnect -a <Adapter_Name> -s <Subsystem_NQN>

Example: esxcli nvme fabrics disconnect -a vmhba64 -s nqn.1992-08.com.netapp:sn.399xxxxx:subsystem

4. Re-establish the connection using the specific storage IP and port (default 4420):

[root@hostname:~] esxcli nvme fabrics connect -a <Adapter_Name> -i <target_IP> -p 4420 -s <Subsystem_NQN>

Example: esxcli nvme fabrics connect -a vmhba64 -i <target_IP> -p 4420 -s nqn.1992-08.com.netapp:sn.399xxxxx:subsystem

 

5. Rescan the storage adapters:

esxcli storage core adapter rescan --all

 

6. Verify that the storage controllers now appear online ("Is Online" is true):

[root@hostname:~] esxcli nvme controller list

Name                                                                                                                          Controller Number  Adapter  Transport Type  Is Online  Controller Type  Is VVOL  Keep Alive Timeout  IO Queue Number  IO Queue Size
----------------------------------------------------------------------------------------------------------------------------  -----------------  -------  --------------  ---------  ---------------  -------  ------------------  ---------------  -------------
nqn.1992-08.com.netapp:sn.399xxxxx:subsystem#vmhba64#<target_IP>:4420                    258  vmhba64  TCP                  true  I/O                false                  10                2            128
nqn.1992-08.com.netapp:sn.399xxxxx:subsystem#vmhba65#<target_IP>:4420                    259  vmhba65  TCP                  true  I/O                false                  10                2            128
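
When several controllers are offline, typing the disconnect and connect commands for each one by hand is error-prone. The following is a minimal sketch, not a supported tool, that reads the output of esxcli nvme controller list on standard input and prints (without running) the disconnect and connect commands from steps 3 and 4 for every offline TCP controller. The function name gen_nvme_reconnect_cmds is hypothetical. The sketch assumes that the Name column has the form <Subsystem_NQN>#<Adapter_Name>#<target_IP>:<port>, as in the example output above, and that the column order matches that output.

```shell
#!/bin/sh
# Sketch: print (do not run) the fabric disconnect/connect commands for every
# offline NVMe/TCP controller listed on stdin.
# Assumption: the Name column looks like <NQN>#<adapter>#<IP>:<port>.
gen_nvme_reconnect_cmds() {
    awk '
        # Data rows only: field 4 is the transport, field 5 is "Is Online".
        $4 == "TCP" && $5 == "false" {
            n = split($1, part, "#")
            if (n != 3) next                 # unexpected Name format, skip row
            nqn = part[1]; adapter = part[2]; addr = part[3]

            # Split <IP>:<port> at the last colon.
            colon = 0
            for (i = length(addr); i > 0; i--) {
                if (substr(addr, i, 1) == ":") { colon = i; break }
            }
            if (colon == 0) next             # no port found, skip row
            ip   = substr(addr, 1, colon - 1)
            port = substr(addr, colon + 1)

            printf "esxcli nvme fabrics disconnect -a %s -s %s\n", adapter, nqn
            printf "esxcli nvme fabrics connect -a %s -i %s -p %s -s %s\n", adapter, ip, port, nqn
        }
    '
}
```

Example usage on the host: esxcli nvme controller list | gen_nvme_reconnect_cmds. Review the printed commands before running them, then rescan the storage adapters as in step 5. Printing rather than executing is deliberate, so that each command can be checked against the affected adapter and subsystem first.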