Shared Datastores Missing on an ESXi Host with storage adapters in Offline state

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms

The host connects to storage over Fibre Channel (FC), and the storage adapters are in an offline state.

Issue Validation

One or more hosts may be affected by the issue.
The driver and firmware versions installed for the impacted adapter are supported with the ESXi version installed on the host and are the same as those installed on hosts where no issues are reported.

For reference on how to check the versions, please refer to Determining Network/Storage firmware and driver version in ESXi.
A storage rescan was performed, but the adapter status remains offline.
The vmhba was reset from the ESXi host with the command below, but the issue persists:

# esxcli storage san fc reset -A vmhbaX

A host reboot does not resolve the issue.

Environment

VMware vSphere ESXi 7.x

VMware vSphere ESXi 8.x

Cause

These symptoms indicate a problem outside ESXi, most likely in fabric-layer connectivity.

Cause Validation

The /var/log/vmkernel.log file on the impacted ESXi host shows H:0x1 events (host status VMK_SCSI_HOST_NO_CONNECT), which indicate physical connectivity issues:

YYYY-MM-DDTHH:MM:SS.9862 cpu33:2098295) NMP: nmp_ThrottleLogForDevice:3861: 0x0 (0x45c9c0fb5e40, 0) to dev "naa.XXXXXXXXXXXXXXXXXXXXXXX" on path "vmhba64:C0:T3:L46" Failed:
YYYY-MM-DDTHH:MM:SS.9862 cpu33:2098295) NMP: nmp_ThrottleLogForDevice:3869: H:0x1 D:0x0 P:0x0 Act:NONE. cmdId.initiator=0x4539be39bc48 CmdSN 0x0
YYYY-MM-DDTHH:MM:SS.9862 cpu63:2097821) ScsiVmas: 1057: Inquiry for VPD page 00 to device naa.XXXXXXXXXXXXXXXXXXXXXXXX failed with error No connection
YYYY-MM-DDTHH:MM:SS.9862 cpu28:159343769) qedf:vmhba64:qedfc_transport_check:1124:Error: [0:2:46]:returning VMK_SCSI_HOST_NO_CONNECT: cmd 0:0:0:0:0
YYYY-MM-DDTHH:MM:SS cpu63:2097821) ScsiVmas: 1057: Inquiry for VPD page 00 to device naa.XXXXXXXXXXXXXXXXXX failed with error No connection
YYYY-MM-DDTHH:MM:SS42 cpu20:159343769) WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1937: Could not select path for device "naa.XXXXXXXXXXXXXXXXXXXXXXXX"
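As a rough illustration of this check, the sketch below tallies the host status (the H: field) of NMP entries like those above. The function name host_status_counts and the abbreviated sample lines are hypothetical and used only for illustration; on a real host the input would be the lines of /var/log/vmkernel.log.

```python
import re
from collections import Counter

# Matches the SCSI status triplet that NMP logs, e.g. "H:0x1 D:0x0 P:0x0".
STATUS_RE = re.compile(r"\bH:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)")


def host_status_counts(lines):
    """Return a Counter mapping each host status value to its number of occurrences."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


# Abbreviated, hypothetical sample lines modelled on the excerpt above.
sample = [
    "YYYY-MM-DDTHH:MM:SS cpu33:2098295) NMP: nmp_ThrottleLogForDevice:3869: "
    "H:0x1 D:0x0 P:0x0 Act:NONE.",
    "YYYY-MM-DDTHH:MM:SS cpu63:2097821) ScsiVmas: 1057: Inquiry for VPD page 00 "
    "failed with error No connection",
]

counts = host_status_counts(sample)
print(f"H:0x1 events: {counts[0x1]}")
```

A high count of H:0x1 entries on the paths of a single vmhba is consistent with the connectivity problem described here, but it is not a diagnosis on its own.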

In addition, the log shows FLOGI timeouts, which indicate issues at the fabric layer. FLOGI (Fabric Login) is the process that occurs when a server or Host Bus Adapter (HBA) powers on and initiates communication with a locally attached Fibre Channel (FC) switch. The process begins when the server or HBA sends a FLOGI request to the switch. When the switch receives the FLOGI request, it assigns a 24-bit FCID (Fibre Channel ID) to the server or HBA. Similar to an IP address, the FCID helps to route traffic between the switch and storage.

YYYY-MM-DDTHH:MM:SS.0272 cpu66:2090327) WARNING: ql_fcoe:vmhba64: FipFabricLoginCompletion:1820 FipFabricLoginCompletion: Fabric 0x4311cca76000 FIP FLOGI timeout ExchangeTries 1
YYYY-MM-DDTHH:MM:SS.0272 cpu66:2098327) ql_fcoe:vmhba64:FabricLoginCompletion: 1358: Info: FabricLoginCompletion: Fabric 0x4311cca76000 FLOGI/FDISC status Timeout
YYYY-MM-DDTHH:MM:SS.0272 cpu66:2098327) ql_fcoe:vmhba64: FipFabricLoginCompletion:1814: Info: Called for Fabric: 10:00:d8:1f:xx:xx:xx:xx
YYYY-MM-DDTHH:MM:SS.0272 cpu66:2098327) WARNING: ql_fcoe:vmhba64: FipFabricLoginCompletion:1833: Status not ok bad0001
YYYY-MM-DDTHH:MM:SS.0272 cpu66:2098327) ql_fcoe:vmhba64:FabricLoginCompletion:1358:Info: FabricLoginCompletion: Fabric 0x4311cca76000 FLOGI/FDISC status failure
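The FLOGI entries above can be picked out with a simple keyword filter. The following is a minimal sketch, assuming that a failed login is logged on a line containing "FLOGI" followed by "timeout" or "failure"; the function name and the abbreviated sample lines are hypothetical, and other drivers or versions may word these messages differently.

```python
import re

# Assumption: a failed fabric login line mentions FLOGI and then "timeout" or "failure".
FLOGI_FAIL_RE = re.compile(r"FLOGI.*\b(timeout|failure)\b", re.IGNORECASE)
ADAPTER_RE = re.compile(r"\b(vmhba\d+)\b")


def flogi_failures_by_adapter(lines):
    """Count FLOGI timeout/failure lines per vmhba name found on the same line."""
    result = {}
    for line in lines:
        if FLOGI_FAIL_RE.search(line):
            match = ADAPTER_RE.search(line)
            adapter = match.group(1) if match else "unknown"
            result[adapter] = result.get(adapter, 0) + 1
    return result


# Abbreviated, hypothetical sample lines modelled on the excerpt above.
sample = [
    "WARNING: ql_fcoe:vmhba64: FipFabricLoginCompletion:1820 FIP FLOGI timeout ExchangeTries 1",
    "ql_fcoe:vmhba64:FabricLoginCompletion:1358:Info: FLOGI/FDISC status Timeout",
    "ql_fcoe:vmhba64: FipFabricLoginCompletion:1814: Info: Called for Fabric: 10:00:d8:1f:xx:xx:xx:xx",
    "ql_fcoe:vmhba64:FabricLoginCompletion:1358:Info: FLOGI/FDISC status failure",
]

print(flogi_failures_by_adapter(sample))
```

Grouping by adapter makes it easier to tell the fabric vendor which HBA ports are failing to log in.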

An Emulex HBA logs FLOGI failure messages such as the following:

YYYY-MM-DDTHH:MM:SS.0272 cpu66:2090327) vmkwarning: cpu8:2098244)WARNING: lpfc: lpfc_process_flogi_failure:954: vmhba4 2858 FLOGI failure Status:x3/x103 TMO:x14 flag:x810020 x1a81800 x0 IDs:abc501:0

FLOGI matters in a SAN because it establishes communication between devices and enables proper traffic routing. Without a successful FLOGI, a device cannot join the fabric or reach its storage, which is why the adapters remain offline.
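To make the 24-bit FCID concrete, the sketch below splits one into its three 8-bit fields. It assumes the standard switched-fabric address layout (Domain ID, Area ID, Port ID, from most to least significant byte); the value 0x0A1B2C and the function name split_fcid are hypothetical and are not taken from the logs above.

```python
def split_fcid(fcid):
    """Split a 24-bit FCID into (domain, area, port), assuming the standard layout."""
    if not 0 <= fcid <= 0xFFFFFF:
        raise ValueError("an FCID is a 24-bit value")
    domain = (fcid >> 16) & 0xFF  # identifies the switch in the fabric
    area = (fcid >> 8) & 0xFF     # typically a port group or port on that switch
    port = fcid & 0xFF            # the individual port or device
    return domain, area, port


# Hypothetical FCID used only for illustration.
domain, area, port = split_fcid(0x0A1B2C)
print(f"domain=0x{domain:02x} area=0x{area:02x} port=0x{port:02x}")
# prints "domain=0x0a area=0x1b port=0x2c"
```

Because the switch assigns this address during FLOGI, an HBA whose login times out has no FCID and cannot exchange traffic with the storage.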

Resolution

Engage the fabric vendor to investigate and resolve the issue within the fabric layer. After addressing the problem, reboot the affected hosts to refresh adapter information.