No IO traffic is seen on the switch and storage ports. Toggling the ports resumes the traffic, but the IO traffic stops again after a couple of minutes.

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

No IO traffic flows from the switch or storage ports even though the ports are online in the fabric. If the ports are toggled, traffic resumes for some time before stopping completely.
The following log excerpts are seen when the ports are toggled from the switch.

RSCN received.

YYYY-MM-DDTTT:SS.669Z In(182) vmkernel: cpu70:2098411)lpfc: lpfc_els_rcv_rscn:7907: vmhba4 0214 RSCN received Data: x800220 x0 x4 x1
YYYY-MM-DDTTT:SS:15.669Z In(182) vmkernel: cpu70:2098411)lpfc: lpfc_els_rcv_rscn:7914: vmhba4 5973 RSCN received event x0 : Address format x00 : DID ########

After a couple of minutes, "Power on reset" (sense data 0x6 0x29 0x0) is reported on multiple devices.

YYYY-MM-DDTTT:SS.529Z In(182) vmkernel: cpu42:2098463)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0xa3 (0x##########, 0) to dev "naa.###############" on path "vmhba4:C0:T37:L7" Failed:
YYYY-MM-DDTTT:SS.529Z In(182) vmkernel: cpu42:2098463)NMP: nmp_ThrottleLogForDevice:3898: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0. Act:NONE. cmdId.initiator=0x########### CmdSN 0x0

This is followed by further aborts.

YYYY-MM-DDTTT:SS.781Z In(182) vmkernel: cpu87:2106775)lpfc: lpfc_handle_status:5631: vmhba4 3271: FCP cmd x2a failed <38/3> sid x01e700, did ##########, oxid x2338 iotag xe5e Abort Requested Host Abort Req
YYYY-MM-DDTTT:SS.781Z In(182) vmkernel: cpu87:2098464)NMP: nmp_ThrottleLogForDevice:3842: last error status from device naa.############# repeated 2 times
YYYY-MM-DDTTT:SS.781Z In(182) vmkernel: cpu87:2098464)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x2a (0x45da2a62d780, 2097272) to dev "naa.###################" on path "vmhba4:C0:T38:L3" Failed:
YYYY-MM-DDTTT:SS.781Z In(182) vmkernel: cpu87:2098464)NMP: nmp_ThrottleLogForDevice:3898: H:0x5 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=############ CmdSN ########

After that, lpfc_sli_abts_recover_port and devloss messages appear. lpfc_sli_abts_recover_port is called when there is no response to ABTS-LS (aborts). These are followed by the SCSI errors H:0x2 D:0x8 P:0x0 and H:0x1 D:0x0 P:0x0.

H:0x2 D:0x8 P:0x0 -> Device status 0x8 is returned when a LUN cannot accept SCSI commands at the moment.
H:0x1 D:0x0 P:0x0 -> Host status 0x1 is NO_CONNECT. This status is returned if the connection to the LUN is lost.

YYYY-MM-DDTTT:SS.258Z Wa(180) vmkwarning: cpu59:2098411)WARNING: lpfc: lpfc_sli_abts_recover_port:11869: vmhba4 3094 Start rport recovery on sadapter id 0x3 fc_id ############ vpi 0x0 rpi 0x29 xri 0x2338 state 0x7 flags 0x80000000
YYYY-MM-DDTTT:SS.258Z Wa(180) vmkwarning: cpu59:2098411)WARNING: lpfc: lpfc_start_devloss:4565: vmhba4 3248 Start 10 sec devloss tmo WWPN 20:##:##:##:##:##:##:ca NPort ########
YYYY-MM-DDTTT:SS.258Z In(182) vmkernel: cpu42:5459514)lpfc: lpfc_handle_status:5631: vmhba4 3271: FCP cmd x9e failed <38/2> sid x01e700, did ########, oxid ######## iotag xbe9 Time Out Returning Host Busy
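
The H:/D:/P: triplets quoted in these excerpts can be decoded programmatically when reviewing a large log. The following is a minimal Python sketch, not a VMware tool: the function name decode_status and the lookup tables are illustrative assumptions, and the tables cover only the codes that appear in this article (names as commonly documented for ESXi host status and for SCSI device status).

```python
import re

# Partial table of ESXi host (H:) status codes seen in this article.
# Assumption: names follow the commonly documented ESXi host status list.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x5: "ABORT",
}

# Partial table of SCSI device (D:) status codes.
DEVICE_STATUS = {
    0x0: "GOOD",
    0x2: "CHECK CONDITION",
    0x8: "BUSY",
}

STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def decode_status(line):
    """Return (host_name, device_name, plugin_code) for the first
    H:/D:/P: triplet in the line, or None if the line has none."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin = (int(value, 16) for value in match.groups())
    return (
        HOST_STATUS.get(host, f"unknown host status {host:#x}"),
        DEVICE_STATUS.get(device, f"unknown device status {device:#x}"),
        plugin,
    )
```

For example, passing a line that contains "H:0x1 D:0x0 P:0x0" yields the NO_CONNECT host status with a GOOD device status, which matches the explanation given earlier.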

Further inspection shows PLOGI failures on the ports that were toggled.

YYYY-MM-DDTTT:SS.555Z In(182) vmkernel: cpu71:2098411)lpfc: lpfc_els_retry:4864: vmhba4 0108 No retry ELS command x3 to remote NPORT ####### Retried:3 Error:x3/x2
YYYY-MM-DDTTT:SS:35.555Z Wa(180) vmkwarning: cpu71:2098411)WARNING: lpfc: lpfc_cmpl_els_plogi:2172: vmhba4 2753 PLOGI failure DID:####### Status:x3/x2 State: x1 Ref: 10 Flags: x40008

YYYY-MM-DDTTT:SS.345Z In(182) vmkernel: cpu64:2098411)lpfc: lpfc_els_retry:4864: vmhba4 0108 No retry ELS command x3 to remote NPORT ####### Retried:3 Error:x3/x2
YYYY-MM-DDTTT:SS:11.345Z Wa(180) vmkwarning: cpu64:2098411)WARNING: lpfc: lpfc_cmpl_els_plogi:2172: vmhba4 2753 PLOGI failure DID:####### Status:x3/x2 State: x1 Ref: 10 Flags: x40008

YYYY-MM-DDTTT:SS.804Z In(182) vmkernel: cpu74:2098411)lpfc: lpfc_els_retry:4864: vmhba4 0108 No retry ELS command x3 to remote NPORT ####### Retried:3 Error:x3/x2
YYYY-MM-DDTTT:SS.804Z Wa(180) vmkwarning: cpu74:2098411)WARNING: lpfc: lpfc_cmpl_els_plogi:2172: vmhba4 2753 PLOGI failure DID:####### Status:x3/x2 State: x1 Ref: 10 Flags: x40008

The messages above indicate a failure to establish a connection between a host and a target port in the Storage Area Network (SAN). The 'Status:x3/x2' value suggests that the Host Bus Adapter (HBA) firmware did not receive a response to a Port Login (PLOGI) request. The PLOGI request initiated by the host reached the SAN, which indicates that the connection from the host's side was established. However, the accept (ACC) response to the PLOGI request never reached the affected hosts, so the HBA could not proceed with the Process Login (PRLI) operation that is required for higher-level SCSI communication.

After waiting 20 seconds without a response, the HBA firmware rejected the PLOGI request, which indicates a communication problem between the host and the target-side HBA.
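
To see how many PLOGI failures were logged per adapter and destination, the vmkernel log can be scanned for the "PLOGI failure" message shown above. The following is a minimal Python sketch under stated assumptions: the function name count_plogi_failures is hypothetical, the regular expression is derived only from the lpfc message format quoted in this article, and other driver versions may format the line differently.

```python
import re
from collections import Counter

# Pattern derived from the lpfc "PLOGI failure" lines quoted in this article.
# Assumption: the adapter name, message number, DID, and Status fields
# appear in this order.
PLOGI_RE = re.compile(
    r"(vmhba\d+) \d+ PLOGI failure DID:(\S+) "
    r"Status:(x[0-9a-fA-F]+/x[0-9a-fA-F]+)"
)


def count_plogi_failures(lines):
    """Count PLOGI failures keyed by (adapter, DID, status)."""
    counts = Counter()
    for line in lines:
        match = PLOGI_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

The function accepts any iterable of log lines, for example an open file handle for a copy of vmkernel.log. A count concentrated on a few DIDs with Status x3/x2 is consistent with the unanswered PLOGI behavior described above.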

Environment

VMware vSphere ESXi 8.x

Cause

In a Fibre Channel (FC) fabric, an issue on a single switch port can cause widespread problems that affect other ports and potentially the entire fabric. This is due to the interconnected nature of the fabric and the flow control mechanisms of Fibre Channel. An offending port on the fabric switch that shows errors (CRC, Rx/Tx power) may cause issues for the other ports on the same switch, including those connected to a server HBA or a storage array port.

Resolution

There are no issues with the ESXi HBA. Check the SAN switch ports for possible issues by running 'port show' or a similar command to retrieve the port statistics, and disable any offending port that shows errors.