ESXi Host Management Unresponsive and VM Deadlock due to Fibre Channel SFP Degradation

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

Multiple virtual machines (VMs) become unresponsive and cannot be reached over the network (ping, SSH, RDP).
Attempts to open the VM console fail, or the console appears frozen.
Log Evidence (/var/log/vmkernel.log):
- Received RSCN followed by SCSI sense data 0x6 0x29 0x07 (Unit Attention, I_T Nexus Loss Occurred):
  2026-03-27T20:46:22.456Z In(182) vmkernel: cpu32:2097933)lpfc : vmhba2 lpfc_issue_gidft:2188: fc4 type 1
  2026-03-27T20:46:22.468Z In(182) vmkernel: cpu32:2097933)lpfc: lpfc_cmpl_els_prli:2050: vmhba2 0103 PRLI completes to NPort x140200 Data: x0 xf0160 x14 x0 x1
  2026-03-27T20:46:22.468Z In(182) vmkernel: cpu32:2097933)lpfc : vmhba2 lpfc_cmpl_prli_prli_issue:1915: 6028 FCP NPR PRLI Cmpl DID 140200 Init 1 Tgt 1 EIP 1 AccCode x1
- Abort Storms and HBA Buffer exhaustion (XRI Starvation):
  2026-03-27T21:00:00.977Z Wa(180) vmkwarning: cpu8:7971965)WARNING: lpfc : vmhba2 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45d9a3566430, xri x373
  2026-03-27T21:00:01.002Z Wa(180) vmkwarning: cpu16:7971967)WARNING: lpfc : vmhba2 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45d9a331a430, xri x1c0
  2026-03-27T21:00:01.238Z Wa(180) vmkwarning: cpu8:7971965)WARNING: lpfc : vmhba2 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45d9a330b430, xri x1cf
- Impacted VM events:
  2026-03-27T21:02:47.962Z In(182) vmkernel: cpu19:2097635)VSCSI: 3772: handle 9356242661154838(GID:8214)(vscsi0:1):processing reset for handle ... state 1381192706
  2026-03-27T21:02:47.962Z In(182) vmkernel: cpu19:2097635)VSCSI: 3772: handle 9356341441208332(GID:8204)(vscsi0:0):processing reset for handle ... state 1381192706
  2026-03-27T21:02:47.962Z In(182) vmkernel: cpu19:2097635)VSCSI: 3879: handle 9356341441208332(GID:8204)(vscsi0:0):Reset [Retries: 1/0] from (vmm0:vmname)
- Host Heartbeat (HBX) timeouts on seemingly unrelated datastores:
  HBX: 3089: 'DATASTORE_NAME': HB at offset ... - Waiting for timed out HB
- Mailbox timeout (0x20004a01) due to buffer exhaustion (0x218):
  2026-03-27T20:59:03.586Z Wa(180) vmkwarning: cpu50:2097919)WARNING: lpfc : vmhba2 lpfc_sli4_eratt_read:8275: 2885 Port Status Event: port status reg 0x81800000, port smphr reg 0xc000, error 1=0x20004a01, error 2=0x218
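
To check a copy of vmkernel.log for the signatures listed above, a small script along the following lines may help. It is a minimal sketch, not a supported tool: the regular expressions are derived only from the sample lines in this article, so the assumption is that other lpfc driver versions format these messages the same way. The function name summarize and the result keys are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Patterns derived from the sample log lines in this article (assumption:
# the lpfc message IDs 3111 and 2885 keep this layout on your driver version).
ABORT_PENDING = re.compile(
    r"lpfc\s*:\s*(vmhba\d+)\s+lpfc_validate_fcp_abort:\d+:\s*3111\b"
    r".*?xri\s+(x[0-9a-fA-F]+)"
)
PORT_STATUS = re.compile(
    r"lpfc\s*:\s*(vmhba\d+)\s+lpfc_sli4_eratt_read:\d+:\s*2885\b"
    r".*?error 1=(0x[0-9a-fA-F]+),\s*error 2=(0x[0-9a-fA-F]+)"
)
HB_TIMEOUT = re.compile(r"HBX: \d+: '([^']+)'.*Waiting for timed out HB")


def summarize(lines):
    """Count the symptom signatures found in an iterable of log lines."""
    pending_aborts = Counter()      # HBA -> number of "abort still pending" warnings
    stuck_xris = defaultdict(set)   # HBA -> distinct XRIs named in those warnings
    port_errors = []                # (HBA, error 1, error 2) from Port Status Events
    hb_timeouts = Counter()         # datastore -> number of heartbeat timeouts

    for line in lines:
        m = ABORT_PENDING.search(line)
        if m:
            pending_aborts[m.group(1)] += 1
            stuck_xris[m.group(1)].add(m.group(2))
            continue
        m = PORT_STATUS.search(line)
        if m:
            port_errors.append(m.groups())
            continue
        m = HB_TIMEOUT.search(line)
        if m:
            hb_timeouts[m.group(1)] += 1

    return {
        "pending_aborts": dict(pending_aborts),
        "stuck_xris": dict(stuck_xris),
        "port_errors": port_errors,
        "hb_timeouts": dict(hb_timeouts),
    }
```

Calling summarize on an open file object (for example one opened with errors="replace") processes the log line by line. A high pending-abort count on a single vmhba, together with heartbeat timeouts on datastores served by other arrays, is consistent with the pattern described in this article.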

Environment

VMware vSphere ESXi 8.x
VMware vSphere ESX 9.x

Cause

This issue is caused by Fibre Channel port flapping resulting from a degraded SFP transceiver on the fabric switch.

Unlike a "Hard Failure" (where the link goes down), a degraded SFP may maintain physical synchronization (laser-active) while intermittently corrupting or dropping Fibre Channel frames.

This creates a catastrophic condition for the ESXi storage stack:

NMP Inability to Failover: Because the physical link state remains "Up," the Native Multipathing Plugin (NMP) does not mark the path as "Dead."
The Abort Rule: Per T10 SCSI standards, ESXi cannot retry a timed-out I/O on a healthy path until it receives an "Abort Successful" confirmation from the array on the original path.
HBA Queue Exhaustion: Since the degraded SFP drops the "Abort Request" frames, the host's HBA hardware buffers become permanently occupied by pending aborts.
Noisy Neighbor Impact: Once the HBA's hardware queues are 100% saturated, the HBA cannot process I/O for any storage array, including healthy targets sharing the same physical HBA.
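
The chain of events above can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of the lpfc driver internals: the pool size, the one-I/O-per-tick pacing, and the function name simulate_xri_starvation are all assumptions made for illustration. It shows only the core idea that exchange resources (XRIs) held by aborts that never complete eventually leave nothing for I/O to a healthy target on the same HBA.

```python
def simulate_xri_starvation(pool_size, ticks):
    """Toy model of a shared HBA exchange (XRI) pool.

    Each tick the host issues one I/O toward the degraded path and one
    toward a healthy target. Both draw from the same pool.
    Returns (stuck_xris, healthy_ok), where healthy_ok[i] tells whether
    the healthy I/O could be issued on tick i.
    """
    free = pool_size
    stuck = 0
    healthy_ok = []

    for _ in range(ticks):
        # I/O on the degraded path: it times out, the abort frame is
        # dropped, and the XRI is never released back to the pool.
        if free > 0:
            free -= 1
            stuck += 1

        # I/O to the healthy target: needs a free XRI, completes within
        # the tick, and releases the XRI again.
        if free > 0:
            free -= 1  # allocate
            free += 1  # complete and release
            healthy_ok.append(True)
        else:
            healthy_ok.append(False)

    return stuck, healthy_ok


if __name__ == "__main__":
    # Illustrative numbers only; real pool sizes are far larger.
    stuck, healthy_ok = simulate_xri_starvation(pool_size=4, ticks=6)
    print("XRIs held by pending aborts:", stuck)
    print("Healthy-target I/O issued per tick:", healthy_ok)
```

In this model the healthy target is served normally until the last free XRI is consumed by a pending abort, after which every I/O to it fails to start even though its own path and array are fine.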

Resolution

Engage the SAN vendor to investigate the fabric further.

Zoning Hygiene: Unpresent (mask) and unzone any LUNs that are not actively in use. ESXi periodically probes all presented LUNs (SCSI INQUIRY, opcode 0x12); removing unused LUNs reduces the background I/O that can contribute to queue exhaustion during fabric instability.
Fabric Monitoring: Configure the FC switch fabric to use Port Guard or a similar feature that automatically disables ports whose CRC or encoding error counts exceed defined thresholds.
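
The threshold logic behind such a feature can be sketched as follows. This is only an illustration of the idea, not any switch vendor's CLI, API, or actual algorithm: the port names, the sample counter values, the threshold, and the function name ports_exceeding_threshold are hypothetical, and the cumulative error counters are assumed to have been collected at a fixed interval by some other means.

```python
def ports_exceeding_threshold(counter_samples, max_errors_per_interval):
    """Return the ports whose error counter grew too fast in any interval.

    counter_samples maps a port name to a list of cumulative error
    counter readings (for example CRC errors) taken at a fixed interval.
    """
    flagged = []
    for port, samples in counter_samples.items():
        # Growth of the cumulative counter between consecutive readings.
        deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
        if any(delta > max_errors_per_interval for delta in deltas):
            flagged.append(port)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical readings: the second port shows a burst of errors.
    samples = {
        "port-1": [0, 0, 1, 1],
        "port-2": [10, 250, 900, 900],
    }
    print(ports_exceeding_threshold(samples, max_errors_per_interval=100))
```

Evaluating the per-interval growth rather than the absolute counter value matters here, because a degraded SFP tends to show bursts of errors while the link itself stays up.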