ESXi Host Management Unresponsive and VM Deadlock due to Fibre Channel SFP Degradation

Article ID: 436435

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms: 

  • Multiple Virtual Machines (VMs) become non-responsive and are inaccessible via network protocols (Ping/SSH/RDP).

  • Attempts to access the VM console are unsuccessful; the console is frozen.

  • Log Evidence (/var/log/vmkernel.log):

    • Received RSCN followed by SCSI sense data 0x6 0x29 0x07 (Unit Attention: I_T Nexus Loss Occurred):

      2026-03-27T20:46:22.456Z In(182) vmkernel: cpu32:2097933)lpfc : vmhba2 lpfc_issue_gidft:2188: fc4 type 1
      2026-03-27T20:46:22.468Z In(182) vmkernel: cpu32:2097933)lpfc: lpfc_cmpl_els_prli:2050: vmhba2 0103 PRLI completes to NPort x140200 Data: x0 xf0160 x14 x0 x1
      2026-03-27T20:46:22.468Z In(182) vmkernel: cpu32:2097933)lpfc : vmhba2 lpfc_cmpl_prli_prli_issue:1915: 6028 FCP NPR PRLI Cmpl DID 140200 Init 1 Tgt 1 EIP 1 AccCode x1

    • Abort Storms and HBA Buffer exhaustion (XRI Starvation):

      2026-03-27T21:00:00.977Z Wa(180) vmkwarning: cpu8:7971965)WARNING: lpfc : vmhba2 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45d9a3566430, xri x373
      2026-03-27T21:00:01.002Z Wa(180) vmkwarning: cpu16:7971967)WARNING: lpfc : vmhba2 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45d9a331a430, xri x1c0
      2026-03-27T21:00:01.238Z Wa(180) vmkwarning: cpu8:7971965)WARNING: lpfc : vmhba2 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45d9a330b430, xri x1cf

    • Impacted VM events:

      2026-03-27T21:02:47.962Z In(182) vmkernel: cpu19:2097635)VSCSI: 3772: handle 9356242661154838(GID:8214)(vscsi0:1):processing reset for handle ... state 1381192706
      2026-03-27T21:02:47.962Z In(182) vmkernel: cpu19:2097635)VSCSI: 3772: handle 9356341441208332(GID:8204)(vscsi0:0):processing reset for handle ... state 1381192706
      2026-03-27T21:02:47.962Z In(182) vmkernel: cpu19:2097635)VSCSI: 3879: handle 9356341441208332(GID:8204)(vscsi0:0):Reset [Retries: 1/0] from (vmm0:vmname)

    • Host Heartbeat (HBX) timeouts on seemingly unrelated datastores:

      HBX: 3089: 'DATASTORE_NAME': HB at offset ... - Waiting for timed out HB

    • Mailbox timeout (0x20004a01) due to buffer exhaustion (0x218):

      2026-03-27T20:59:03.586Z Wa(180) vmkwarning: cpu50:2097919)WARNING: lpfc : vmhba2 lpfc_sli4_eratt_read:8275: 2885 Port Status Event: port status reg 0x81800000, port smphr reg 0xc000, error 1=0x20004a01, error 2=0x218
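
When triaging a large vmkernel.log, it can help to count these signatures per HBA rather than reading line by line. The following is a minimal Python sketch, not a supported tool: the regular expressions are written against the sample lines in this article only, and the function name summarize and the SAMPLE list are illustrative assumptions. Other lpfc driver versions may format these messages differently.

```python
import re
from collections import Counter

# Patterns modeled on the sample log lines in this article (assumption:
# other driver versions may word or space these messages differently).
ABORT_PENDING = re.compile(
    r"lpfc\s*:\s*(?P<hba>vmhba\d+) lpfc_validate_fcp_abort:\d+: 3111 "
    r"Outstanding FCP I/O Abort Request still pending on io_buf \S+ "
    r"xri (?P<xri>x[0-9a-fA-F]+)"
)
PORT_STATUS = re.compile(
    r"lpfc\s*:\s*(?P<hba>vmhba\d+) lpfc_sli4_eratt_read:\d+: 2885 "
    r"Port Status Event: .*error 1=(?P<err1>0x[0-9a-fA-F]+), "
    r"error 2=(?P<err2>0x[0-9a-fA-F]+)"
)


def summarize(lines):
    """Count pending-abort warnings and collect port status events per HBA."""
    aborts = Counter()   # HBA name -> number of 3111 warnings
    xris = {}            # HBA name -> set of distinct XRIs seen in warnings
    port_events = []     # (HBA name, error 1, error 2) tuples
    for line in lines:
        m = ABORT_PENDING.search(line)
        if m:
            aborts[m["hba"]] += 1
            xris.setdefault(m["hba"], set()).add(m["xri"])
            continue
        m = PORT_STATUS.search(line)
        if m:
            port_events.append((m["hba"], m["err1"], m["err2"]))
    return aborts, xris, port_events


# Sample input: lines copied from the log evidence in this article.
SAMPLE = [
    "2026-03-27T21:00:00.977Z Wa(180) vmkwarning: cpu8:7971965)WARNING: lpfc : vmhba2 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45d9a3566430, xri x373",
    "2026-03-27T21:00:01.002Z Wa(180) vmkwarning: cpu16:7971967)WARNING: lpfc : vmhba2 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45d9a331a430, xri x1c0",
    "2026-03-27T21:00:01.238Z Wa(180) vmkwarning: cpu8:7971965)WARNING: lpfc : vmhba2 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45d9a330b430, xri x1cf",
    "2026-03-27T20:59:03.586Z Wa(180) vmkwarning: cpu50:2097919)WARNING: lpfc : vmhba2 lpfc_sli4_eratt_read:8275: 2885 Port Status Event: port status reg 0x81800000, port smphr reg 0xc000, error 1=0x20004a01, error 2=0x218",
]

if __name__ == "__main__":
    aborts, xris, port_events = summarize(SAMPLE)
    for hba, count in aborts.items():
        print(f"{hba}: {count} pending-abort warnings, {len(xris[hba])} distinct XRIs")
    for hba, err1, err2 in port_events:
        print(f"{hba}: port status event error1={err1} error2={err2}")
```

To run this against a real log, pass the opened file object instead of SAMPLE, for example summarize(open("/var/log/vmkernel.log")). A steadily growing count of distinct XRIs on one vmhba is consistent with the abort storm described in this article.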

Environment

VMware vSphere ESXi 8.x
VMware vSphere ESX 9.x

Cause

This issue is caused by Fibre Channel port flapping that results from a degraded SFP transceiver on the fabric switch.

Unlike a "Hard Failure" (where the link goes down), a degraded SFP may maintain physical synchronization (laser-active) while intermittently corrupting or dropping Fibre Channel frames.

This creates a catastrophic condition for the ESXi storage stack:

  1. NMP Inability to Failover: Because the physical link state remains "Up," the Native Multipathing Plugin (NMP) does not mark the path as "Dead."

  2. The Abort Rule: Per T10 SCSI standards, ESXi cannot retry a timed-out I/O on a healthy path until it receives an "Abort Successful" confirmation from the array on the original path.

  3. HBA Queue Exhaustion: Since the degraded SFP drops the "Abort Request" frames, the host's HBA hardware buffers become permanently occupied by pending aborts.

  4. Noisy Neighbor Impact: Once the HBA's hardware queues are 100% saturated, the HBA cannot process I/O for any storage array, including healthy targets sharing the same physical HBA.
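
The sequence above can be illustrated with a toy model. The following Python sketch is a simplification under stated assumptions, not a description of the lpfc driver internals: the pool size, the per-tick I/O rates, and the function name simulate are all hypothetical. It only shows how a fixed pool of exchange resources (XRIs) drains when aborts on a degraded path never complete, and how I/O to a healthy target then stops being serviced.

```python
def simulate(pool_size=8, ticks=6, degraded_per_tick=2, healthy_per_tick=2):
    """Toy model of XRI starvation on a shared HBA (all numbers hypothetical).

    Returns a list of (tick, free_xris, stuck_xris, healthy_ios_served).
    """
    free = pool_size   # XRIs available for new I/O
    stuck = 0          # XRIs held by aborts that never get a response
    history = []
    for tick in range(ticks):
        # I/O to the degraded path takes an XRI, times out, and its abort
        # frame is dropped, so the XRI is never returned to the pool.
        taken = min(degraded_per_tick, free)
        free -= taken
        stuck += taken
        # I/O to a healthy target needs a free XRI too; it completes within
        # the tick and returns the XRI, so it does not change the free count.
        served = min(healthy_per_tick, free)
        history.append((tick, free, stuck, served))
    return history


if __name__ == "__main__":
    for tick, free, stuck, served in simulate():
        print(f"tick {tick}: free={free} stuck={stuck} healthy I/Os served={served}")
```

With the default values, the healthy target is served normally for the first three ticks and receives no service from the fourth tick onward, even though nothing is wrong with that target or its path.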

Resolution

Engage the SAN vendor to investigate the fabric further. In addition:

  • Zoning Hygiene: Unpresent (mask) and unzone any LUNs that are not actively in use. ESXi periodically probes all presented LUNs (SCSI INQUIRY, opcode 0x12); removing unused LUNs reduces the background "noise" that can contribute to queue exhaustion during fabric instability.

  • Fabric Monitoring: Configure the FC switch fabric to use Port Guard or a similar feature that automatically disables ports whose CRC or encoding error counts exceed defined thresholds.
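
The threshold logic behind such a feature can be sketched as follows. This Python example is only an illustration of the idea and does not communicate with any switch: the PortSample type, the port names, the counter values, and the threshold are hypothetical. On a real fabric, the equivalent policy is configured on the switch itself according to the switch vendor's documentation.

```python
from dataclasses import dataclass


@dataclass
class PortSample:
    port: str        # hypothetical port label, e.g. "1/12"
    crc_errors: int  # cumulative CRC error counter at sample time


def ports_over_threshold(previous, current, max_new_errors):
    """Return ports whose CRC counter grew by more than max_new_errors
    between two polling samples."""
    prev = {s.port: s.crc_errors for s in previous}
    flagged = []
    for s in current:
        # A port with no earlier sample is treated as having no growth.
        delta = s.crc_errors - prev.get(s.port, s.crc_errors)
        if delta > max_new_errors:
            flagged.append(s.port)
    return flagged


if __name__ == "__main__":
    # Hypothetical counters from two consecutive polling intervals.
    before = [PortSample("1/12", 10), PortSample("1/13", 0)]
    after = [PortSample("1/12", 75), PortSample("1/13", 1)]
    print(ports_over_threshold(before, after, max_new_errors=50))
```

Comparing counter growth between samples, rather than the absolute counter value, avoids flagging a port for errors that accumulated long ago.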

Additional Information