ESXi Host Management Unresponsive and VM Deadlock: Emulex lpfc Driver

Products

VMware vSphere ESXi

Issue/Introduction

A specific condition in the Emulex lpfc driver can lead to complete ESXi host management failure and Virtual Machine (VM) deadlocks. This typically occurs during SCSI device discovery when a storage controller is rejoined to a fabric, causing the driver's worker thread to become overwhelmed and leading to "XRI Starvation."

Symptoms

All VMs on the impacted hosts become unresponsive and unreachable, and no longer respond to ping.
Datastores are no longer accessible or visible in the vSphere Client UI.
Host management becomes unresponsive; tasks such as "Enter Maintenance Mode" hang at 0%.
Standard troubleshooting commands (e.g., df -h, esxcli, vdf) hang and do not return.
Skyline Health may report "Host with connectivity issues."
The issue typically manifests approximately 15 minutes after re-adding a storage controller to the fabric.

Log Evidence

Review /var/log/vmkernel.log and /var/log/vobd.log for entries such as the following.

VMFS Heartbeat Timeout:

2026-05-14T14:16:23.953Z In(14) vobd[2097955]: [vmfsCorrelator] [vob.vmfs.heartbeat.timedout] 69e8dfd8-########-####-############ [DATASTORE_NAME]

Abort Storms / XRI Starvation:

2026-05-14T14:16:23.958Z Wa(180) vmkwarning: cpu46:2098429)WARNING: lpfc : vmhba4 lpfc_validate_fcp_abort:7541: 3111 Outstanding FCP I/O Abort Request still pending on io_buf 0x45daad3af430, xri x728

Port Status Errors:

2026-05-14T14:16:17.923Z Wa(180) vmkwarning: cpu11:2098175)WARNING: lpfc : vmhba4 lpfc_sli4_eratt_read:8275: 2885 Port Status Event: port status reg 0x81800000, port smphr reg 0xc000, error 1=0x2e004a01, error 2=0x218

I/O Stuck Notification:

2026-05-14T14:18:23.064Z Wa(180) vmkwarning: cpu2:9761735)WARNING: ScsiDeviceIO: 13515: IO stuck on device naa.600507680c8104f8b80000000000037c for more than 120000 seconds
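
The four signatures above can be counted in a saved copy of the logs with a short script. The following is a minimal sketch, not a supported tool: the patterns are taken only from the excerpts shown in this article, and the category labels and function names are hypothetical. Because commands on an affected host may hang, it is assumed that the script runs against log files copied to another system.

```python
import re

# Patterns derived from the log excerpts in this article.
# The category labels (dictionary keys) are illustrative, not VMware terms.
SIGNATURES = {
    "vmfs_heartbeat_timeout": re.compile(r"vob\.vmfs\.heartbeat\.timedout"),
    "fcp_abort_pending": re.compile(r"lpfc_validate_fcp_abort:\d+: 3111 "),
    "port_status_event": re.compile(r"lpfc_sli4_eratt_read:\d+: 2885 "),
    "io_stuck": re.compile(r"ScsiDeviceIO: \d+: IO stuck on device"),
}


def classify_line(line):
    """Return the label of the first signature found in the line, else None."""
    for label, pattern in SIGNATURES.items():
        if pattern.search(line):
            return label
    return None


def count_signatures(lines):
    """Count how many lines match each signature."""
    counts = {label: 0 for label in SIGNATURES}
    for line in lines:
        label = classify_line(line)
        if label is not None:
            counts[label] += 1
    return counts
```

For example, `count_signatures(open("vmkernel.log"))` returns a dictionary of counts for a local copy of the log. A large number of `fcp_abort_pending` entries shortly after a `port_status_event` on the same vmhba would be consistent with the pattern described in this article, although the counts alone do not confirm the root cause.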

Environment

VMware vSphere ESXi 8.x
Driver: Emulex lpfc (versions prior to 14.4.576.11)
Hardware: Emulex LightPulse LPe31000/LPe32000 series HBA
Impacted Storage: IBM FlashSystem, Hitachi Fibre Channel arrays, or similar multi-controller arrays.

Cause

The lpfc worker thread processes both fabric discovery and I/O completions. When a storage controller is re-added, the driver initiates a SCSI scan. If the REPORT LUNS command fails or experiences high latency, the driver calls the ESX API vmk_ScsiScanAndClaimPaths() directly within the worker thread, which blocks the thread. While the worker thread is blocked, it cannot service I/O completions, which leads to mailbox timeouts, buffer exhaustion (XRI starvation), and a host-wide deadlock in which all I/O is blocked.
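
The effect of running a slow call inside the only worker can be illustrated with a toy model. The sketch below is not driver code. It is a deterministic simulation under stated assumptions: a single worker handles events in arrival order, each I/O completion costs one tick, and a scan costs `scan_cost` ticks when run inline or one tick when handed off to another context. The function name, event kinds, and tick costs are all hypothetical.

```python
def worst_io_wait(events, scan_cost, offload):
    """Simulate one worker handling (arrival_tick, kind) events in order.

    kind is "io" (1 tick of worker time) or "scan" (scan_cost ticks when
    run inline, 1 tick when handed off). Returns the longest time any
    "io" event waited between arriving and being serviced.
    """
    clock = 0
    worst = 0
    for arrival, kind in sorted(events):
        clock = max(clock, arrival)  # worker idles until the event arrives
        if kind == "io":
            worst = max(worst, clock - arrival)
            clock += 1
        elif offload:
            clock += 1  # only the hand-off consumes worker time
        else:
            clock += scan_cost  # the worker is blocked for the whole scan
    return worst


# One scan at tick 0, then I/O completions arriving at ticks 1 through 5.
events = [(0, "scan")] + [(t, "io") for t in range(1, 6)]
inline_wait = worst_io_wait(events, scan_cost=100, offload=False)
offload_wait = worst_io_wait(events, scan_cost=100, offload=True)
```

In this model every completion that arrives during an inline scan waits until the scan finishes, whereas handing the scan off leaves the worker free to service completions immediately. In the real driver, the stalled completions are what allow aborts and exchange resources to pile up.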

Resolution

The issue is resolved in lpfc driver version 14.4.576.11 and higher. The fix offloads the ESX API call to the pathclaim world, preventing discovery failures from blocking the primary driver service thread.

Identify the current lpfc driver version running on the ESXi host.
Download version 14.4.576.11 or higher from the Broadcom Support Portal.
Install the updated driver following standard maintenance procedures.
Verify the certified driver and firmware combination using the VMware Compatibility Guide.
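
When checking many hosts, the comparison against the fixed version can be scripted. The following is a minimal sketch, assuming the installed version has already been collected as a string (for example, from the output of esxcli software vib list) and that it begins with a dotted numeric version. The function names and the sample version strings in the comments are hypothetical.

```python
import re

# First driver version containing the fix, as stated in this article.
FIXED_VERSION = (14, 4, 576, 11)


def parse_lpfc_version(text):
    """Return the leading dotted numeric part of a version string as ints.

    Any suffix after the dotted numbers (such as a build tag) is ignored.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def needs_update(installed):
    """True if the installed version is older than the fixed version."""
    return parse_lpfc_version(installed) < FIXED_VERSION


# Hypothetical inputs for illustration:
# needs_update("14.4.576.10") is True, needs_update("14.4.576.11") is False.
```

Comparing tuples of integers avoids the pitfalls of plain string comparison, where "14.4.576.9" would incorrectly sort after "14.4.576.11". The result says only whether the version number is older than the fix; the certified driver and firmware combination must still be confirmed in the VMware Compatibility Guide.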

Workaround

If an immediate driver update is not possible:

Host Reboot: A physical reboot of the impacted ESXi host is required to clear hardware buffer saturation and restore management.
Staggered Joins: Avoid large fabric changes or re-adding multiple storage controllers while the hosts are under heavy I/O load.