ESXi Storage Connectivity Loss due to Simultaneous Fibre Channel Link Drops leading to All paths down.

Article ID: 441670

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

An ESXi host experiences a sudden and total loss of storage connectivity.

  • Multiple storage devices intermittently enter an **All Paths Down (APD)** state.
  • Production Virtual Machines (VMs) may become unresponsive or enter a read-only I/O error state.
  • Storage paths may recover automatically after a short duration (e.g., 10–20 seconds), but application-level disruption may persist after the paths return.

 

Environment

VMware ESXi 8.x

Cause

The root cause is a physical or fabric-layer connection drop impacting multiple active Fibre Channel adapters simultaneously.

When independent HBAs (e.g., vmhba0 and vmhba1) record physical link-down events within a few hundred milliseconds of each other, the timing points to a failure in a shared upstream component rather than in either HBA. This is typically caused by:

  • SAN switch reboots (intentional or accidental).
  • Power or backplane failure on a shared storage array controller.
  • Simultaneous maintenance on redundant fabric paths (Fabric A and Fabric B).

Validation:

Review /var/run/log/vmkernel.log for the following pattern to verify whether the issue is upstream from the ESXi host:

1. Simultaneous Link Down Events:
    WARNING: lpfc: lpfc_mbx_cmpl_read_topology:3696: vmhba0 1305 Link Down Event x2 received
    WARNING: lpfc: lpfc_mbx_cmpl_read_topology:3696: vmhba1 1305 Link Down Event x2 received

    Note: If these events occur less than 200 ms apart, they strongly indicate a shared point of failure.

2. Device Loss Timeouts:
   The driver initiates a 10-second timer to wait for fabric recovery:
   WARNING: lpfc: lpfc_start_devloss:4565: vmhba0 3248 Start 10 sec devloss tmo
   WARNING: lpfc: lpfc_start_devloss:4565: vmhba1 3248 Start 10 sec devloss tmo
 
3. APD Entry:
   If the links do not recover within the timeout, the affected devices enter the APD state:
   StorageApdHandlerEv: 106: Device [naa.xxx] has entered the All Paths Down state.
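
The timing check from step 1 can be scripted against a copy of vmkernel.log. The following is a minimal sketch, not a supported tool. It assumes each log line begins with an ISO-8601 UTC timestamp with fractional seconds (for example 2024-01-15T10:23:45.123Z) and that the link-down message matches the lpfc lines shown above. The function names `parse_link_down` and `simultaneous_drops` are hypothetical, and the 200 ms default simply mirrors the note in step 1.

```python
import re
from datetime import datetime, timedelta

# Assumption: every line starts with an ISO-8601 UTC timestamp such as
# 2024-01-15T10:23:45.123Z. Adjust TS_RE if your log lines differ.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6})Z")
# Matches the lpfc message shown above, e.g. "vmhba0 1305 Link Down Event".
LINK_DOWN_RE = re.compile(r"(vmhba\d+)\s+\d+\s+Link Down Event")


def parse_link_down(lines):
    """Return sorted (timestamp, hba) tuples for every link-down line."""
    events = []
    for line in lines:
        ts = TS_RE.match(line)
        hit = LINK_DOWN_RE.search(line)
        if ts and hit:
            when = datetime.strptime(ts.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((when, hit.group(1)))
    return sorted(events)


def simultaneous_drops(events, window=timedelta(milliseconds=200)):
    """Return pairs of events on different HBAs less than `window` apart."""
    pairs = []
    for i, (t1, hba1) in enumerate(events):
        for t2, hba2 in events[i + 1:]:
            if t2 - t1 >= window:
                break  # events are sorted, so later ones are further away
            if hba1 != hba2:
                pairs.append(((t1, hba1), (t2, hba2)))
    return pairs
```

To use it, pass an open file handle for a copied log to `parse_link_down` and feed the result to `simultaneous_drops`. Any returned pair identifies two different adapters that dropped inside the window, which is the pattern this article describes.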

Resolution

1. Share Timestamps: Provide the exact timestamps of the Link Down events to your SAN/Fabric team.
2. Audit Switch Logs: Request an audit of Cisco MDS or Brocade switch logs for any Registered State Change Notifications (RSCNs) or reboot events.
3. Inspect Physical Layer: Check SFPs, optical cabling, and patch panels for any signs of physical damage or signal degradation (low light levels).
4. Host Reboot: If the host remains sluggish or management agents are hung after fabric recovery, a reboot of the ESXi host may be required to clear residual APD references.
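
When handing timestamps to the SAN/Fabric team, note that ESXi log timestamps are in UTC, while switch logs may be kept in local time. The sketch below converts a Link Down timestamp into a search window in the switch's time zone. The function name `switch_log_window`, the fixed UTC offset parameter, and the 30-second margin are illustrative assumptions, not values from this article. Confirm the switch clock configuration before relying on the result.

```python
from datetime import datetime, timedelta, timezone


def switch_log_window(event_utc, utc_offset_hours, margin_seconds=30):
    """Return (start, end) of a search window in the switch's local time.

    event_utc: datetime of the Link Down event taken from vmkernel.log.
    utc_offset_hours: assumed fixed UTC offset of the switch clock.
    margin_seconds: hypothetical padding on either side of the event.
    """
    # Treat a naive timestamp as UTC, since ESXi logs are written in UTC.
    if event_utc.tzinfo is None:
        event_utc = event_utc.replace(tzinfo=timezone.utc)
    switch_tz = timezone(timedelta(hours=utc_offset_hours))
    local = event_utc.astimezone(switch_tz)
    margin = timedelta(seconds=margin_seconds)
    return local - margin, local + margin
```

A fixed offset keeps the example free of time zone database dependencies. It does not account for daylight saving changes, so verify the offset for the date of the event.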

Additional Information

All Paths Down for a storage device