Ongoing APD (all paths down) issues across fibre channel impacting the ESXi host connectivity to storage

Article ID: 392072


Products

VMware vSphere ESXi 8.x

Issue/Introduction

  • ESXi host connectivity to storage becomes impacted by devices reporting All Paths Down (APD) conditions across the Fibre Channel fabric.
    Upon further investigation, dropped frames are detected on the Host Bus Adapter (HBA) or the wider SAN infrastructure.

  • Checking Fibre Channel events shows the drops occurring on only a single HBA. The example below shows dropped frames on vmhba3 only.

# localcli storage san fc events get

FC Event Log
------------
YYYY-MM-DD HH:12:09.303 [vmhba3] Dropped frames (925696 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.607 [vmhba3] Dropped frames (645120 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.607 [vmhba3] Dropped frames (514048 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.607 [vmhba3] Dropped frames (948224 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.911 [vmhba3] Dropped frames (1044480 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.911 [vmhba3] Dropped frames (780288 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.911 [vmhba3] Dropped frames (989184 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:10.214 [vmhba3] Dropped frames (907264 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:10.214 [vmhba3] Dropped frames (800768 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:10.214 [vmhba3] Dropped frames (454656 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:10.518 [vmhba3] Dropped frames (923648 of 774 bytes) on <target/lun id> cmd:0x28

 

  • The same dropped frames are also visible in vmkernel.log, and counting all dropped-frame instances per HBA confirms they occur only on vmhba3.

# grep Dropped /var/log/vmkernel.log

YYYY-MM-DDTHH:17:08.561Z cpu23:2097502) qlnativefc: vmhba3 (##:0.0): qlnativefcStatusEntry:1922: (8:104) Dropped frame(s) detected (1015808 of 1048576 bytes). 
YYYY-MM-DDTHH:17:08.561Z cpu23:2097502) qlnativefc: vmhba3 (##:0.0): qlnativefcStatusEntry:1922: (1:4) Dropped frame(s) detected (1046528 of 1048576 bytes). 
YYYY-MM-DDTHH:17:08.561Z cpu23:2097502) qlnativefc: vmhba3 (##:0.0): qlnativefcStatusEntry:1922: (8:105) Dropped frame(s) detected (804864 of 1048576 bytes). 

# grep Dropped /var/log/vmkernel.log | awk '{print $4,$5}' | sort | uniq -c

        352 vmhba3 (##:0.0):
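For a quick per-HBA tally that also covers rotated logs, a small shell helper can extract the adapter name directly instead of relying on field positions. This is a sketch, not part of the KB; the log path in the usage comment is illustrative and may differ on your host.

```shell
# Sketch: tally dropped-frame log lines per HBA. Reads log text on stdin;
# grep -o keeps only the adapter name, so awk field positions don't matter.
count_dropped() {
  grep 'Dropped frame' | grep -o 'vmhba[0-9]*' | sort | uniq -c | sort -rn
}

# Hypothetical usage on the host (zcat -f also passes through plain files):
#   zcat -f /var/log/vmkernel* 2>/dev/null | count_dropped
```

The HBA with the highest count appears first, making a single failing adapter easy to spot.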

 

Environment

VMware vSphere ESXi 8.x
VMware vSphere ESXi 9.x

Cause

  • Dropped frames are Fibre Channel (FC) frames on the switch fabric that are lost in transit between the ESXi host and the storage array.
  • Dropped frames can cause performance issues for the VMs running on the impacted datastores. 

  • If connectivity to the datastore is lost altogether, the VMs may be left in a degraded state, running only from the ESXi host's memory. 

  • Dropped FC frames typically indicate a hardware issue. The source is usually the host HBA, the switch fabric, or the connection to the storage array.

Resolution

Removing the faulty connection should allow the datastore to reconnect if access was lost and stabilize the environment.

Typically, a rescan of the HBA is not required, as failover to the redundant HBA should be automatic. However, if the datastore does not reconnect automatically, a rescan of the adapters can be attempted. 
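A rescan can be issued from the ESXi CLI. The snippet below is a sketch to be run on the ESXi host itself; esxcli behaves the same as localcli when hostd is running.

```shell
# Sketch: rescan storage adapters so ESXi re-probes devices and paths.
rescan_adapters() {
  # --all rescans every HBA; use --adapter vmhbaN to limit scope instead.
  localcli storage core adapter rescan --all
}
```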

If a rescan does not bring back the redundant HBA, disabling the faulty HBA from the out-of-band management interface (iDRAC, iLO, BMC, etc.) can stabilize the environment and bring production back online. If this option is not available, disable all paths to the impacted HBA via the ESXi CLI.

  • Check all HBA path states:

    # localcli storage core path list

  • Disable all paths for a specific HBA (e.g., vmhba3):

    # localcli storage core path list | grep "Runtime Name:" | grep vmhba3 | awk '{print $3}' | while read -r line; do localcli storage core path set --path "$line" --state off; done

  • To disable a single path manually:

    # localcli storage core path set --path <UID or runtime name of the path, e.g. vmhba3:C0:T0:L0> --state off

Confirm that the paths show "State: off" or "State: dead", then reboot the host.
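The confirmation step can be scripted rather than checked by eye. The helper below is a sketch that parses `localcli storage core path list` output from stdin; the "Runtime Name:" and "State:" line layout is assumed from standard esxcli output.

```shell
# Sketch: fail if any path on the given HBA is still "State: active".
# Reads `localcli storage core path list` output on stdin.
check_paths_off() {
  hba="$1"
  ! awk -v hba="$hba" '
      /Runtime Name:/ { cur = ($3 ~ hba) }                 # track which HBA this path block belongs to
      cur && /^ *State:/ && $2 == "active" { found = 1 }   # note any still-active path on that HBA
      END { exit !found }                                  # awk exits 0 only if an active path was found
    '
}

# Hypothetical usage on the host:
#   localcli storage core path list | check_paths_off vmhba3 && echo "all vmhba3 paths are down"
```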

Once the environment has been stabilized and production restored, review the HBA, the switch fabric, and the backend storage array to find the source of the dropped frames.
Once the faulty component is remedied, reconfigure the connection to the datastore so that it again has redundant paths.
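After the hardware fault is fixed, paths that were set to "off" can be re-enabled with `--state active` and the adapter rescanned. This sketch mirrors the disable loop from the Resolution steps; it must be run on the ESXi host.

```shell
# Sketch: re-enable all previously disabled paths on an HBA, then rescan it.
reenable_hba_paths() {
  hba="$1"
  localcli storage core path list \
    | grep "Runtime Name:" | grep "$hba" | awk '{print $3}' \
    | while read -r p; do
        localcli storage core path set --path "$p" --state active
      done
  localcli storage core adapter rescan --adapter "$hba"
}

# Hypothetical usage: reenable_hba_paths vmhba3
```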