Ongoing APD (all paths down) issues across fibre channel impacting the ESXi host connectivity to storage

Article ID: 392072


Products

VMware vSphere ESXi 8.x

Issue/Introduction

  • ESXi host connectivity to storage becomes impacted by devices reporting All Paths Down (APD) conditions across the Fibre Channel fabric.
    Upon further investigation, dropped frames are detected on the Host Bus Adapter (HBA) or the wider SAN infrastructure.

  • Checking Fibre Channel events shows the drops occurring on only a single HBA. The example below shows dropped frames on vmhba3 only.

# localcli storage san fc events get

FC Event Log
------------
YYYY-MM-DD HH:12:09.303 [vmhba3] Dropped frames (925696 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.607 [vmhba3] Dropped frames (645120 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.607 [vmhba3] Dropped frames (514048 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.607 [vmhba3] Dropped frames (948224 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.911 [vmhba3] Dropped frames (1044480 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.911 [vmhba3] Dropped frames (780288 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:09.911 [vmhba3] Dropped frames (989184 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:10.214 [vmhba3] Dropped frames (907264 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:10.214 [vmhba3] Dropped frames (800768 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:10.214 [vmhba3] Dropped frames (454656 of 774 bytes) on <target/lun id> cmd:0x28 
YYYY-MM-DD HH:12:10.518 [vmhba3] Dropped frames (923648 of 774 bytes) on <target/lun id> cmd:0x28

 

  • The same dropped frames are also visible in vmkernel.log, and counting all dropped-frame instances per HBA confirms they occur only on vmhba3.

# grep Dropped /var/log/vmkernel.log

YYYY-MM-DDTHH:17:08.561Z cpu23:2097502) qlnativefc: vmhba3 (##:0.0): qlnativefcStatusEntry:1922: (8:104) Dropped frame(s) detected (1015808 of 1048576 bytes). 
YYYY-MM-DDTHH:17:08.561Z cpu23:2097502) qlnativefc: vmhba3 (##:0.0): qlnativefcStatusEntry:1922: (1:4) Dropped frame(s) detected (1046528 of 1048576 bytes). 
YYYY-MM-DDTHH:17:08.561Z cpu23:2097502) qlnativefc: vmhba3 (##:0.0): qlnativefcStatusEntry:1922: (8:105) Dropped frame(s) detected (804864 of 1048576 bytes). 

# grep Dropped /var/log/vmkernel.log | awk '{print $4,$5}' | sort | uniq -c

        352 vmhba3 (##:0.0):
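For a quick per-HBA tally that also covers rotated logs, a small shell helper can extract the adapter name directly instead of relying on field positions. This is a sketch, not part of the KB; the log path in the usage comment is illustrative and may differ on your host.

```shell
# Sketch: tally dropped-frame log lines per HBA. Reads log text on stdin;
# grep -o keeps only the adapter name, so awk field positions don't matter.
count_dropped() {
  grep 'Dropped frame' | grep -o 'vmhba[0-9]*' | sort | uniq -c | sort -rn
}

# Hypothetical usage on the host (zcat -f also passes through plain files):
#   zcat -f /var/log/vmkernel* 2>/dev/null | count_dropped
```

The HBA with the highest count appears first, making a single failing adapter easy to spot.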

 

Environment

VMware vSphere ESXi 8.x
VMware vSphere ESXi 9.x

Cause

  • Dropped frames are Fibre Channel (FC) frames on the switch fabric that are lost in transit between the ESXi host and the storage array.
  • Dropped frames can cause performance issues for the VMs running on the impacted datastores. 

  • If connectivity to the datastore is lost altogether, the VMs may be left in a degraded state, running only from the ESXi host's memory. 

  • Dropped FC frames typically indicate a hardware issue. The source is usually the host HBA, the switch fabric, or the connection to the storage array.

Resolution

Removing the faulty connection should allow the datastore to reconnect if access was lost and stabilize the environment.

Typically, a rescan of the HBA is not required, as failover to the redundant HBA should be automatic. However, if the datastore does not reconnect automatically, a rescan of the adapters can be attempted. 
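A rescan can be issued from the ESXi CLI. The snippet below is a sketch to be run on the ESXi host itself; esxcli behaves the same as localcli when hostd is running.

```shell
# Sketch: rescan storage adapters so ESXi re-probes devices and paths.
rescan_adapters() {
  # --all rescans every HBA; use --adapter vmhbaN to limit scope instead.
  localcli storage core adapter rescan --all
}
```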

If a rescan does not bring back the redundant HBA, disabling the faulty HBA from the out-of-band management interface (iDRAC, iLO, BMC, etc.) can stabilize the environment and bring production back online. If this option is not available, disable all paths to the impacted HBA via the ESXi CLI.

  • Check all HBA path states:

    # localcli storage core path list

  • Disable all paths for a specific HBA (e.g., vmhba3):

    # localcli storage core path list | grep "Runtime Name:" | grep vmhba3 | awk '{print $3}' | while read -r line; do localcli storage core path set --path "$line" --state off; done

  • To disable a single path manually:

    # localcli storage core path set --path <UID or runtime name of the path, e.g. vmhba3:C0:T0:L0> --state off

Confirm that the paths show "State: off" or "State: dead", then reboot the host.
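The confirmation step can be scripted rather than checked by eye. The helper below is a sketch that parses `localcli storage core path list` output from stdin; the "Runtime Name:" and "State:" line layout is assumed from standard esxcli output.

```shell
# Sketch: fail if any path on the given HBA is still "State: active".
# Reads `localcli storage core path list` output on stdin.
check_paths_off() {
  hba="$1"
  ! awk -v hba="$hba" '
      /Runtime Name:/ { cur = ($3 ~ hba) }                 # track which HBA this path block belongs to
      cur && /^ *State:/ && $2 == "active" { found = 1 }   # note any still-active path on that HBA
      END { exit !found }                                  # awk exits 0 only if an active path was found
    '
}

# Hypothetical usage on the host:
#   localcli storage core path list | check_paths_off vmhba3 && echo "all vmhba3 paths are down"
```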

Once the environment has been stabilized and production restored, review the HBA, the switch fabric, and the backend storage array to find the source of the dropped frames.
Once the faulty component is remedied, reconfigure the connection to the datastore so that it again has redundant paths.
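After the hardware fault is fixed, paths that were set to "off" can be re-enabled with `--state active` and the adapter rescanned. This sketch mirrors the disable loop from the Resolution steps; it must be run on the ESXi host.

```shell
# Sketch: re-enable all previously disabled paths on an HBA, then rescan it.
reenable_hba_paths() {
  hba="$1"
  localcli storage core path list \
    | grep "Runtime Name:" | grep "$hba" | awk '{print $3}' \
    | while read -r p; do
        localcli storage core path set --path "$p" --state active
      done
  localcli storage core adapter rescan --adapter "$hba"
}

# Hypothetical usage: reenable_hba_paths vmhba3
```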