All Paths Down reported during switch reboot activity despite having redundant switches

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

During a scheduled reboot of Fibre Channel switches, ESXi hosts lost connectivity to all datastores.
The environment is configured for redundancy with two HBAs per host, split between Fabric A and Fabric B.

The issue occurred following this sequence:
1. Fabric A Edge and Core switches were rebooted and restored.
2. Fabric B Edge and Core switches were rebooted immediately after.
3. Despite the redundancy, the ESXi host failed to fail over to the restored Fabric A, resulting in an All Paths Down condition.
Connectivity was restored only after manually toggling (shut/no-shut) the Edge switch ports.

Environment

VMware ESXi 8.x

VMware ESX 9.x

Cause

The issue stemmed from a failure of the storage paths on Fabric A to recover following the reboot of the Fabric A Core switch. Although the initial reboot of the Fabric A Edge switch resulted in successful path recovery, the subsequent Core switch reboot caused the paths to remain in a dead state.

Because the Fabric B switches were rebooted while the Fabric A paths were still dead, no redundant path remained, which triggered an All Paths Down (APD) condition.
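
The condition described above (one fabric's paths still dead when the second fabric is taken down) can be checked before maintenance continues. The following is a minimal sketch, assuming the output of `esxcli storage core path list` has been captured as text and that each path block contains `Runtime Name:`, `Adapter:` and `State:` fields. The function name and the sample text are hypothetical and are not part of any VMware tool.

```python
def dead_paths_by_adapter(esxcli_output):
    """Map adapter name -> runtime names of paths whose State is 'dead'.

    Assumes text captured from `esxcli storage core path list`, where each
    path block lists 'Runtime Name:' and 'Adapter:' before 'State:'.
    """
    dead = {}
    runtime = adapter = None
    for raw in esxcli_output.splitlines():
        line = raw.strip()
        if line.startswith("Runtime Name:"):
            runtime = line.split(":", 1)[1].strip()
        elif line.startswith("Adapter:"):
            adapter = line.split(":", 1)[1].strip()
        elif line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
            if state == "dead" and runtime and adapter:
                dead.setdefault(adapter, []).append(runtime)
            runtime = adapter = None  # this path block is finished
    return dead


# Hypothetical, shortened sample; real output has more fields per path.
SAMPLE = """\
fc.hypothetical-path-a
   Runtime Name: vmhba1:C0:T11:L30
   Adapter: vmhba1
   State: dead
fc.hypothetical-path-b
   Runtime Name: vmhba2:C0:T11:L30
   Adapter: vmhba2
   State: active
"""

for hba, paths in dead_paths_by_adapter(SAMPLE).items():
    print(f"{hba}: {len(paths)} dead path(s), do not reboot the other fabric yet")
```

If the returned mapping is non-empty for the adapter on the fabric that was just restored, rebooting the other fabric would remove the last working paths.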

Cause Validation

The following excerpts from /var/run/log/vmkernel.log capture the vmhba Link Down and Link Up events reported during the edge switch reboots:

2025-11-13T21:05:29.937Z Wa(180) vmkwarning: cpu1:2098630)WARNING: lpfc : vmhba1 lpfc_mbx_cmpl_read_topology:1800: 1305 Link Down Event x2 received Data: x2 x20 x400220 x0
2025-11-13T21:07:13.817Z In(182) vmkernel: cpu1:2098630)lpfc : vmhba1 lpfc_mbx_cmpl_read_topology:1759: 1303 Link Up Event x5 received Data: x5 x0 x90 x0 x0 -----> Fabric A edge switch reboot

2025-11-13T21:51:10.731Z Wa(180) vmkwarning: cpu2:2098617)WARNING: lpfc : vmhba2 lpfc_mbx_cmpl_read_topology:1800: 1305 Link Down Event x2 received Data: x2 x20 x400220 x0
2025-11-13T21:52:55.379Z In(182) vmkernel: cpu56:2098617)lpfc : vmhba2 lpfc_mbx_cmpl_read_topology:1759: 1303 Link Up Event x5 received Data: x5 x0 x90 x0 x0 -----> Fabric B edge switch reboot
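
To measure how long each HBA link was down, the Link Down and Link Up events above can be paired per vmhba. This is a minimal sketch, assuming the lpfc message format shown in the excerpts; the regular expression and function names are illustrative and would need adjusting for other HBA drivers.

```python
import re
from datetime import datetime

# Matches the lpfc link events shown in the vmkernel.log excerpts.
LINK_EVENT = re.compile(
    r"^(?P<ts>\S+Z) .*lpfc : (?P<hba>vmhba\d+) .*Link (?P<kind>Down|Up) Event"
)


def parse_ts(ts):
    """Parse an ESXi log timestamp such as 2025-11-13T21:05:29.937Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def link_outages(lines):
    """Return (hba, down_time, up_time, seconds) for each Down -> Up pair."""
    down_at = {}
    outages = []
    for line in lines:
        m = LINK_EVENT.search(line.strip())
        if not m:
            continue
        ts, hba = parse_ts(m["ts"]), m["hba"]
        if m["kind"] == "Down":
            down_at.setdefault(hba, ts)  # keep the first Down of an outage
        elif hba in down_at:
            start = down_at.pop(hba)
            outages.append((hba, start, ts, (ts - start).total_seconds()))
    return outages


SAMPLE = [
    "2025-11-13T21:05:29.937Z Wa(180) vmkwarning: cpu1:2098630)WARNING: lpfc : vmhba1 "
    "lpfc_mbx_cmpl_read_topology:1800: 1305 Link Down Event x2 received Data: x2 x20 x400220 x0",
    "2025-11-13T21:07:13.817Z In(182) vmkernel: cpu1:2098630)lpfc : vmhba1 "
    "lpfc_mbx_cmpl_read_topology:1759: 1303 Link Up Event x5 received Data: x5 x0 x90 x0 x0",
]

for hba, start, end, secs in link_outages(SAMPLE):
    print(f"{hba} link down for {secs:.3f} s ({start} -> {end})")
```

For the Fabric A Edge switch reboot shown above, the vmhba1 link was down for about 103.9 seconds.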

Timestamps in /var/run/log/vobd.log show that the vmhba1 paths transitioned to a dead state and recovered once the link was restored:

2025-11-13T21:05:39.940Z In(14) vobd[2098148]: [scsiCorrelator] 1628153725651us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath vmhba1:C0:T11:L30 changed state from on (device ID: naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx)
2025-11-13T21:07:15.630Z In(14) vobd[2098148]: [scsiCorrelator] 1628249012865us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba1:C0:T11:L30 changed state from dead

Subsequently, at 21:24, the paths associated with vmhba1 failed again and did not recover. As shown in the logs below, no vmhba Link Down events were recorded during this timeframe.

The absence of a Link Down event indicates that the Core switch, not the Edge switch, was rebooted. The ESXi host detects a vmhba link down only when the directly connected Edge switch goes down:

/var/run/log/vmkernel.log
2025-11-13T21:24:49.179Z In(182) vmkernel: cpu5:2098695)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x0 (0x45bb104b8e40, 0) to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx" on path "vmhba1:C0:T11:L30" Failed:
2025-11-13T21:24:49.179Z In(182) vmkernel: cpu5:2098695)NMP: nmp_ThrottleLogForDevice:3898: H:0x1 D:0x0 P:0x0 . Act:NONE. cmdId.initiator=0x453a5151bbc8 CmdSN 0x0

/var/run/log/vobd.log
2025-11-13T21:24:49.181Z In(14) vobd[2098148]: [scsiCorrelator] 1629302962592us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath vmhba1:C0:T11:L30 changed state from on (device ID: naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx)
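
The reasoning above (a path going dead with no preceding Link Down on the same vmhba points to an event beyond the directly connected Edge switch) can be applied across both log files. This is a minimal sketch, assuming the lpfc and vobd message formats shown in this article. The 30-second correlation window is an assumption chosen for illustration; in the excerpts above the path went dead roughly 10 seconds after the Link Down event.

```python
import re
from datetime import datetime, timedelta

LINK_DOWN = re.compile(r"^(?P<ts>\S+Z) .*lpfc : (?P<hba>vmhba\d+) .*Link Down Event")
PATH_DEAD = re.compile(
    r"^(?P<ts>\S+Z) .*pathstate\.dead\S*\] scsiPath "
    r"(?P<path>(?P<hba>vmhba\d+):\S+) changed state from on"
)


def parse_ts(ts):
    """Parse an ESXi log timestamp such as 2025-11-13T21:24:49.181Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def path_deaths_without_link_down(vmkernel_lines, vobd_lines, window_s=30):
    """Return (path, time) for dead-path events with no Link Down on the
    same vmhba in the preceding window_s seconds (assumed window)."""
    window = timedelta(seconds=window_s)
    link_downs = []
    for line in vmkernel_lines:
        m = LINK_DOWN.search(line.strip())
        if m:
            link_downs.append((m["hba"], parse_ts(m["ts"])))

    unexplained = []
    for line in vobd_lines:
        m = PATH_DEAD.search(line.strip())
        if not m:
            continue
        dead_at = parse_ts(m["ts"])
        explained = any(
            hba == m["hba"] and timedelta(0) <= dead_at - down_at <= window
            for hba, down_at in link_downs
        )
        if not explained:
            unexplained.append((m["path"], dead_at))
    return unexplained
```

Run against the excerpts in this article, the 21:05:39 path failure is explained by the 21:05:29 Link Down event on vmhba1, while the 21:24:49 failure is returned as unexplained, which matches the Core switch reboot.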

At 21:51, the Fabric B Edge switch was rebooted, causing the paths on vmhba2 to go down. Because the Fabric A paths were still offline, this triggered an All Paths Down (APD) state.

2025-11-13T21:51:20.733Z In(14) vobd[2098148]: [scsiCorrelator] 1630894507636us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath vmhba2:C0:T11:L30 changed state from on (device ID: naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx)
2025-11-13T21:52:56.699Z In(14) vobd[2098148]: [scsiCorrelator] 1630990474039us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba2:C0:T11:L30 changed state from dead

The logs below show the device entering the APD state and exiting it once the Fabric B link was restored:

2025-11-13T21:51:20.744Z In(182) vmkernel: cpu61:2098030)StorageApdHandlerEv: 106: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx] has entered the All Paths Down state.
2025-11-13T21:52:56.653Z In(182) vmkernel: cpu27:2098030)StorageApdHandlerEv: 113: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx] has exited the All Paths Down state.
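
The APD enter and exit messages above can be paired per device to measure each APD window. This is a minimal sketch, assuming the StorageApdHandlerEv message wording shown in the excerpts; the function name is illustrative.

```python
import re
from datetime import datetime

APD_EVENT = re.compile(
    r"^(?P<ts>\S+Z) .*StorageApdHandlerEv.*identifier \[(?P<dev>[^\]]+)\] "
    r"has (?P<kind>entered|exited) the All Paths Down state"
)


def parse_ts(ts):
    """Parse an ESXi log timestamp such as 2025-11-13T21:51:20.744Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def apd_windows(lines):
    """Return (device, entered, exited, seconds) per APD window.

    A window with no exit message in the input is reported with
    exited=None and seconds=None.
    """
    entered_at = {}
    windows = []
    for line in lines:
        m = APD_EVENT.search(line.strip())
        if not m:
            continue
        ts, dev = parse_ts(m["ts"]), m["dev"]
        if m["kind"] == "entered":
            entered_at.setdefault(dev, ts)
        elif dev in entered_at:
            start = entered_at.pop(dev)
            windows.append((dev, start, ts, (ts - start).total_seconds()))
    for dev, start in entered_at.items():
        windows.append((dev, start, None, None))  # still in APD at end of input
    return windows
```

For the two messages above, this yields a single APD window of about 95.9 seconds for the device.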

Finally, when the Fabric B Core switch was rebooted, the paths associated with vmhba2 went down again, causing the devices to enter the All Paths Down state.

/var/run/log/vobd.log
2025-11-13T22:06:43.887Z In(14) vobd[2098148]: [scsiCorrelator] 1631817628637us: [vob.scsi.scsipath.pathstate.deadver2] scsiPath vmhba2:C0:T11:L30 changed state from on (device ID: naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx)

/var/run/log/vmkernel.log
2025-11-13T22:06:43.854Z In(182) vmkernel: cpu34:2098030)StorageApdHandlerEv: 106: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx] has entered the All Paths Down state.

To resolve the APD state, an administrative port toggle (shut/no-shut) was performed on the Edge switches of both fabrics, which successfully brought the paths back online.

/var/run/log/vmkernel.log
2025-11-13T22:25:30.916Z Wa(180) vmkwarning: cpu60:2098617)WARNING: lpfc : vmhba2 lpfc_mbx_cmpl_read_topology:1800: 1305 Link Down Event x6 received Data: x6 x20 x400220 x0
2025-11-13T22:25:37.693Z In(182) vmkernel: cpu60:2098617)lpfc : vmhba2 lpfc_mbx_cmpl_read_topology:1759: 1303 Link Up Event x7 received Data: x7 x0 x90 x0 x0 -----> Fabric B edge switch port reset
2025-11-13T22:26:03.941Z Wa(180) vmkwarning: cpu0:2098630)WARNING: lpfc : vmhba1 lpfc_mbx_cmpl_read_topology:1800: 1305 Link Down Event x6 received Data: x6 x20 x400220 x0
2025-11-13T22:26:11.774Z In(182) vmkernel: cpu30:2098630)lpfc : vmhba1 lpfc_mbx_cmpl_read_topology:1759: 1303 Link Up Event x7 received Data: x7 x0 x90 x0 x0 -----> Fabric A edge switch port reset

2025-11-13T22:25:38.093Z In(182) vmkernel: cpu42:2098030)StorageApdHandlerEv: 113: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxx] has exited the All Paths Down state.

Resolution

As the logs indicate, the ESXi host did not receive the necessary state change notifications from the fabric. This issue requires analysis by the switch hardware vendor.

Engage your Fibre Channel switch vendor to investigate why Registered State Change Notifications (RSCNs) were not correctly propagated to the Edge switch ports when the Core switch came back online.