Multiple VMs rebooted due to a vSphere HA trigger following network switch maintenance

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

During a scheduled network fabric upgrade, a storage pathing redundancy failure occurred on two ESXi nodes. This resulted in a total loss of storage connectivity (All Paths Down), triggering vSphere High Availability (HA) isolation responses and subsequent unplanned reboots of production Virtual Machines.
A few ESXi hosts triggered HA isolation responses, whereas others were not impacted
Impacted hosts exhibited a significantly lower iSCSI path count compared to the cluster baseline
No physical hardware defects or component failures were identified on the ESXi hosts or local NICs
Affected hosts entered an All Paths Down (APD) state, rendering the datastores inaccessible
The hosts regained access once the first switch came back up after the upgrade
On an impacted host, the LUN shows 4 paths instead of the 8 paths seen on a working host:

naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea : XXXX iSCSI Disk (naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea)
vmhba68:C2:T0:Lxxx LUN:xxx state:active iscsi Adapter: iqn.1998-01.com.xxxxxx:xxxxxx.xxx.xxx.xxx:1823666357:68 Target: IQN=iqn.2010-06.com.xxxxxx:xxxxxarray.513147f48e7a5e57 Alias= Session=00023d000003 PortalTag=1
vmhba68:C3:T0:Lxxx LUN:xxx state:active iscsi Adapter: iqn.1998-01.com.xxxxxx:xxxxxx.xxx.xxx.xxx:1823666357:68 Target: IQN=iqn.2010-06.com.xxxxxx:xxxxxarray.513147f48e7a5e57 Alias= Session=00023d000004 PortalTag=1
vmhba68:C0:T0:Lxxx LUN:xxx state:active iscsi Adapter: iqn.1998-01.com.xxxxxx:xxxxxx.xxx.xxx.xxx:1823666357:68 Target: IQN=iqn.2010-06.com.xxxxxx:xxxxxarray.513147f48e7a5e57 Alias= Session=00023d000001 PortalTag=1
vmhba68:C1:T0:Lxxx LUN:xxx state:active iscsi Adapter: iqn.1998-01.com.xxxxxx:xxxxxx.xxx.xxx.xxx:1823666357:68 Target: IQN=iqn.2010-06.com.xxxxxx:xxxxxarray.513147f48e7a5e57 Alias= Session=00023d000002 PortalTag=1
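The path count per device can be compared across hosts with a short script. The following is a minimal sketch, not a VMware tool: it assumes a path listing in the format shown above has been saved as text, and the helper name `count_paths`, the sample device ID, and the LUN number are illustrative placeholders.

```python
import re

# Matches a path line such as "vmhba68:C2:T0:L250 LUN:250 state:active ..."
# and captures the path state.
PATH_RE = re.compile(r"^\s*vmhba\d+:C\d+:T\d+:L\w+\s.*\bstate:(\w+)")


def count_paths(listing):
    """Return {device_id: number of active paths} for a saved path listing."""
    counts = {}
    device = None
    for line in listing.splitlines():
        stripped = line.strip()
        if stripped.startswith("naa."):
            # Device header line: "naa.<id> : <description>"
            device = stripped.split()[0]
            counts.setdefault(device, 0)
        elif device is not None:
            m = PATH_RE.match(line)
            if m and m.group(1) == "active":
                counts[device] += 1
    return counts


# Placeholder data modelled on the listing above (not real output).
sample = """\
naa.624example : EXAMPLE iSCSI Disk (naa.624example)
   vmhba68:C2:T0:L250 LUN:250 state:active iscsi Adapter: iqn.example
   vmhba68:C3:T0:L250 LUN:250 state:active iscsi Adapter: iqn.example
   vmhba68:C0:T0:L250 LUN:250 state:active iscsi Adapter: iqn.example
   vmhba68:C1:T0:L250 LUN:250 state:active iscsi Adapter: iqn.example
"""

print(count_paths(sample))  # {'naa.624example': 4}
```

Running the same check on a healthy host from the cluster gives the baseline to compare against (8 paths in this case).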

All 4 paths went down when the vmnic connected to one of the switches went down:

yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu14:2098701)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x12 (0x45bc287fc240, 0) to dev "naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea" on path "vmhba68:C2:T0:L250" Failed:
yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu14:2098701)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x12 (0x45bc287fc240, 0) to dev "naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea" on path "vmhba68:C3:T0:L250" Failed:
yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu27:2098714)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x12 (0x45bc28778840, 0) to dev "naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea" on path "vmhba68:C1:T0:L250" Failed:
yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu 4:2098691)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x12 (0x45bc26f4d700, 0) to dev "naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea" on path "vmhba68:C0:T0:L250" Failed:

The host then entered the APD state:

yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu5:2098414)StorageApdHandlerEv: 106: Device or filesystem with identifier [naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea] has entered the All Paths Down state.
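To correlate the failed paths with the APD event, the two log signatures above can be extracted from a saved vmkernel log. This is a hedged sketch: the function name `summarize` and the sample lines are hypothetical, and the patterns are based only on the message text quoted in this article.

```python
import re

# Patterns derived from the log lines quoted above.
FAILED_PATH_RE = re.compile(r'to dev "([^"]+)" on path "([^"]+)" Failed')
APD_RE = re.compile(
    r"identifier \[([^\]]+)\] has entered the All Paths Down state"
)


def summarize(log_text):
    """Return (failed paths per device, set of devices that entered APD)."""
    failed = {}
    for dev, path in FAILED_PATH_RE.findall(log_text):
        failed.setdefault(dev, set()).add(path)
    apd = set(APD_RE.findall(log_text))
    return failed, apd


# Placeholder log lines modelled on the excerpts above (not real output).
sample_log = (
    'ts In(182) vmkernel: cpu14:1)NMP: nmp_ThrottleLogForDevice:3893: '
    'Cmd 0x12 (0x0, 0) to dev "naa.624example" on path '
    '"vmhba68:C2:T0:L250" Failed:\n'
    'ts In(182) vmkernel: cpu14:1)NMP: nmp_ThrottleLogForDevice:3893: '
    'Cmd 0x12 (0x0, 0) to dev "naa.624example" on path '
    '"vmhba68:C3:T0:L250" Failed:\n'
    'ts In(182) vmkernel: cpu5:2)StorageApdHandlerEv: 106: Device or '
    'filesystem with identifier [naa.624example] has entered the '
    'All Paths Down state.\n'
)

failed, apd = summarize(sample_log)
for dev in sorted(apd):
    print(dev, "entered APD; failed paths:", sorted(failed.get(dev, [])))
```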

Environment

ESX 9.x
ESXi 8.x

Cause

iSCSI Port Binding was not configured, which resulted in random session distribution. Consequently, all active iSCSI sessions were inadvertently pinned to a single physical uplink, creating a single point of failure that led to an All Paths Down (APD) state when the associated switch was rebooted.
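The failure mode can be illustrated with a toy model. This sketch is purely illustrative and makes assumptions not stated in the article: two uplinks named `vmnic0` and `vmnic1`, and four target portals. The article itself only shows 4 paths on an impacted host versus 8 on a working one.

```python
def surviving_paths(sessions, failed_uplink):
    """Return the sessions that do not depend on the failed uplink.

    sessions: list of (uplink, portal) tuples.
    """
    return [s for s in sessions if s[0] != failed_uplink]


# Hypothetical target portals, for illustration only.
portals = ["portal-1", "portal-2", "portal-3", "portal-4"]

# Without port binding: every session happened to land on the same uplink.
unbound = [("vmnic0", p) for p in portals]

# With port binding: one session per bound uplink and portal.
bound = [(nic, p) for nic in ("vmnic0", "vmnic1") for p in portals]

print(len(surviving_paths(unbound, "vmnic0")))  # 0 -> All Paths Down
print(len(surviving_paths(bound, "vmnic0")))    # 4 of 8 paths remain
```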

Resolution

Configure Port Binding to associate the iSCSI VMkernel interfaces with their respective physical NICs. Configuring Port Binding forces the ESXi host to establish and maintain active sessions through every bound VMkernel interface.
This configuration ensures that the failure of a single physical switch or NIC only results in a partial path loss; the host remains connected via the alternate fabric.
By maintaining at least 50% of the storage paths, the host avoids the APD state, thereby preventing HA isolation events and unplanned VM reboots.
Perform a storage adapter rescan to ensure the software initiator establishes redundant sessions across both physical fabrics.
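The binding and rescan steps can also be performed from the ESXi command line. The sketch below only prints the `esxcli` commands for review; it does not run them. The adapter name `vmhba68` is taken from the output earlier in this article, while the VMkernel interface names `vmk1` and `vmk2` and the helper `port_binding_commands` are assumptions to be replaced with the values from the actual host. It also assumes each VMkernel interface is already configured with a single active uplink, which port binding generally requires.

```python
def port_binding_commands(adapter, vmk_nics):
    """Build the esxcli commands to bind VMkernel NICs and rescan an adapter."""
    cmds = [
        f"esxcli iscsi networkportal add --adapter={adapter} --nic={vmk}"
        for vmk in vmk_nics
    ]
    # List the bound interfaces to verify the result.
    cmds.append(f"esxcli iscsi networkportal list --adapter={adapter}")
    # Rescan so the software initiator logs in through every bound interface.
    cmds.append(f"esxcli storage core adapter rescan --adapter={adapter}")
    return cmds


# vmk1 and vmk2 are hypothetical interface names.
for cmd in port_binding_commands("vmhba68", ["vmk1", "vmk2"]):
    print(cmd)
```

After the rescan, the path count for each LUN should match the cluster baseline.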

Note: It is recommended to put the host in Maintenance mode and clear all old iSCSI sessions by removing the discovery IPs. If required, reboot the host to clear all stale sessions.

Reference: