Multiple VMs rebooted due to a vSphere HA trigger following network switch maintenance

Article ID: 428308

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms: 

  • During a scheduled network fabric upgrade, a storage pathing redundancy failure occurred on two ESXi nodes. This resulted in a total loss of storage connectivity (All Paths Down), triggering vSphere High Availability (HA) isolation responses and subsequent unplanned reboots of production Virtual Machines.

  • Only some ESXi hosts triggered HA isolation responses, whereas others were not impacted

  • Impacted hosts exhibited a significantly lower iSCSI path count compared to the cluster baseline

  • No physical hardware defects or component failures were identified on the ESXi hosts or local NICs

  • Affected hosts entered an All Paths Down (APD) state, rendering the datastores inaccessible

  • The hosts regained datastore access after the first switch came back up following its upgrade

  • On an impacted host, the LUN shows 4 paths instead of the 8 paths seen on a working host

naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea : XXXX iSCSI Disk (naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea)
   vmhba68:C2:T0:Lxxx LUN:xxx state:active iscsi Adapter: iqn.1998-01.com.xxxxxx:xxxxxx.xxx.xxx.xxx:1823666357:68  Target: IQN=iqn.2010-06.com.xxxxxx:xxxxxarray.513147f48e7a5e57 Alias= Session=00023d000003 PortalTag=1
   vmhba68:C3:T0:Lxxx LUN:xxx state:active iscsi Adapter: iqn.1998-01.com.xxxxxx:xxxxxx.xxx.xxx.xxx:1823666357:68  Target: IQN=iqn.2010-06.com.xxxxxx:xxxxxarray.513147f48e7a5e57 Alias= Session=00023d000004 PortalTag=1
   vmhba68:C0:T0:Lxxx LUN:xxx state:active iscsi Adapter: iqn.1998-01.com.xxxxxx:xxxxxx.xxx.xxx.xxx:1823666357:68  Target: IQN=iqn.2010-06.com.xxxxxx:xxxxxarray.513147f48e7a5e57 Alias= Session=00023d000001 PortalTag=1
   vmhba68:C1:T0:Lxxx LUN:xxx state:active iscsi Adapter: iqn.1998-01.com.xxxxxx:xxxxxx.xxx.xxx.xxx:1823666357:68  Target: IQN=iqn.2010-06.com.xxxxxx:xxxxxarray.513147f48e7a5e57 Alias= Session=00023d000002 PortalTag=1
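
To compare path counts between hosts, one option is to count the path lines under each device. The sketch below assumes input in the format of the listing above (a device line starting with `naa.`, followed by indented `vmhba` path lines), which matches the brief output of `esxcfg-mpath -b`. The function name `count_paths_per_device` is illustrative and not part of any VMware tool.

```shell
# Count paths per device from `esxcfg-mpath -b` style output read on stdin.
# Assumes device lines start with "naa." and path lines are indented and
# start with "vmhba", as in the listing above.
count_paths_per_device() {
  awk '
    /^naa\./ { dev = $1; count[dev] = 0; order[++n] = dev; next }
    /^[ \t]+vmhba/ && dev != "" { count[dev]++ }
    END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
  '
}

# Usage on an ESXi host (ESXi Shell or SSH):
#   esxcfg-mpath -b | count_paths_per_device
```

Running this on an impacted host and on a working host makes the difference in path count per LUN easy to spot.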

  • All 4 paths went down when the one vmnic connected to one of the switches went down

yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu14:2098701)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x12 (0x45bc287fc240, 0) to dev "naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea" on path "vmhba68:C2:T0:L250" Failed:
yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu14:2098701)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x12 (0x45bc287fc240, 0) to dev "naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea" on path "vmhba68:C3:T0:L250" Failed:
yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu27:2098714)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x12 (0x45bc28778840, 0) to dev "naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea" on path "vmhba68:C1:T0:L250" Failed:
yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu 4:2098691)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x12 (0x45bc26f4d700, 0) to dev "naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea" on path "vmhba68:C0:T0:L250" Failed:

  • Host experienced APD 

yyyy-mm-ddThh:mm:ss In(182) vmkernel: cpu5:2098414)StorageApdHandlerEv: 106: Device or filesystem with identifier [naa.624xxxxxxxxxxxxxxxxxxxxxx000113ea] has entered the All Paths Down state.
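
To find when a host entered the APD state, the vmkernel log can be searched for the message shown above. This is a minimal sketch: the helper name `apd_events` is hypothetical, and the default path `/var/run/log/vmkernel.log` is assumed to be the log location on the host. Pass another file when working from a log bundle.

```shell
# List APD state messages recorded in a vmkernel log file.
# The default path is an assumption about the host's log location;
# pass a different file as the first argument to override it.
apd_events() {
  grep "All Paths Down state" "${1:-/var/run/log/vmkernel.log}"
}

# Usage on the host:
#   apd_events                        # current log
#   apd_events /path/to/vmkernel.log  # log taken from a support bundle
```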

 

 

Environment

ESX 9.x
ESXi 8.x 

 

Cause

iSCSI Port Binding was not configured, which resulted in random session distribution. Consequently, all active iSCSI sessions were inadvertently pinned to a single physical uplink, creating a single point of failure that led to an All Paths Down (APD) state when the associated switch was rebooted.
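
To check whether a host is in this condition, the VMkernel interfaces bound to the iSCSI adapter can be listed. The sketch below uses `esxcli iscsi networkportal list`. The helper name `has_port_binding` is hypothetical, `vmhba68` is the adapter name from the listing above, and the check assumes that the command prints nothing when no VMkernel port is bound.

```shell
# Report whether any VMkernel interface is bound to the given iSCSI adapter.
# Assumption: with no port binding configured, `networkportal list`
# produces no output.
has_port_binding() {
  adapter="$1"
  if [ -n "$(esxcli iscsi networkportal list --adapter="$adapter")" ]; then
    echo "$adapter: port binding configured"
  else
    echo "$adapter: no VMkernel ports bound"
  fi
}

# Usage: has_port_binding vmhba68
```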

Resolution

  • Configure Port Binding to associate the iSCSI VMkernel interfaces with their respective physical NICs. Port Binding forces the ESXi host to establish and maintain active sessions through every bound VMkernel interface.

  • This configuration ensures that the failure of a single physical switch or NIC only results in a partial path loss; the host remains connected via the alternate fabric.

  • By maintaining at least 50% of the storage paths, the host avoids the APD state, thereby preventing HA isolation events and unplanned VM reboots.

  • Perform a storage adapter rescan to ensure the software initiator establishes redundant sessions across both physical fabrics. 
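
The binding and rescan steps above can be sketched with `esxcli` as follows. This is an illustration under stated assumptions, not a definitive procedure: `vmhba68` is the adapter name from the listing above, `vmk1` and `vmk2` are placeholder VMkernel interface names, and the helper name `bind_iscsi_ports` is hypothetical. Port binding also expects each bound VMkernel port to have exactly one active uplink and no standby uplinks, so the port group teaming policy should be checked first.

```shell
# Bind each iSCSI VMkernel interface to the iSCSI adapter, then rescan it.
# Arguments: adapter name followed by one or more VMkernel interface names.
bind_iscsi_ports() {
  adapter="$1"; shift
  for vmk in "$@"; do
    # Stop at the first interface that fails to bind.
    esxcli iscsi networkportal add --adapter="$adapter" --nic="$vmk" || return 1
  done
  # Rescan so the initiator logs in through every bound interface.
  esxcli storage core adapter rescan --adapter="$adapter"
}

# Usage (placeholder interface names):
#   bind_iscsi_ports vmhba68 vmk1 vmk2
# Verify afterwards:
#   esxcli iscsi networkportal list --adapter=vmhba68
```

After the rescan, the path count per LUN on the host should match the cluster baseline (8 paths in this case instead of 4).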

Note: It is recommended to put the host in Maintenance mode and clear all old iSCSI sessions by removing the discovery IPs. If required, reboot the host to clear all stale sessions.
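
The cleanup in the note could look like the sketch below. The helper name `clear_iscsi_discovery` is hypothetical, `vmhba68` is the adapter name from the listing above, and the discovery address is a placeholder supplied by the caller. The explicit session logout is an additional step assumed here and is not stated in the note. In an HA/DRS cluster, Maintenance mode is normally entered from vCenter so that running VMs are migrated first.

```shell
# Sketch of the cleanup described in the note.
# Arguments: adapter name and the SendTargets discovery address (ip:port).
clear_iscsi_discovery() {
  adapter="$1"; address="$2"
  # Enter maintenance mode first (running VMs must already be evacuated).
  esxcli system maintenanceMode set --enable true || return 1
  # Remove the dynamic (SendTargets) discovery address.
  esxcli iscsi adapter discovery sendtarget remove --adapter="$adapter" --address="$address" || return 1
  # Assumed extra step: log out the remaining sessions on the adapter.
  esxcli iscsi session remove --adapter="$adapter"
}

# Usage (placeholder address):
#   clear_iscsi_discovery vmhba68 <array-portal-ip>:3260
```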

Reference: