This article addresses concerns regarding the storage stability of VMware ESXi hosts during a rolling upgrade of Cisco UCS Fabric Interconnects (FI). In environments where hosts are configured to boot from iSCSI LUNs with paths striped across redundant fabrics (Fabric A and Fabric B), there is often uncertainty regarding whether a host can survive the loss of 50% of its storage paths without a reboot.
Observed Symptoms during Maintenance (if not optimized):
Hypervisor: VMware ESXi 6.7, 7.x, or 8.x.
Hardware: Cisco UCS B-Series or C-Series Servers.
Storage Configuration: Boot-from-iSCSI LUNs.
Network Topology: Redundant iSCSI paths across isolated VLANs, each pinned to a specific Fabric Interconnect.
The potential for host instability during fabric maintenance is not caused by the loss of the physical path itself, but by a mismatch between hardware recovery times and software storage timeouts.
Path Selection Delay: Depending on the SATP that claims the LUN, VMware may default to the "Fixed" path selection policy. If the "Preferred Path" resides on the fabric being rebooted, the host must fail over to a standby path. If that failover takes longer than the storage timeouts allow, I/O timeouts occur.
Aggressive iSCSI Timers: The default RecoveryTimeout for iSCSI initiators may be shorter than the time required for a Cisco UCS Fabric Interconnect to complete its reset and re-initialize the logical links.
Multipathing Logic: VMware Native Multipathing (NMP) relies on the Pluggable Storage Architecture (PSA). If the host is not configured to "Round Robin" between all available paths, it may not efficiently utilize the surviving fabric's bandwidth, leading to I/O congestion during the upgrade window.
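To see whether the timer mismatch described above applies to a given host, the current iSCSI timer values can be read from the adapter parameters. The helper below is a minimal sketch: the function name is hypothetical, vmhba64 is a placeholder adapter name, and the parsing assumes the parameter name is in the first column and the current value in the second.

```shell
# Print the current value of one iSCSI adapter parameter.
# Reads `esxcli iscsi adapter param get` output on stdin.
# Assumed layout: column 1 = parameter name, column 2 = current value.
iscsi_param_current() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Usage on a host (vmhba64 is a placeholder adapter name):
#   esxcli iscsi adapter param get -A vmhba64 | iscsi_param_current RecoveryTimeout
#   esxcli iscsi adapter param get -A vmhba64 | iscsi_param_current LoginTimeout
```

Comparing these values against the observed Fabric Interconnect recovery time in your environment shows whether the timers need to be extended.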
To ensure non-disruptive operations, the environment must be configured to utilize Round Robin pathing with optimized I/O limits and extended session timers. These settings should be applied via VMware Host Profiles to ensure cluster-wide consistency.
Edit the Host Profile associated with the cluster and navigate to the following paths:
iSCSI Login Timeout:
Path: Storage configuration > iSCSI Software Adapter > iSCSI adapter > Advanced Options
Value: Set LoginTimeout to 30 seconds. This allows sufficient time for ARP resolution and session re-establishment after the fabric returns.
iSCSI Recovery Timeout:
Path: Same as above.
Value: Set RecoveryTimeout to 60 seconds. This keeps the I/O queue "alive" while the fabric transitions, preventing the initiator from tearing down the session and failing outstanding I/O before the fabric returns.
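For a single test host, the same two timer values can also be applied from the ESXi shell before rolling them out through the Host Profile. This is a sketch under stated assumptions: the function name and the ESXCLI override variable are hypothetical conveniences, and vmhba64 is a placeholder (real adapter names come from `esxcli iscsi adapter list`).

```shell
# ESXCLI can be overridden (for example ESXCLI=echo) to preview the commands
# without changing anything on the host.
ESXCLI="${ESXCLI:-esxcli}"

# Apply the article's timer values to one software iSCSI adapter.
# $1 = adapter name, e.g. vmhba64 (placeholder).
set_iscsi_timers() {
    $ESXCLI iscsi adapter param set -A "$1" -k LoginTimeout -v 30 &&
    $ESXCLI iscsi adapter param set -A "$1" -k RecoveryTimeout -v 60
}

# Usage:
#   ESXCLI=echo; set_iscsi_timers vmhba64    # dry run, prints both commands
#   ESXCLI=esxcli; set_iscsi_timers vmhba64  # apply on the host
```

Settings made this way are per host, so the Host Profile remains the mechanism for cluster-wide consistency.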
Ensure that the host rotates I/O across all available paths after every single command:
Path Selection Policy: Set the default policy for the storage array's SATP (Storage Array Type Plugin) to VMW_PSP_RR.
IOPS Limit: Under Device Specific Selection Policy, ensure the IOPS limit is set to 1. Because I/O then rotates across paths after every command, Fabric B is already carrying traffic when a path on Fabric A fails, and subsequent I/O continues there as soon as the failed path is marked dead.
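The two multipathing settings above can be tried on one LUN from the ESXi shell before they are committed to the Host Profile. The sketch below assumes a hypothetical helper name and the same optional ESXCLI override for dry runs; the device ID in the usage line is a placeholder.

```shell
# ESXCLI can be overridden (for example ESXCLI=echo) to preview the commands.
ESXCLI="${ESXCLI:-esxcli}"

# Switch one device to Round Robin and rotate paths after every I/O.
# $1 = device identifier (naa.* ID).
set_rr_iops1() {
    $ESXCLI storage nmp device set --device "$1" --psp VMW_PSP_RR &&
    $ESXCLI storage nmp psp roundrobin deviceconfig set \
        --device "$1" --type iops --iops 1
}

# Usage (placeholder device ID):
#   set_rr_iops1 naa.xxxxxxxxxxxxxxxx
```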
Prior to initiating the firmware reboot in UCS Manager (UCSM), utilize the Fabric Evacuation feature:
Navigate to Equipment > Fabric Interconnects > Fabric Interconnect A.
Select Evacuate Fabric.
Monitor the ESXi hosts. This gracefully drains traffic to the alternate fabric, allowing you to verify host stability before the FI physically reboots.
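One way to monitor a host during the evacuation is to tally its storage path states before and after the fabric is drained. The helper below is a minimal sketch with a hypothetical name; it assumes each path entry in the `esxcli storage core path list` output carries a line of the form "State: <value>".

```shell
# Tally path states from `esxcli storage core path list` output on stdin.
# Prints one "<state> <count>" line per state, sorted by state name.
count_path_states() {
    awk '$1 == "State:" { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on a host:
#   esxcli storage core path list | count_path_states
```

Running it once before the evacuation gives a baseline; during the evacuation, the count of active paths should stay above zero for every boot LUN.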
Execute the following script to ensure all hosts have correctly inherited the settings:
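Get-Cluster "YourCluster" | Get-VMHost | ForEach-Object {
    # Capture the host first: inside Select-Object, $_ is the esxcli result row, not the host
    $vmhost = $_
    $esxcli = Get-EsxCli -VMHost $vmhost -V2
    $esxcli.storage.nmp.device.list.Invoke() |
        Select-Object @{N="Host";E={$vmhost.Name}}, Device, PathSelectionPolicy
}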
Effectiveness: These steps turn storage failover from a "reactive" hardware event into a "proactive" software-managed process. By extending timeouts and forcing frequent path switching (IOPS=1), the ESXi host rides through the loss of one fabric and maintains continuous I/O flow to the iSCSI LUNs over the surviving paths.
To identify your Storage Array Type Plugin (SATP), you need to see which specific driver VMware has assigned to your LUNs. This is critical because Host Profile rules for "Round Robin" and "IOPS=1" are often applied based on the SATP name (e.g., VMW_SATP_ALUA or VMW_SATP_DEFAULT_AA).
Run the following command on any host in the cluster:
esxcli storage nmp device list
Look for the line Storage Array Type. Common values include:
VMW_SATP_ALUA: Used for most modern mid-to-high-end arrays (NetApp, EMC, Pure).
VMW_SATP_DEFAULT_AA: Used for Active-Active arrays.
VMW_SATP_DEFAULT_AP: Used for Active-Passive arrays.
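On hosts with many LUNs, the device list output is long, so it can help to reduce it to one line per device. The filter below is a sketch with a hypothetical function name; it assumes device identifiers start in the first column and attribute lines are indented, as in typical `esxcli storage nmp device list` output.

```shell
# Print "<device> <SATP>" for every device found in
# `esxcli storage nmp device list` output read from stdin.
list_device_satp() {
    awk '
        /^[^ \t]/ { dev = $1 }                          # unindented line: device ID
        /^[ \t]+Storage Array Type: / { print dev, $NF } # SATP name is the last field
    '
}

# Usage on a host:
#   esxcli storage nmp device list | list_device_satp
```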
Once you have the SATP name, follow these steps to ensure every LUN from that vendor uses the optimized settings:
Open Host Profile: Go to Storage configuration > Native Multi-Pathing (NMP) > Path Selection Policy (PSP) configuration.
Add Path Selection Option:
SATP: Enter the name you found above (e.g., VMW_SATP_ALUA).
PSP: Set to VMW_PSP_RR.
Add Device Specific Rule:
Under the Round Robin Action section, ensure the IOPS value is set to 1.
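The Host Profile rule above corresponds to an SATP claim rule, which can be prototyped on a single host from the ESXi shell. This is a sketch, not the article's prescribed method: the function name and ESXCLI override are hypothetical, and the vendor and model strings in the usage line are placeholders that must match what your array reports.

```shell
# ESXCLI can be overridden (for example ESXCLI=echo) to preview the command.
ESXCLI="${ESXCLI:-esxcli}"

# Add a claim rule so LUNs matching a vendor/model string are assigned
# Round Robin with an IOPS limit of 1.
# $1 = SATP name, $2 = vendor string, $3 = model string.
add_rr_rule() {
    $ESXCLI storage nmp satp rule add --satp "$1" --vendor "$2" --model "$3" \
        --psp VMW_PSP_RR --psp-option "iops=1"
}

# Usage (placeholder vendor and model):
#   add_rr_rule VMW_SATP_ALUA "VENDOR" "MODEL"
```

Claim rules are evaluated when a device is claimed, so devices that are already present typically pick up the rule only after a reclaim or a host reboot.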
| Action Item | Method | Target Value |
| --- | --- | --- |
| LoginTimeout | Host Profile (Advanced) | 30 seconds |
| RecoveryTimeout | Host Profile (Advanced) | 60 seconds |
| Path Policy | Host Profile (SATP Rule) | VMW_PSP_RR |
| Switching frequency | Host Profile (PSP Rule) | IOPS = 1 |
| Fabric Maintenance | UCS Manager | Fabric Evacuation: Enabled |
After applying the Host Profile and remediating the hosts, perform a manual rescan. If the hosts are correctly configured, you should be able to run this command and see the number 1 for the "iops" limit on every LUN:
esxcli storage nmp psp roundrobin deviceconfig get -d <naa.ID>
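To avoid checking each LUN by hand, the command above can be looped over every device and reduced to the limit value. The filter below is a sketch with a hypothetical name; it assumes the limit is reported on a line labelled "IOOperation Limit:", which should be confirmed against the output on your ESXi build.

```shell
# Extract the Round Robin I/O operation limit from
# `esxcli storage nmp psp roundrobin deviceconfig get` output on stdin.
rr_iops_limit() {
    awk -F': *' '/IOOperation Limit/ { gsub(/[ \t\r]/, "", $2); print $2 }'
}

# Usage on a host: print "<device> <limit>" for every naa.* device.
#   for d in $(esxcli storage nmp device list | grep '^naa\.'); do
#       echo "$d $(esxcli storage nmp psp roundrobin deviceconfig get -d "$d" | rr_iops_limit)"
#   done
```

Any device that prints a value other than 1 has not picked up the Host Profile rule.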