Virtual Machines unresponsive during physical switch maintenance or reboot

Article ID: 432762


Products

VMware vSphere ESXi

Issue/Introduction

  • A Virtual Machine (VM) configured with a high number of Raw Device Mappings (RDMs) may become unresponsive or "hang" during a physical SAN switch reboot. Although the ESXi host has multiple HBAs and redundant paths, the Guest OS may experience vCPU lockups, heartbeat timeouts, and ultimately become unavailable.
  • In the /vmfs/volumes/<datastore name>/<VM name>/vmware.log of the affected VM, messages similar to the following are observed:

YYYY-MM-DDTHH:MM:SS.655Z In(05) vcpu-## - PVSCSI: scsi#:##: aborting cmd 0x### - "<VM Name>_##.vmdk"
YYYY-MM-DDTHH:MM:SS.845Z In(05) vmx - GuestRpcSendTimedOut: message to toolbox timed out.
YYYY-MM-DDTHH:MM:SS.845Z In(05) vmx - Tools: [AppStatus] Last heartbeat value ##### (last received ##s ago)
YYYY-MM-DDTHH:MM:SS.765Z In(05) vcpu-0 - Tools: Tools heartbeat timeout.

  • In the ESXi /var/run/log/vmkernel.log, the following storage-layer aborts and task management failures may be visible:

YYYY-MM-DDTHH:MM:SS.033Z In(182) vmkernel: cpu##:#######)lpfc: lpfc_handle_status:####: <hba_id> ####: FCP cmd x## failed <#/#> sid x######, did x######, oxid x### iotag x### Abort Requested Host Abort Req
YYYY-MM-DDTHH:MM:SS.763Z Wa(180) vmkwarning: cpu##:#######)WARNING: VSCSI: ####: handle #################(GID:####)(vscsi#:##):WaitForCIF: Issuing reset;  number of CIF:1
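The signatures above can be located quickly with grep. The log paths below are the standard ESXi locations from this article; the search patterns are illustrative, not exhaustive:

```shell
# Look for PVSCSI command aborts and Tools heartbeat timeouts in the VM's log.
grep -E "PVSCSI: .*aborting cmd|Tools heartbeat timeout" \
    /vmfs/volumes/<datastore name>/<VM name>/vmware.log

# Look for FCP command aborts and VSCSI resets in the host's kernel log.
grep -E "Abort Requested|WaitForCIF: Issuing reset" /var/run/log/vmkernel.log
```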

Environment

  • VMware vSphere ESXi
  • Fibre Channel SAN Infrastructure
  • Path Selection Policy (PSP) set to Round Robin (VMW_PSP_RR)

Cause

  • The unresponsiveness occurs when I/Os are trapped in a physical path failure that does not immediately trigger a "Link Down" status at the HBA level (e.g., a failure between an intermediate Top-of-Rack switch and a Core switch).
  • Path Failover Delay: With the default Round Robin (RR) policy, ESXi sends 1000 I/Os down one path before rotating to the next. If a path becomes a "black hole" (a silent failure that returns neither success nor error), those I/Os never complete and the HBA driver must issue aborts for the outstanding commands.
  • Retries on Failing Paths: The failover algorithm may attempt to retry aborts and resets on other paths belonging to the same HBA before switching to a different physical HBA.
  • vCPU Lockup: The cumulative delay of these aborts and resets across dozens of RDM devices can cause the VMX process to block waiting for I/O completion, leading to vCPU starvation and Guest OS unresponsiveness.
  • Switch Recovery Time: If physical switch ports take an extended time (e.g., 20 minutes) to reach an "Online" state after a reboot, the ESXi host may repeatedly transition paths between "Active" and "Dead" states, further delaying stable failover to healthy HBAs.
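To confirm whether a device is subject to this behavior, check its current Path Selection Policy and Round Robin configuration with esxcli; the naa.* identifier below is a placeholder for an actual device:

```shell
# Show the PSP and Round Robin settings for one device; with the
# default policy the configuration reports an IOPS limit of 1000.
esxcli storage nmp device list --device=naa.xxxxxxxxxxxxxxxx

# Show the state of every path to that device (active/dead).
esxcli storage core path list --device=naa.xxxxxxxxxxxxxxxx
```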

Resolution

To resolve this issue and prevent VM unresponsiveness during switch maintenance, perform the following:

1. Infrastructure Redundancy (Primary Fix)

Ensure that the physical fabric layout provides complete end-to-end redundancy. Each Top-of-Rack (TOR) switch should have redundant uplinks to at least two independent Core switches. This ensures that a single Core switch reboot does not isolate an entire HBA's path to the storage array.

2. Optimize Path Selection Policy (Mitigation)

Reduce the impact of a single path failure by decreasing the number of I/Os sent before switching paths. This allows ESXi to detect a failing path faster and move I/O to a healthy HBA:

Adjust the Round Robin IOPS limit from the default of 1000 to 1 for the affected devices.
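One way to apply this on an ESXi host is with esxcli; the loop below is a sketch that targets every naa.* device, so verify device selection against your environment (and your array vendor's recommendations) before running it:

```shell
# Set the Round Robin IOPS limit to 1 on a single device:
esxcli storage nmp psp roundrobin deviceconfig set \
    --type=iops --iops=1 --device=naa.xxxxxxxxxxxxxxxx

# Or apply it to every naa.* device on the host (the command fails
# harmlessly on devices not claimed by VMW_PSP_RR):
for dev in $(esxcfg-scsidevs -c | awk 'NR>1 {print $1}' | grep '^naa\.'); do
    esxcli storage nmp psp roundrobin deviceconfig set \
        --type=iops --iops=1 --device="$dev"
done
```

Note that this setting is per-device and per-host; it does not automatically apply to newly discovered devices unless a matching SATP claim rule is also configured.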

3. Monitor Port Initialization

Verify the configuration of physical switch ports (e.g., Enable "PortFast" or equivalent features where appropriate for edge ports) to ensure they return to a forwarding state promptly after a reboot.
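After switch maintenance completes, it is also worth verifying from the ESXi side that all paths have recovered before declaring the fabric healthy; a quick sketch:

```shell
# Count paths still reported dead; a non-zero count means
# failover has not fully recovered yet.
esxcli storage core path list | grep -c "State: dead"

# Check the link state of each FC HBA.
esxcli storage san fc list
```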

Additional Information

  • While reducing the IOPS limit to 1 improves failover detection time, it is a workaround and not a substitute for proper physical fabric redundancy.
  • Highly sensitive workloads (e.g., MSCS/WSFC clusters, Oracle RAC) may still experience a brief impact or failover if the total time to clear I/O from a failed HBA exceeds the application's timeout thresholds.