Network Loss on ESXi During Physical Switch Reboot or Hardware Failure Without Link State Change

Article ID: 426396

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • During physical network maintenance or a switch reboot, virtual machines hosted on ESXi experience a total loss of connectivity despite having redundant physical uplinks.
  • The ESXi host's vobd.log shows zero NIC-down events during the outage, which is a key indicator of a silent failure.
  • Because the uplink to the affected switch remains in a "Link Up" state from the ESXi perspective, ESXi does not trigger an automatic failover.
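
The vobd.log check above can be sketched as a small script. This is a minimal illustration, not an official tool: the sample log lines, their exact layout, and the helper name link_down_events are assumptions made for the example, so adjust the pattern to match the entries in your own vobd.log.

```python
import re

# Hypothetical vobd.log excerpt; the exact line layout is an assumption.
SAMPLE_LOG = """\
2024-05-01T10:00:01.123Z: [netCorrelator] 100us: [vob.net.vmnic.linkstate.up] vmnic vmnic0 linkstate up
2024-05-01T10:14:59.001Z: [UserLevelCorrelator] 200us: [esx.audit.host.boot] Host has booted.
2024-05-01T10:20:30.456Z: [netCorrelator] 300us: [vob.net.vmnic.linkstate.down] vmnic vmnic3 linkstate down
"""

# Capture the leading timestamp and the vmnic name of any "linkstate.down" event.
LINK_DOWN = re.compile(r"^(?P<ts>\S+?Z): .*linkstate\.down\].*?(?P<nic>vmnic\d+)")


def link_down_events(log_text, start, end):
    """Return (timestamp, vmnic) pairs for link-down events within [start, end].

    Timestamps are compared as strings, which works only because they
    share the same ISO 8601 layout.
    """
    events = []
    for line in log_text.splitlines():
        match = LINK_DOWN.match(line)
        if match and start <= match.group("ts") <= end:
            events.append((match.group("ts"), match.group("nic")))
    return events


# Outage window taken from monitoring (assumed values for this example).
outage = link_down_events(SAMPLE_LOG, "2024-05-01T10:10:00", "2024-05-01T10:15:00")
if not outage:
    print("No NIC-down events during the outage: possible silent failure")
```

An empty result for the outage window, combined with a confirmed loss of connectivity, matches the symptom described in this article.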

 

In the example scenario below, to maintain redundancy, one uplink is connected to the physical switch that will be rebooted and the other is connected to a different physical switch. During the reboot, the uplink connected to the rebooting switch remains in a "Link Up" state from the ESXi perspective. Consequently, the ESXi host does not fail over to the healthy redundant path and continues to send traffic towards the unresponsive physical switch.

Environment

VMware vSphere ESXi

Cause

  • This issue occurs because the physical switch enters a state where it is no longer forwarding traffic but still provides enough electrical signaling or "Keep Alive" to the NIC to maintain a physical link-up status.
  • During a reboot or hardware failure, the switch management and control planes may go offline, but the physical port hardware may not effectively "shut down" or "link down" the connected peer (the ESXi vmnic).
  • When network failure detection in the Teaming Policy is set to Link Status Only, ESXi relies strictly on the network card reporting a "Link Down" event (e.g., loss of light or signal). If the switch port remains physically active but logically dead, ESXi remains unaware of the upstream failure.
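
The difference between the two detection approaches can be illustrated with a toy model. This is a simplified sketch and not ESXi's actual implementation: the Uplink class, the usable_uplinks function, and the policy strings are hypothetical names, and beacon probing is reduced here to a single per-uplink "forwarding" flag.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str
    link_up: bool     # what the NIC reports to ESXi
    forwarding: bool  # whether the upstream switch actually passes frames


def usable_uplinks(uplinks, detection):
    """Return the names of uplinks the host would keep using (toy model)."""
    if detection == "link_status_only":
        # Only the physical link state is considered.
        return [u.name for u in uplinks if u.link_up]
    if detection == "beacon_probing":
        # Simplification: a probe succeeds only if the switch forwards traffic.
        return [u.name for u in uplinks if u.link_up and u.forwarding]
    raise ValueError(f"unknown detection mode: {detection}")


# vmnic0 is attached to the hung switch: link is up, but nothing is forwarded.
team = [
    Uplink("vmnic0", link_up=True, forwarding=False),
    Uplink("vmnic1", link_up=True, forwarding=True),
]

print(usable_uplinks(team, "link_status_only"))
print(usable_uplinks(team, "beacon_probing"))
```

In this model, Link Status Only keeps vmnic0 in service even though its switch is not forwarding, which is the silent failure described above.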

Resolution

  • To resolve the immediate connectivity loss, the "dead" path must be manually shut down so that ESXi is forced to use the alternate uplink which resides on a healthy working switch.
  • If a switch is unresponsive or in a "hung" state but maintains a link light, the network administrator should manually shut down the ports on this switch (if applicable) or physically disconnect the cables to force the ESXi host to trigger a failover.
  • In a nutshell:
    • Coordinate with the Physical Network Team to identify the hung switch.
    • Administratively shut down (Admin Down) the affected switch ports.
    • Verify that ESXi detects the "Link Down" state and fails over to the healthy redundant path.
    • Evaluate the use of LACP or Beacon Probing to automate this detection.

NOTE: As an additional check, verify VLAN trunking on the "healthy" standby uplink. Hosts may remain disconnected if the secondary path is physically up but does not carry the required Management VLAN.

Additional Information

Refer to the following article for more information on implementing: Beacon Probing