ESXi Hosts in "Not Responding" State

Products

VMware vSphere ESXi

VMware vCenter Server

Issue/Introduction

ESXi hosts within a vSphere cluster may appear as "Not responding" within the vCenter Server inventory, resulting in a complete loss of management plane access via the vSphere Client.

You may observe symptoms similar to the following:

Multiple ESXi hosts enter a "Not responding" state in vCenter.
Virtual machines on the affected hosts may remain running and accessible via their data networks, but host management operations cannot be performed through vCenter.
Direct vSphere Client access to the vCenter Server shows the hosts as disconnected.

Troubleshooting & Verification

If this issue occurs, perform the following steps to isolate the communication breakdown. In this scenario, testing will isolate the issue to the physical network layer rather than a host-level service failure:

Gateway Ping Test: Verify if the affected ESXi hosts can reach their local network by pinging their default gateways.
- Result: Success indicates the host's management network interface is active.
vCenter Ping Test: Attempt to ping the vCenter Server VM directly from the affected ESXi hosts.
- Result: Failure indicates a routing or transit issue.
Port Connectivity Test: Run Netcat from the problematic hosts against the vCenter Server VM on each required management port (e.g., 443, 902), for example nc -zv {vcenter-IP} 443.
- Result: Connections will time out with no response.
Packet Capture (PCAP) Analysis: Using the pktcap-uw tool, run concurrent packet captures on both the problematic ESXi host and the host where the vCenter Server VM resides while sending ICMP requests from the affected host to vCenter.
- Result: The PCAP will show traffic successfully egressing the affected ESXi host but failing to arrive at the vCenter host, definitively proving that packets are being dropped in transit.
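
The gateway, vCenter, and port checks above can be sketched as a small POSIX shell script run from the ESXi Shell on an affected host. This is a minimal sketch, not a supported tool: the function names, the diagnosis strings, the vmk0 default, and the PORTS variable are illustrative assumptions. It also assumes vmkping and nc return a non-zero exit status on failure. The gateway and vCenter addresses are supplied as arguments.

```shell
#!/bin/sh
# Sketch: run the gateway, vCenter, and port checks from an affected ESXi host.
# Assumptions: the management VMkernel interface is vmk0 unless VMK is set;
# function names and messages are illustrative only.

VMK="${VMK:-vmk0}"
# TCP ports to test on vCenter; extend as needed, e.g. PORTS="443 902".
PORTS="${PORTS:-443}"

# Return 0 if the target answers ICMP over the management VMkernel interface.
ping_ok() {
    vmkping -I "$VMK" -c 3 "$1" >/dev/null 2>&1
}

# Return 0 if a TCP connection to host $1, port $2 can be opened.
port_ok() {
    nc -z "$1" "$2" >/dev/null 2>&1
}

# Map the three results (each "pass" or "fail") to a likely fault domain.
diagnose() {
    gw="$1"; vc="$2"; ports="$3"
    if [ "$gw" = fail ]; then
        echo "local: host cannot reach its own gateway"
    elif [ "$vc" = fail ] || [ "$ports" = fail ]; then
        echo "transit: gateway reachable, vCenter path failing"
    else
        echo "network ok: investigate host or vCenter services"
    fi
}

# Run all checks: check_path <gateway-ip> <vcenter-ip>
check_path() {
    gw=fail; vc=fail; ports=pass
    ping_ok "$1" && gw=pass
    ping_ok "$2" && vc=pass
    for p in $PORTS; do
        port_ok "$2" "$p" || ports=fail
    done
    diagnose "$gw" "$vc" "$ports"
}
```

After sourcing the script, a call such as check_path 192.0.2.1 192.0.2.50 (placeholder addresses) prints one diagnosis line. In the scenario described in this article, where the gateway answers but vCenter does not, the sketch reports the transit case. The packet capture step is left manual because it requires captures on two hosts at the same time.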

Environment

VMware vSphere ESXi

VMware vCenter Server

Cause

This specific issue is caused by a physical network failure, often induced by a site power outage or similar disruptive event. The power event causes redundant physical firewalls sitting in the transit path between the ESXi hosts and vCenter to enter a continuous flapping state.

Network and firewall flapping prevents reliable packet delivery and TCP session establishment, resulting in dropped packets and the subsequent host disconnects in vCenter.
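
One way to observe flapping from an affected host is to probe vCenter repeatedly and count how often reachability changes state. The following is a minimal sketch under stated assumptions: it uses a TCP connection to port 443 as the probe, and the function names, sample count, and interval are illustrative rather than taken from any VMware tool.

```shell
#!/bin/sh
# Sketch: count reachability state changes to spot a flapping path.
# Assumptions: the probe is nc against TCP 443; names are illustrative.

# Return 0 if TCP port 443 on host $1 accepts a connection.
probe() {
    nc -z "$1" 443 >/dev/null 2>&1
}

# count_flaps <vcenter-ip> <samples> <interval-seconds>
# Prints the number of up/down transitions seen across the samples.
# Each transition is also logged to stderr with a timestamp.
count_flaps() {
    target="$1"; samples="$2"; interval="$3"
    last=""; flaps=0; i=0
    while [ "$i" -lt "$samples" ]; do
        if probe "$target"; then state=up; else state=down; fi
        if [ -n "$last" ] && [ "$state" != "$last" ]; then
            flaps=$((flaps + 1))
            echo "$(date '+%H:%M:%S') $last -> $state" >&2
        fi
        last="$state"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
    echo "$flaps"
}
```

A consistently failing path and a healthy path both report zero transitions, so a non-zero count is what points toward intermittent behavior such as the firewall flapping described here. The timestamps written to stderr can be compared with failover events in the firewall logs.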

Resolution

To resolve the issue and restore management access:

Engage the Networking Team: Work with the physical networking or security team responsible for the transit firewalls or other network hops.
Remove Redundancy Temporarily: Have the networking team temporarily remove redundancy (e.g., fail over to a single active node or disable the flapping secondary node) on the affected physical firewalls.
Stabilize the Path: Forcing traffic through a single, stable path stops the flapping behavior.
Verify Reconnection: Once the network transit is stabilized, the ESXi hosts will automatically re-establish communication with vCenter. Verify that full management access is restored in the vSphere Client and the hosts return to a "Connected" state.
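
Before checking the vSphere Client, it can help to confirm from an affected host that the path to vCenter stays up for several probes in a row. The sketch below assumes that repeated successful TCP connections to port 443 are a reasonable stand-in for a stable path; the function name, thresholds, and messages are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: wait until the vCenter path stays up for several probes in a row.
# Assumptions: TCP 443 reachability stands in for "path is stable";
# names and thresholds are illustrative.

# wait_stable <vcenter-ip> <needed-successes> <max-probes> <interval-seconds>
# Returns 0 once <needed-successes> consecutive probes succeed, 1 otherwise.
wait_stable() {
    target="$1"; needed="$2"; max="$3"; interval="$4"
    streak=0; tries=0
    while [ "$tries" -lt "$max" ]; do
        if nc -z "$target" 443 >/dev/null 2>&1; then
            streak=$((streak + 1))
        else
            streak=0    # any failure resets the run of successes
        fi
        if [ "$streak" -ge "$needed" ]; then
            echo "stable after $((tries + 1)) probes"
            return 0
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    echo "not stable after $max probes"
    return 1
}
```

A successful result only shows that the network path is holding. The host's "Connected" state still needs to be confirmed in the vSphere Client.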

Additional Information

Packet capture on ESXi using the pktcap-uw tool

All ESXi hosts are disconnected or in not responding state after vCenter reboot

All ESXi not responding or disconnecting & reconnecting to their managed vCenter Server

ESXi hosts intermittently show "Not Responding" in vCenter when TCP port 53 is blocked