ESXi hosts in "not responding"/"disconnected" state when host's management default gateway is not reachable

Article ID: 416921

Products

VMware vSphere ESXi

Issue/Introduction

  • ESXi hosts show as "not responding" or "disconnected" in vCenter Server.
  • The default gateway of the ESXi host cannot be pinged from the DCUI while the host is in the disconnected state.
  • The /var/log/vmware/vpxd/vpxd.log file on vCenter Server contains entries similar to the following:
    <YYYY-MM-DD>T<time>.696-05:00 [08128 info 'vpxdvpxdMoHost' opID=########-########-##] [HostMo] host connection state changed to [DISCONNECTED] for host-ID
    <YYYY-MM-DD>T<time>.508-04:00 [04944 error 'vpxdvpxdInvtHostCnx' opID=HB-host-ID@####-########] [VpxdInvtHostSyncHostLRO] FixNotRespondingHost failed for host host-ID, marking host as notResponding
    <YYYY-MM-DD>T<time>.633-04:00 [00812 error 'vpxdvpxdInvtHostCnx' opID=HB-host-ID@####-########] [VpxdInvtHostSyncHostLRO] FixNotRespondingHost failed for host host-ID, marking host as notResponding
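
To check whether a given vCenter Server shows the same pattern, the log can be searched for the two markers in the entries above. The following is a minimal sketch, not a VMware tool: count_host_disconnects is a hypothetical helper name, and the match strings are copied from the sample entries, so they may differ between vCenter Server versions.

```shell
# Count vpxd.log lines that contain the two markers from the sample entries.
# count_host_disconnects is a hypothetical helper name (assumption).
count_host_disconnects() {
    log_file="${1:-/var/log/vmware/vpxd/vpxd.log}"
    # -F treats the pattern as a fixed string, so the brackets need no escaping.
    # grep -c exits non-zero when nothing matches, hence "|| true".
    disconnected=$(grep -c -F 'host connection state changed to [DISCONNECTED]' "$log_file" || true)
    not_responding=$(grep -c -F 'marking host as notResponding' "$log_file" || true)
    echo "disconnected=${disconnected:-0} notResponding=${not_responding:-0}"
}
```

Running count_host_disconnects with no argument reads the default log path; a different file (for example a rotated log) can be passed as the first argument.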

Environment

VMware ESXi

Cause

  • Packet captures taken on the uplink used by the management network show that, while the default gateway of the ESXi host is pinged from the DCUI, ARP requests leave the host's uplink but no ARP replies arrive on it.
  • The following command captures packets on the ESXi host uplink used by the management network (replace vmnicX with the uplink name):
    • pktcap-uw --uplink vmnicX --capture UplinkSndKernel,UplinkRcvKernel -o - | tcpdump-uw -r - -enn
  • For more information on capturing packets on an ESXi host, refer to the following KB: https://knowledge.broadcom.com/external/article/341568/using-the-pktcapuw-tool-in-esxi-55-and-l.html
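
The request/reply imbalance described above can be counted from a saved copy of the decoded capture. This is a minimal sketch under two assumptions: the decoded output was saved to a text file, and tcpdump-uw prints ARP frames in the usual tcpdump form ("Request who-has <ip> tell <ip>" and "Reply <ip> is-at <mac>"). summarize_arp is a hypothetical helper name, and the addresses in the usage note are placeholders.

```shell
# summarize_arp is a hypothetical helper, not part of ESXi (assumption).
# $1: text file holding decoded tcpdump-uw output
# $2: IP address of the management default gateway
summarize_arp() {
    capture_txt="$1"
    gateway_ip="$2"
    # -F matches fixed strings; the trailing space after the address
    # keeps 10.0.0.1 from also matching 10.0.0.10.
    requests=$(grep -c -F "who-has $gateway_ip " "$capture_txt" || true)
    replies=$(grep -c -F "Reply $gateway_ip is-at" "$capture_txt" || true)
    echo "arp_requests=${requests:-0} arp_replies=${replies:-0}"
    if [ "${requests:-0}" -gt 0 ] && [ "${replies:-0}" -eq 0 ]; then
        echo "ARP requests for $gateway_ip were captured but no replies were seen"
    fi
}
```

For example, summarize_arp /tmp/mgmt-arp.txt 10.0.0.1 prints both counts. Requests with no replies are consistent with the cause described in this article. The request count includes any requests for the gateway that other hosts broadcast and this uplink received.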

Resolution

  • To temporarily restore management network connectivity and allow continued operations:

    1. Identify Affected Uplink: Determine which ESXi host uplink (vmnicX) is experiencing the connectivity issue.
    2. Manually Disable Uplink: Via SSH to the ESXi host, execute the following command to bring down the affected uplink. This will force the ESXi host's management network traffic to fail over to another available uplink in its teaming configuration.
      esxcli network nic down -n vmnicX
      (Replace vmnicX with the specific uplink name, e.g., vmnic0).

NOTE: To bring the uplink back up, use the following command:

esxcli network nic up -n vmnicX
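
Because disabling the wrong uplink can cut off the remaining management path, the two commands above can be wrapped in a small guard that validates its arguments and defaults to printing the command instead of running it. This is a sketch only: set_uplink_state and the DRY_RUN variable are hypothetical names, and the check assumes uplinks are named vmnic followed by digits.

```shell
# Hypothetical wrapper around "esxcli network nic up|down" (assumption).
# With DRY_RUN unset or set to 1, the command is only printed.
set_uplink_state() {
    state="$1"   # "up" or "down"
    nic="$2"     # uplink name, e.g. vmnic0
    case "$state" in
        up|down) ;;
        *) echo "usage: set_uplink_state up|down vmnicN" >&2; return 2 ;;
    esac
    # Accept only "vmnic" followed by one or more digits to catch typos
    # such as a literal "vmnicX" left in a pasted command.
    case "$nic" in
        vmnic|vmnic*[!0-9]*) echo "invalid uplink name: $nic" >&2; return 2 ;;
        vmnic[0-9]*) ;;
        *) echo "invalid uplink name: $nic" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "esxcli network nic $state -n $nic"
    else
        esxcli network nic "$state" -n "$nic"
    fi
}
```

Calling set_uplink_state down vmnic0 prints the command for review. Setting DRY_RUN=0 in the SSH session makes the same call execute it. The current link state of each uplink can be reviewed beforehand with esxcli network nic list.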

  • For this workaround to be successful, the following conditions must be met:

    • Uplink Redundancy: The ESXi host's management network must be configured with uplink redundancy (network teaming) on its vSwitch or vDS. This means there must be at least two active uplinks associated with the management network's port group.
    • VLAN Tagging: The VLAN ID used by the ESXi host's management network must be correctly tagged and allowed on the physical switchport to which the redundant/other uplink (vmnicY in the team) is connected. Without correct VLAN tagging, the management traffic cannot traverse the network.
    • Host Network Configuration: The vSwitch or vDS configuration on the ESXi host must be properly set up for the intended teaming policy (e.g., Route based on originating virtual port ID or Route based on IP hash) to leverage the redundant uplinks.
    • Physical Switchport Configuration: The physical switchports connected to the teamed uplinks (vmnicX and vmnicY) should be configured consistently and appropriately (e.g., all in trunk mode, same allowed VLANs, and matching link aggregation: a static EtherChannel if using IP hash teaming on a standard vSwitch, or LACP if using a LAG on a vDS).
  • Engage the physical network team to troubleshoot further, because packets are being dropped in the physical network when the default gateway of the ESXi host is pinged.
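
The uplink redundancy condition above can be checked before an uplink is disabled. The sketch below makes three assumptions: the management network is on a standard vSwitch, its port group is named "Management Network" (the default name, which may differ in a given environment), and esxcli network vswitch standard portgroup policy failover get prints an "Active Adapters:" line with a comma-separated list. count_active_uplinks is a hypothetical helper that reads that output on standard input.

```shell
# Hypothetical helper (assumption): counts the uplinks listed on the
# "Active Adapters:" line of the failover policy output read from stdin.
count_active_uplinks() {
    # Keep only the text after "Active Adapters:", put each adapter on its
    # own line, then count the lines that contain a name.
    sed -n 's/^ *Active Adapters: *//p' \
        | tr ',' '\n' \
        | grep -c '[^[:space:]]' || true
}

# Example use on the ESXi host (port group name is an assumption):
# esxcli network vswitch standard portgroup policy failover get \
#     -p "Management Network" | count_active_uplinks
```

A result lower than 2 means the workaround does not apply as described, because no second active uplink is available to carry management traffic once the affected one is brought down.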

Additional Information