Network partition observed in vSAN cluster

Article ID: 393677

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms

vCenter Server: Multiple virtual machines appear in an "Inaccessible" or "Reduced availability" state.

vSAN Health: The vSAN Skyline Health score drops significantly (e.g., to 39% or lower).

Object Health: Running the "esxcli vsan debug object health summary get" command shows objects in an inaccessible state.

Cluster Status: The host reports a sub-cluster member count of 1, indicating it is isolated in a network partition:

[root@ESX2~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2025-02-07T00:16:32Z
Local Node UUID: ########-####-####-####-########
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: ########-####-####-####-########
Sub-Cluster Backup UUID:
Sub-Cluster UUID: ########-####-####-####-########
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: ########-####-####-####-########
Sub-Cluster Member HostNames: Hostname###
Sub-Cluster Membership UUID: ########-####-####-####-########
Unicast Mode Enabled: true
Maintenance Mode State: OFF
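
When several hosts need to be checked, the member-count test above can be scripted. The following is a minimal Python sketch, assuming the output of "esxcli vsan cluster get" has already been saved as text. The function names parse_cluster_get and is_isolated are illustrative and not part of any VMware tooling, and the sample text is an abbreviated copy of the output shown above.

```python
def parse_cluster_get(text):
    """Turn the 'Key: Value' lines of `esxcli vsan cluster get` into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_isolated(text):
    """Return True when the host lists only itself as a sub-cluster member."""
    fields = parse_cluster_get(text)
    return int(fields.get("Sub-Cluster Member Count") or 0) == 1


# Abbreviated sample based on the output above.
sample = """\
Cluster Information
Enabled: true
Current Local Time: 2025-02-07T00:16:32Z
Local Node State: MASTER
Sub-Cluster Backup UUID:
Sub-Cluster Member Count: 1
Unicast Mode Enabled: true
"""

print(is_isolated(sample))
```

A count of 1 only indicates a partition in a cluster that is expected to contain more than one host, so the result should be compared against the known cluster size.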

Network Tests: Connectivity tests via vmkping result in 100% packet loss in one direction, failing with a "Host is down" error:

[root@ESXi:~] vmkping -I vmk2 10.##.##.#7
PING 10.##.##.#7 (10.##.##.#7): 56 data bytes
sendto() failed (Host is down)

VMkernel Logs: The /var/run/log/vmkernel.log file actively reports CMMDSNet unicast channel failures:

YYYY-MM-DDTHH:MM.SSSZ In(182) vmkernel: cpu42:2119777)CMMDSNet: CMMDSNetSendtoUnicastChannels:1665: Throttled: ########-####-####-####-############: Failed to send to unicast host '##.###.###.###;12321' on iface '##.###.###.###': Host is down.
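
To see which peers are affected, the log entries can be grouped by destination. The sketch below is one possible approach in Python, assuming the message format matches the line above. The function name count_unicast_failures is hypothetical, and the sample lines use synthetic documentation addresses (192.0.2.x), a zeroed UUID, and a placeholder timestamp rather than real log data.

```python
import re
from collections import Counter

# Matches the CMMDSNet message shown above and captures:
# peer address, peer port, local interface address, and the error text.
PATTERN = re.compile(
    r"CMMDSNet.*Failed to send to unicast host '([^;']+);(\d+)' "
    r"on iface '([^']+)': (.*)"
)


def count_unicast_failures(lines):
    """Count CMMDSNet send failures per (peer, local interface) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            peer, _port, iface, _reason = match.groups()
            counts[(peer, iface)] += 1
    return counts


# Synthetic sample lines; all values are placeholders.
prefix = (
    "2025-02-07T00:16:32.000Z In(182) vmkernel: cpu42:2119777)CMMDSNet: "
    "CMMDSNetSendtoUnicastChannels:1665: Throttled: "
    "00000000-0000-0000-0000-000000000000: "
)
sample_lines = [
    prefix + "Failed to send to unicast host '192.0.2.17;12321' on iface '192.0.2.12': Host is down.",
    prefix + "Failed to send to unicast host '192.0.2.17;12321' on iface '192.0.2.12': Host is down.",
    "2025-02-07T00:16:33.000Z In(182) vmkernel: cpu1:1000)unrelated message",
]

for (peer, iface), n in count_unicast_failures(sample_lines).items():
    print(f"{peer} via {iface}: {n} failed sends")
```

Because the messages are throttled, the counts understate the real number of failed sends. They are useful for identifying which peers are unreachable, not for measuring loss.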

 

Environment

VMware vSAN 7.x

VMware vSAN 8.x

 

Cause

This issue is primarily caused by an ARP resolution failure at Layer 2.

The ESXi host sends ARP requests to discover peer MAC addresses, but replies are not returning from the physical network infrastructure. Without a resolved MAC address, the VMkernel cannot encapsulate Layer 3 vSAN traffic into Layer 2 frames. This encapsulation failure results in broken communication and the "Host is down" errors, even though vSAN services are running normally.

Diagnostic Evidence 

ARP Table Analysis: Check the ESXi ARP table. If the host cannot map a peer's IP address to a MAC address, the Mac Address column for that neighbor shows (incomplete).

[root@ESXi:~] esxcli network ip neighbor list
Neighbor     Mac Address        Vmknic    Expiry  State  Type
-----------  -----------------  ------  --------  -----  ----
10.##.##.#3  (incomplete)       vmk0     -38 sec         Invalid
10.##.##.#5  (incomplete)       vmk0     -83 sec         Invalid
10.##.##.#7  (incomplete)       vmk0    -766 sec         Invalid
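
The unresolved entries can also be picked out of the table programmatically. This is a minimal Python sketch, assuming the output of "esxcli network ip neighbor list" has been saved as text in the column layout shown above. The function name unresolved_neighbors is illustrative, and the sample rows use synthetic documentation addresses.

```python
def unresolved_neighbors(text):
    """Return (ip, vmknic) pairs whose Mac Address column is '(incomplete)'."""
    unresolved = []
    for line in text.splitlines():
        cols = line.split()
        # Column order assumed from the output above: Neighbor, Mac Address, Vmknic, ...
        if len(cols) >= 3 and cols[1] == "(incomplete)":
            unresolved.append((cols[0], cols[2]))
    return unresolved


# Synthetic sample; addresses and the resolved row are placeholders.
sample = """\
Neighbor     Mac Address        Vmknic    Expiry  State  Type
-----------  -----------------  ------  --------  -----  ----
192.0.2.13   (incomplete)       vmk0     -38 sec         Invalid
192.0.2.1    00:50:56:aa:bb:cc  vmk0    1100 sec         Unknown
192.0.2.17   (incomplete)       vmk0    -766 sec         Invalid
"""

for ip, vmknic in unresolved_neighbors(sample):
    print(f"No MAC address resolved for {ip} on {vmknic}")
```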

Packet Capture Findings: Network traces (pktcap-uw) will show that ARP requests are successfully exiting the ESXi host, but the corresponding ARP replies from the peer node are never reaching the destination interface.
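
The request/reply comparison described above can be sketched in Python as follows. This assumes the capture has already been decoded into simple records. The record layout (op, sender_ip, target_ip), the function name unanswered_arp_targets, and the sample addresses are all hypothetical and do not reflect pktcap-uw's own output format.

```python
def unanswered_arp_targets(packets, local_ip):
    """List target IPs that local_ip asked about but never got a reply from.

    `packets` is a hypothetical, already-decoded list of dicts with the keys
    'op' ("request" or "reply"), 'sender_ip', and 'target_ip'.
    """
    asked = []        # targets of outgoing requests, in first-seen order
    answered = set()  # peers that sent a reply back to local_ip
    for pkt in packets:
        if pkt["op"] == "request" and pkt["sender_ip"] == local_ip:
            if pkt["target_ip"] not in asked:
                asked.append(pkt["target_ip"])
        elif pkt["op"] == "reply" and pkt["target_ip"] == local_ip:
            answered.add(pkt["sender_ip"])
    return [ip for ip in asked if ip not in answered]


# Synthetic capture: two peers are queried, only one replies.
capture = [
    {"op": "request", "sender_ip": "192.0.2.12", "target_ip": "192.0.2.17"},
    {"op": "request", "sender_ip": "192.0.2.12", "target_ip": "192.0.2.13"},
    {"op": "reply", "sender_ip": "192.0.2.13", "target_ip": "192.0.2.12"},
    {"op": "request", "sender_ip": "192.0.2.12", "target_ip": "192.0.2.17"},
]

print(unanswered_arp_targets(capture, "192.0.2.12"))
```

Any address returned by this check is a peer whose ARP replies never arrived at the capture point, which matches the pattern described in the finding above.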

Isolation Verification: If temporarily moving vSAN traffic to an alternate VMkernel interface (e.g., vmk1) restores connectivity, the vSAN software layer is healthy and the failure is isolated to the physical network path or switchport configuration associated with the primary NIC.

Resolution

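  • Engage the networking team to determine why ARP replies from the peer hosts are not reaching the affected ESXi host, and to correct the physical network path or switchport configuration used by the vSAN VMkernel interface.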

Additional Information