vSAN Cluster partition

Article ID: 391883

Products

VMware vSAN

Issue/Introduction

A vSAN cluster reports multiple critical health alerts in Skyline Health, primarily the vSAN cluster partition and vSAN: Basic (unicast) connectivity check tests. One or more ESXi hosts may appear disconnected from vCenter, and virtual machines may become sluggish or unresponsive.

Technical discovery reveals the following:

  • The command esxcli vsan cluster get shows a Sub-Cluster Member Count of 1, indicating the host is isolated from the rest of the cluster.
  • Skyline Health reports failures for vMotion MTU check and vSAN MTU check (ping with large packet sizes).
  • Standard vmkping attempts between affected hosts over the vSAN VMkernel interface fail with 100% packet loss or sendto() failed (Host is down), even when using small packet sizes.
  • Skyline Health may also display Impending Disk Failure (SMART health alerts) on cache disks or other storage components, which can be exacerbated by the network partition.
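
To tell an MTU problem apart from a full loss of connectivity, it can help to compare a small vmkping with one sized to the interface MTU. The sketch below only does the arithmetic for the largest ICMP payload that fits in a single unfragmented IPv4 packet (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header). The function name max_icmp_payload is hypothetical and not part of ESXi.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
# $1 = MTU configured on the VMkernel interface (for example 1500 or 9000).
max_icmp_payload() {
    mtu="$1"
    echo $((mtu - 28))
}
```

On a host, the result could be passed to vmkping with the don't-fragment flag, for example vmkping -I vmkX -d -s "$(max_icmp_payload 9000)" <Target_IP>, where vmkX and <Target_IP> are placeholders. If small packets also fail, as in this article, the problem is not limited to MTU.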

1. Skyline Health reports failures for the vSAN cluster partition and vSAN: Basic (unicast) connectivity check tests under vSAN cluster > Monitor > Skyline Health.

2. In the ESXi CLI, the following command shows that the physical host or witness is alone (partitioned from the other hosts in the vSAN cluster):
 
esxcli vsan cluster get
 
[root@Host1:~] esxcli vsan cluster get
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
 
3. Connectivity tests between vSAN VMkernel ports fail when run with vmkping:
 
vmkping -I vmkX <Target_IP> (where vmkX is the vSAN VMkernel interface)
 
 
[root@Host1:~] vmkping -I vmkX ***.**.**.3
PING ###.###.###.### (###.###.###.###): 56 data bytes
 
--- ***.**.**.3 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
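
For scripted checks, the membership test from step 2 can be sketched as a small shell helper. This is an illustrative sketch only: the function names vsan_member_count and vsan_is_partitioned are hypothetical, the parsing assumes the "Sub-Cluster Member Count" line appears as in the output above, and the expected member count is a value you supply.

```shell
#!/bin/sh
# Print the Sub-Cluster Member Count found in `esxcli vsan cluster get`
# output supplied on stdin. Prints nothing if the line is absent.
vsan_member_count() {
    awk -F': *' '/Sub-Cluster Member Count/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Exit status 0: the host sees fewer members than expected (partitioned).
# Exit status 1: the host sees at least the expected number of members.
# Exit status 2: the count could not be found in the input.
# $1 = expected number of cluster members (an assumption you provide).
vsan_is_partitioned() {
    expected="$1"
    count="$(vsan_member_count)"
    [ -n "$count" ] || return 2
    [ "$count" -lt "$expected" ]
}
```

Example use on a host, assuming a five-member cluster: esxcli vsan cluster get | vsan_is_partitioned 5 && echo "host is partitioned"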

Environment

VMware vSAN (All versions)

Cause

The partition is caused by a networking misconfiguration: the VMkernel adapter that carries vSAN traffic is tagged with an incorrect VLAN, or it is connected to a physical switch port whose VLAN trunking does not match. This blocks the unicast traffic that vSAN requires for cluster membership and metadata synchronization.

Resolution

To resolve the cluster partition, the VMkernel adapter configuration must be corrected to match the physical network environment:

  1. Identify the vSAN VMkernel Interface: Run esxcli vsan network list to identify which vmk interface is tagged for vSAN traffic, then run esxcfg-vmknic -l to review that interface's port group, IP address, and MTU.
  2. Verify VLAN Tagging: Ensure the VLAN ID assigned to the vSAN VMkernel adapter matches the VLAN configured on the physical switch ports.
  3. Correct IP/VLAN Configuration: If a mismatch is found, reconfigure the VMkernel adapter with the correct IP address and VLAN ID. In the observed case, the issue was fixed by changing the VMkernel adapter IP and aligning it with the respective VLAN.
  4. Validate Connectivity: After reconfiguration, verify that vmkping -I vmkX <Target_IP> is successful and that esxcli vsan cluster get reflects the correct Sub-Cluster Member Count (e.g., 5).
  5. Address Secondary Alerts: Once the network partition is resolved, inspect Skyline Health for remaining issues, such as replacing disks reporting an "IMPENDING FAILURE" status.
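
The validation in step 4 can be repeated against every peer with a short loop. The following is a minimal sketch under stated assumptions: check_vsan_peers and packet_loss are hypothetical helper names, the peer addresses are supplied by you, and the parsing assumes the integer "N% packet loss" summary line shown earlier in this article. PING_CMD is an illustrative override so the loop can be dry-run away from a host.

```shell
#!/bin/sh
# Ping command to use; defaults to vmkping on an ESXi host.
PING_CMD="${PING_CMD:-vmkping}"

# Extract the packet loss percentage from a ping summary on stdin,
# e.g. "3 packets transmitted, 0 packets received, 100% packet loss" -> 100.
packet_loss() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p' | head -n 1
}

# check_vsan_peers <vmk> <ip>...
# Sends 3 pings to each peer over the given VMkernel interface, prints one
# status line per peer, and returns non-zero if any peer shows packet loss
# or the summary line cannot be parsed.
check_vsan_peers() {
    vmk="$1"; shift
    failed=0
    for ip in "$@"; do
        loss="$("$PING_CMD" -I "$vmk" -c 3 "$ip" 2>&1 | packet_loss)"
        if [ "$loss" = "0" ]; then
            echo "OK   $ip"
        else
            echo "FAIL $ip (packet loss: ${loss:-unknown})"
            failed=1
        fi
    done
    return "$failed"
}
```

Example use: check_vsan_peers vmkX <Target_IP_1> <Target_IP_2>, replacing the placeholders with the vSAN VMkernel interface and the vSAN IP addresses of the other hosts. Any FAIL line means the VLAN and IP configuration of that path should be rechecked before moving on to step 5.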

Additional Information