Intermittent vSAN cluster partitions and Object Health alarms due to Witness Traffic Separation (WTS) routing conflicts

Article ID: 437464

Products

VMware vSAN

Issue/Introduction

vSAN cluster partitions occur intermittently, each lasting between 5 and 30 minutes before healing on its own.
vCenter Server displays alarms for vSAN Object Health and Stats DB object degradation.
The vSAN Health score drops temporarily during these events and normalizes once the partition heals.
The environment utilizes a vSAN Stretched Cluster or 2-Node Cluster with Witness Traffic Separation (WTS) configured.
In /var/run/log/vmkernel.log, CMMDS events show nodes intermittently leaving and rejoining the cluster (for example, CMMDS: Node <Witness_UUID> left cluster).
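
To line up the log symptom with the alarm times, it can help to pull only the departure events out of the log. The following is a minimal sketch, assuming the message wording matches the example quoted above (the exact text may differ between builds); cmmds_departures is a hypothetical helper name, not a VMware tool.

```shell
# Print CMMDS "left cluster" lines from a log file.
# Assumption: the message text matches the example quoted in this article.
# $1 = log file (defaults to the live vmkernel log)
cmmds_departures() {
    grep 'CMMDS' "${1:-/var/run/log/vmkernel.log}" | grep 'left cluster'
}

# On the host:
#   cmmds_departures
#   cmmds_departures /var/run/log/vmkernel.log | grep -c .   # number of events
```

The timestamps on the matching lines can then be compared with the times of the vSAN Object Health alarms in vCenter Server.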

Environment

VMware vSAN 8.x Stretched cluster

Cause

This issue occurs due to a "split-brain" routing and interface binding conflict for vSAN Witness traffic.

In a WTS architecture, vSAN Witness traffic is bound to a specific VMkernel interface (e.g., vmk0 / Management). However, if a static route is inadvertently configured for the Witness appliance subnet that points out of a different VMkernel interface (e.g., vmk2 / vSAN Data), a severe network conflict arises:

The Binding: The vSAN service binds Witness heartbeats (CMMDS) to the intended interface (vmk0), generating packets with a Source IP belonging to the Management subnet.
The Route: The static route in the ESXi routing table matches the Witness destination and forces the traffic out of the vmk2 interface onto the vSAN VLAN.
The Drop: The heartbeat packets physically leave the host on the vSAN Data VLAN but carry a Management Source IP. Upstream physical network switches treat this as asymmetric or spoofed traffic and drop the packets, commonly because of Unicast Reverse Path Forwarding (uRPF) checks.

When these heartbeats are dropped, the ESXi host loses connection to the Witness, causing a temporary cluster partition and triggering the associated health alarms.
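
One way to check a host for this mismatch is to compare the interface that carries the witness tag with the interface the static route selects for the Witness subnet. The sketch below assumes the usual layout of the esxcli output (VmkNic Name and Traffic Type fields in the vSAN network listing; Interface as the fourth column of the route listing). The helper names witness_vmk, route_vmk and check_wts_path are hypothetical, and 192.168.50.0 is a placeholder for the Witness network. The route lookup is an exact match on the Network column, not a full longest-prefix lookup.

```shell
# Print the VMkernel interface(s) tagged for witness traffic.
# Reads `esxcli vsan network list` output on stdin.
witness_vmk() {
    awk -F': ' '
        /VmkNic Name:/                    { nic = $2 }
        /Traffic Type:/ && $2 ~ /witness/ { print nic }
    '
}

# Print the egress interface of the route whose Network column equals $1.
# Reads `esxcli network ip route ipv4 list` output on stdin.
route_vmk() {
    awk -v net="$1" '$1 == net { print $4 }'
}

# Compare the tagged interface ($1) with the routed interface ($2).
check_wts_path() {
    if [ -z "$2" ]; then
        echo "no static route for that network; the default gateway applies"
    elif [ "$1" = "$2" ]; then
        echo "consistent: witness traffic is tagged on and routed via $1"
    else
        echo "conflict: tagged on $1 but routed via $2"
    fi
}

# On the host (replace 192.168.50.0 with the Witness network):
#   tagged=$(esxcli vsan network list | witness_vmk)
#   egress=$(esxcli network ip route ipv4 list | route_vmk 192.168.50.0)
#   check_wts_path "$tagged" "$egress"
```

A "conflict" result corresponds to the condition described above: heartbeats sourced from one VMkernel interface but sent out of another.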

Resolution

To resolve this issue, the ESXi routing table must be aligned with the vSAN interface binding so that Witness traffic egresses the host via the correctly tagged VMkernel port.

Note: Modifying Witness routing is non-disruptive to vSAN storage data I/O and will not impact running virtual machines.

Scenario A: Witness traffic is intended to flow over the Management network (Most Common)

If WTS is properly bound to the Management interface (vmk0), you must delete the rogue static route that is hijacking the traffic. This allows the host to natively use the Management network's default gateway.

Run the following command to remove the conflicting route:

esxcli network ip route ipv4 remove -n <Witness_Subnet>/<Mask> -g <Incorrect_Gateway_IP>

Example: esxcli network ip route ipv4 remove -n 192.168.###.0/24 -g 10.0.0.1

Scenario B: Witness traffic is intended to flow over a dedicated routing path (e.g., vmk2)

If the static route is correct and the traffic should be routed out of vmk2, then the Witness tag must be removed from vmk0 and applied to the correct interface.

Remove the witness tag from the incorrect interface:

esxcli vsan network remove -i vmk0

Add the witness tag to the correct interface (if it is not already sharing the vsan tag):

esxcli vsan network ipv4 add -i vmk2 -T witness

Restart the vSAN management agents to bind to the new configuration:

/etc/init.d/vsanmgmtd restart

Once the routing table matches the vSAN interface binding, the upstream switch stops dropping the heartbeat packets and the transient partitions should cease.
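
After either change, cluster membership can be checked from the host. This is a minimal sketch, assuming the Sub-Cluster Member Count field in the output of esxcli vsan cluster get; member_count and cluster_whole are hypothetical helper names, and the expected count (all data hosts plus the Witness, for example 3 in a 2-Node Cluster) must be supplied by the operator.

```shell
# Print the member count reported by `esxcli vsan cluster get` (read on stdin).
member_count() {
    awk -F': ' '/Sub-Cluster Member Count:/ { print $2 + 0 }'
}

# Succeed only when the reported count equals the expected count in $1.
cluster_whole() {
    [ "$(member_count)" = "$1" ]
}

# On the host (3 = two data nodes plus the Witness in a 2-Node Cluster):
#   esxcli vsan cluster get | cluster_whole 3 \
#       && echo "all members present" \
#       || echo "member count differs from expected"
```

Because the partitions are intermittent, a single passing check is not conclusive; repeating it over a period longer than the previous interval between events gives more confidence.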
