VMware vSAN (2-Node or Stretched Cluster)
vSAN Witness Appliance
This issue occurs when vSAN network communication is interrupted between the data nodes and the Witness Appliance. Common causes include:
Missing or incorrect static routes.
MTU size mismatch across the routing path.
vSAN traffic tagging is missing or applied to the incorrect VMkernel interface.
Firewalls or physical network switches blocking vSAN communication.
esxcli vsan network list
Note: When executing network tests or configurations, ensure the correct VMkernel interface is utilized. If a VMkernel interface is explicitly tagged for vSAN Witness traffic, utilize that designated interface and its assigned IP address. In the absence of a dedicated Witness tag, default to the VMkernel interface tagged for standard vSAN traffic.
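The interface selection rule in the note above can be sketched as a small shell helper. This is an illustration only: pick_vsan_vmk is a hypothetical name, and the parsing assumes that the esxcli vsan network list output contains "VmkNic Name:" and "Traffic Type:" fields, with the traffic type reported as witness or vsan. Check the output on your build before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: reads `esxcli vsan network list` output on stdin and
# prints the VMkernel interface to test with.
# Assumption: each interface block has a "VmkNic Name:" line followed by a
# "Traffic Type:" line whose value is exactly "witness" or "vsan".
pick_vsan_vmk() {
    awk -F': *' '
        /VmkNic Name:/  { nic = $2 }
        /Traffic Type:/ {
            # Remember the first interface seen for each traffic type.
            if ($2 == "witness" && witness == "") witness = nic
            if ($2 == "vsan" && vsan == "")       vsan = nic
        }
        END {
            # Prefer the witness-tagged interface, else fall back to vsan.
            if (witness != "")   print witness
            else if (vsan != "") print vsan
            else exit 1
        }
    '
}

# Possible usage on a data node (not executed here):
#   esxcli vsan network list | pick_vsan_vmk
```

The helper exits with a non-zero status when neither tag is found, which usually points to the missing traffic tagging listed among the common causes.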
Run vmkping tests to validate network connectivity between the vSAN data nodes and the Witness Appliance:
vmkping -I vmkX <vSAN IP for witness traffic>
The maximum allowable Round-Trip Time (RTT) latency between the vSAN Witness Appliance and the ESXi data nodes depends on the specific cluster topology and scale. Ensure the network architecture adheres to the following thresholds:
Standard Stretched Clusters (1 to 10 nodes per site): RTT must remain below 200ms.
Large Stretched Clusters (11 to 20 nodes per site): RTT must remain below 100ms.
2-Node / ROBO Clusters: RTT must remain below 500ms.
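The thresholds above can be expressed as a lookup for use when reviewing vmkping results. The sketch below is not part of any VMware tooling: the function names and the topology labels "stretched" and "2node" are hypothetical, and the limits are taken only from the list above.

```shell
#!/bin/sh
# Hypothetical helper: max_witness_rtt_ms TOPOLOGY [NODES_PER_SITE]
# Prints the RTT limit in milliseconds for the given topology.
max_witness_rtt_ms() {
    topology=$1
    nodes=${2:-0}
    case "$topology" in
        2node)
            echo 500 ;;
        stretched)
            if [ "$nodes" -ge 1 ] && [ "$nodes" -le 10 ]; then
                echo 200
            elif [ "$nodes" -ge 11 ] && [ "$nodes" -le 20 ]; then
                echo 100
            else
                echo "unsupported nodes per site: $nodes" >&2
                return 1
            fi ;;
        *)
            echo "unknown topology: $topology" >&2
            return 1 ;;
    esac
}

# Hypothetical helper: rtt_within_limit MEASURED_MS LIMIT_MS
# Succeeds when the measured RTT is strictly below the limit.
# awk is used because vmkping reports fractional milliseconds.
rtt_within_limit() {
    awk -v m="$1" -v l="$2" 'BEGIN { exit !(m < l) }'
}
```

For example, `rtt_within_limit 12.5 "$(max_witness_rtt_ms stretched 8)"` succeeds, while a measured RTT of 250 ms against the same cluster fails the check.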
vmkping -I vmkX <vSAN IP for witness traffic> -d -s <payload minimum>
To accurately validate the network configuration, ICMP echo requests must be sent with the largest payload that fits in a single unfragmented frame, which is the MTU minus the 28 bytes of IPv4 and ICMP header overhead. For networks configured with a standard MTU of 1500, the unfragmented payload size must be 1472 bytes. For networks utilizing jumbo frames (MTU 9000), the unfragmented payload size must be 8972 bytes.
tcpdump-uw -i vmkX | grep 12321
Then execute the following command directly on the Primary and Backup nodes:
tcpdump-uw -i vmkX | grep <witness IP/FQDN>
##### unknown-udp DISCARD FLOW 10.###.###.###[12321]/Shell/17 (10.###.###.###[12321])
##### 10.###.###.###[12321]/Untrusted (10.###.###.###[12321])
Verify that the following advanced settings are set to 0 by executing the commands below:
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs
Reassociate the vSAN Witness Appliance with the cluster.
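For the unfragmented vmkping test described earlier, the value passed to -s follows from the MTU by subtracting the 20-byte IPv4 header and the 8-byte ICMP header. A minimal sketch, assuming IPv4 without IP options; icmp_payload_for_mtu is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: icmp_payload_for_mtu MTU
# Prints the largest ICMP payload that fits in one unfragmented IPv4 packet.
# Assumption: 20-byte IPv4 header (no options) plus 8-byte ICMP header.
icmp_payload_for_mtu() {
    mtu=$1
    echo $(( mtu - 20 - 8 ))
}

icmp_payload_for_mtu 1500   # prints 1472
icmp_payload_for_mtu 9000   # prints 8972
```

If the path MTU is a non-standard value, the same arithmetic gives the payload to use with vmkping -d -s.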
Impact/Risks: Before disabling the stretched cluster, always confirm that all other fault domains are up and accessible.
Once this is verified, and either the firewall is confirmed not to be dropping sessions or blocking port communication, or no firewall is in use between the data sites and the witness, follow these steps:
vdq -iH. If the disk group is still present, manually remove it using the commands below:
esxcli vsan storage remove -u <VSAN Disk Group UUID>
or
esxcli vsan storage remove -s <VSAN Disk Group Cache Identifier>
For ESA enabled clusters:
esxcli vsan storagepool remove -u <VSAN Device UUID>
or
esxcli vsan storagepool remove -d <VSAN Device ID>
Refer to the documents below for more details on additional troubleshooting and steps: