In a vSAN Stretched Cluster configuration (Management and Workload clusters), the witness appliance was observed as partitioned in Skyline Health. This issue can cause Reduced Availability with No Rebuild status across multiple vSAN objects.
Connectivity Tests:
[root@esxihost:~] vmkping -I vmk3 10.##.##.3
PING 10.##.##.3 (10.##.##.3): 56 data bytes
64 bytes from 10.##.##.3: icmp_seq=0 ttl=64 time=0.159 ms
64 bytes from 10.##.##.3: icmp_seq=1 ttl=64 time=0.130 ms
64 bytes from 10.##.##.3: icmp_seq=2 ttl=64 time=0.119 ms
--- 10.##.##.3 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.119/0.136/0.159 ms
[root@esxihost:~] vmkping -I vmk5 10.##.##.11 -s 1472
PING 10.##.##.11 (10.##.##.11): 1472 data bytes
1480 bytes from 10.##.##.11: icmp_seq=0 ttl=53 time=18.261 ms
1480 bytes from 10.##.##.11: icmp_seq=1 ttl=53 time=17.592 ms
1480 bytes from 10.##.##.11: icmp_seq=2 ttl=53 time=17.826 ms
--- 10.##.##.11 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 17.592/17.893/18.261 ms
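Both vmkping tests succeed, which shows only that ICMP reaches the target over the selected VMkernel interface; it does not prove that UDP 12321 is allowed. When the same test is repeated across several hosts, the summary line can be reduced to a single number. The following is a minimal sketch, assuming the summary format shown above; ping_loss_pct is a hypothetical helper name, not an ESXi command.

```shell
# Hypothetical helper: read vmkping output on stdin and print the
# packet-loss percentage taken from the "... % packet loss" summary line.
ping_loss_pct() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Usage on an ESXi host (interface and address are placeholders):
#   vmkping -I vmk3 -c 3 <target-IP> | ping_loss_pct
```

A result of 0 here narrows the problem to something above basic IP reachability, such as port filtering.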
Packet Capture Results
Verification Commands:
tcpdump-uw -i <vmk-interface> | grep <witness-IP>
11:04:05.501279 IP <Datanode IP>.12321 > <WitnessFQDN>.12321: UDP, length 440
11:04:06.501250 IP <Datanode IP>.12321 > <WitnessFQDN>.12321: UDP, length 440
11:04:07.501295 IP <Datanode IP>.12321 > <WitnessFQDN>.12321: UDP, length 440
11:04:08.501281 IP <Datanode IP>.12321 > <WitnessFQDN>.12321: UDP, length 440
pktcap-uw --vmk <vmk-interface> --dir 2 -o - | tcpdump-uw -ner - | grep <witness-IP>
The name of the vmk is <vmk-interface>.
pktcap: The output file is -.
pktcap: No server port specifed, select 16##4 as the port.
pktcap: Local CID 2.
pktcap: Listen on port 16##4.
pktcap: Main thread: 895#####68.
pktcap: Dump Thread: 895####76.
pktcap: The output file format is pcapng.
pktcap: Recv Thread: 895#####60.
pktcap: Accept...
reading from file -pktcap: Vsock connection from port 1##7 cid 2., link-type EN10MB (Ethernet), snapshot length 65##5
11:30:45.513987 00:##:##:##:##:e7 > 00:##:##:##:##:ff, ethertype IPv4 (0x0800), length 482: <Datanode IP>.12321 > <WitnessIP>.12321: UDP, length 440
11:30:46.514027 00:##:##:##:##:e7 > 00:##:##:##:##:ff, ethertype IPv4 (0x0800), length 482: <Datanode IP>.12321 > <WitnessIP>.12321: UDP, length 440
11:30:47.514035 00:##:##:##:##:e7 > 00:##:##:##:##:ff, ethertype IPv4 (0x0800), length 482: <Datanode IP>.12321 > <WitnessIP>.12321: UDP, length 440
11:30:48.514080 00:##:##:##:##:e7 > 00:##:##:##:##:ff, ethertype IPv4 (0x0800), length 482: <Datanode IP>.12321 > <WitnessIP>.12321: UDP, length 440
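In the sample captures above, every packet travels from the data node to the witness on port 12321 and no packet from the witness appears, even though the pktcap-uw capture uses --dir 2 (understood here to capture both directions). A one-way pattern like this is consistent with CMMDS traffic being dropped somewhere on the path. The following is a minimal sketch for tallying direction in tcpdump-uw text output; cmmds_direction_count is a hypothetical helper name, and the witness address must be passed exactly as it is printed in the capture.

```shell
# Hypothetical helper: tally CMMDS packets (UDP 12321) by direction in
# tcpdump-uw text output read from stdin.
# $1 is the witness IP or FQDN exactly as printed in the capture.
cmmds_direction_count() {
    awk -v w="$1" '
        {
            # Endpoints are printed as "<src>.12321 > <dst>.12321:".
            # With -e, the MAC pair also uses ">", but never matches w.
            for (i = 1; i < NF - 1; i++) {
                if ($(i + 1) == ">") {
                    src = $i
                    dst = $(i + 2)
                    sub(/:$/, "", dst)
                    if (dst == (w ".12321")) to_w++
                    if (src == (w ".12321")) from_w++
                }
            }
        }
        END { printf "to_witness=%d from_witness=%d\n", to_w, from_w }
    '
}

# Usage on an ESXi host (placeholders as in the commands above; -c stops
# the capture after 20 packets so the summary is printed):
#   tcpdump-uw -i <vmk-interface> -n -c 20 udp port 12321 | cmmds_direction_count <witness-IP>
```

A from_witness count of 0 alongside a non-zero to_witness count matches the symptom recorded in this article.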
Firewall Verification:
Testing with nc confirmed that UDP port 12321 was blocked, while other ports (for example, TCP 2233) were open.
[root@esxihost] nc -u <witness IP> 12321
[root@esxihost] nc -zv <witness IP> 2233
Connection to <witness ip> 2233 port [tcp/*] succeeded!
Advanced parameters:
[root@esxihost:~] esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 0
[root@esxihost:~] esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs
Value of DOMPauseAllCCPs is 0
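Both parameters read 0 on this host, which rules them out as a cause of the partition. To repeat the check on every host, the value can be extracted from the output line. The following is a minimal sketch, assuming the "Value of <name> is <n>" output format shown above and assuming 0 is the expected value for both options; advcfg_value and report_cmmds_advcfg are hypothetical helper names.

```shell
# Hypothetical helper: extract the integer from a line such as
# "Value of DOMPauseAllCCPs is 0" read from stdin.
advcfg_value() {
    sed -n 's/^Value of .* is \([0-9][0-9]*\)$/\1/p'
}

# Hypothetical wrapper for an ESXi host: print each option and flag any
# value other than 0 (assumed here to be the expected setting).
report_cmmds_advcfg() {
    for opt in IgnoreClusterMemberListUpdates DOMPauseAllCCPs; do
        val=$(esxcfg-advcfg -g "/VSAN/$opt" | advcfg_value)
        if [ "$val" = "0" ]; then
            echo "$opt = 0"
        else
            echo "$opt = '$val' (not 0, review)"
        fi
    done
}
```

Running report_cmmds_advcfg in an SSH session on each data node gives a two-line summary per host.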
VMware vSAN 8.x
Traffic over UDP port 12321, which is used by the vSAN Cluster Monitoring, Membership, and Directory Service (CMMDS) process, was blocked or filtered by a firewall or network security policy.
This blockage prevented heartbeat communication between the data nodes and the witness appliance, resulting in a cluster partition.
This issue can occur when network security devices, such as firewalls or traffic filters, interrupt or misroute vSAN communication over the required UDP ports.
It is recommended to include UDP 12321 in the allowed ports list for vSAN environments to prevent similar partition scenarios.
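After UDP 12321 is allowed end to end, cluster membership can be rechecked from any node. The following is a minimal sketch that reads the member count from esxcli vsan cluster get; it assumes the output contains a "Sub-Cluster Member Count:" line, and vsan_member_count is a hypothetical helper name. The expected count depends on the cluster and is not stated in this article.

```shell
# Hypothetical helper: print the number after "Sub-Cluster Member Count:"
# from "esxcli vsan cluster get" output read on stdin.
vsan_member_count() {
    sed -n 's/^ *Sub-Cluster Member Count: *\([0-9][0-9]*\).*/\1/p'
}

# Usage on an ESXi host:
#   esxcli vsan cluster get | vsan_member_count
```

If the witness has rejoined, the count reported on a data node should include the witness host.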
For more information on the ports required by vSAN, refer to vSAN ports.