Troubleshooting vSAN Witness Node Isolation

Article ID: 315546

Updated On: 03-13-2025

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • A vSAN Witness node (Virtual or Physical) is isolated.

    To confirm witness node isolation, run the following command on the Witness node:

esxcli vsan cluster get


If the output of the command returns:
Sub-Cluster Member Count: 1
Local Node State: STANDALONE

Or 

Sub-Cluster Member Count: 0
Local Node State: Discovery


Then the Witness is confirmed to be isolated from the vSAN Cluster.

  • The vSAN Witness node cannot form a cluster with the remaining vSAN data nodes in a stretched cluster configuration.
  • Pinging the Witness node from a vSAN ESXi host fails.
  • Pinging an ESXi host from the Witness works with small packets but fails with a full-size, unfragmented frame. Use the following vmkping command to test connectivity:

    vmkping -I <witness-vmk#> <vsan-IPaddress> -s <icmp-data-size> -d

    Note: -d sets the 'don't fragment' bit on the IPv4 packet.
    -s sets the ICMP payload size. Use 8972 for an MTU of 9000 and 1472 for an MTU of 1500.
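
The isolation check described in the symptoms above can be scripted. The following is a minimal Python sketch that reads the text printed by esxcli vsan cluster get and applies the two isolation patterns listed in this article. The function names are hypothetical, and the sketch assumes the output consists of "Key: Value" lines as shown above.

```python
# Sketch only: function names are illustrative, not part of any VMware tool.
# Assumes `esxcli vsan cluster get` prints "Key: Value" lines.

# The two (member count, node state) pairs this article treats as isolation.
ISOLATED_STATES = {(1, "STANDALONE"), (0, "DISCOVERY")}


def parse_cluster_get(output):
    """Collect 'Key: Value' lines into a dict; lines without ':' are skipped."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def witness_is_isolated(output):
    """Return True when the output matches one of the isolation patterns."""
    fields = parse_cluster_get(output)
    count_text = fields.get("Sub-Cluster Member Count", "")
    if not count_text.isdigit():
        return False  # field missing or unreadable: do not guess
    state = fields.get("Local Node State", "").upper()
    return (int(count_text), state) in ISOLATED_STATES


if __name__ == "__main__":
    sample = "Sub-Cluster Member Count: 1\nLocal Node State: STANDALONE\n"
    print(witness_is_isolated(sample))
```

The state comparison is case-insensitive because the article shows both STANDALONE and Discovery spellings.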



Environment

VMware vSAN 6.x
VMware vSAN 7.x
VMware vSAN 8.x

 

Cause

  • The Witness node may respond to ping from the vSAN VMkernel ports while port 12321 is blocked or unreachable.
  • This can be caused by a firewall or other network security controls between the Witness and the data nodes.

Resolution

In a vSAN stretched cluster, the Witness plays an important role by keeping the witness components of the vSAN objects available.
  • The VMkernel port on the vSAN Witness must be able to ping the data nodes in the vSAN cluster.
  • Port 12321 must be open between all nodes (in a ROBO or stretched cluster) and must allow bidirectional communication.
To ensure proper TCP/IP communication between the data hosts and the Witness, the following requirements apply:
  • Round-Trip Time (RTT) latency between the Witness and the ESXi hosts must be below 200 ms (500 ms in a ROBO cluster, 100 ms with 11-20 nodes per site).
     
  • A full frame must pass between the ESXi hosts and the Witness without fragmentation. With MTU 1500, this corresponds to an ICMP payload of 1472 bytes.
  • To verify that the payload can be sent, run this command from one of the ESXi hosts:

    # vmkping -I <VSANvmknic> <WitnessIP> -s 1472 -d -c20


    If the ping fails, something on the network is preventing the full payload from travelling between the ESXi host and the Witness node.
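
The payload sizes used with vmkping follow from the MTU: the IPv4 header (20 bytes without options) and the ICMP echo header (8 bytes) are subtracted from it. The Python sketch below shows that arithmetic and assembles the matching vmkping command line. The function names, the vmk name, and the IP address are illustrative placeholders.

```python
# Sketch only: helper names, vmk2, and 192.0.2.10 are illustrative placeholders.

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError("MTU too small to carry an ICMP payload")
    return mtu - overhead


def vmkping_command(vmk, target_ip, mtu, count=20):
    """Build the vmkping command line used in this article for a given MTU."""
    size = icmp_payload_for_mtu(mtu)
    return f"vmkping -I {vmk} {target_ip} -s {size} -d -c {count}"


if __name__ == "__main__":
    print(icmp_payload_for_mtu(1500))  # 1472
    print(icmp_payload_for_mtu(9000))  # 8972
    print(vmkping_command("vmk2", "192.0.2.10", 1500))
```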
  • Verify the unicast table on the ESXi hosts by running the following command:

    # esxcli vsan cluster unicastagent list

    Example:
    ESXI_DATA_NODE # esxcli vsan cluster unicastagent list
    NodeUuid                              IsWitness  Supports Unicast  IP Address     Port  Iface Name  Cert Thumbprint                                              SubClusterUuid
    ------------------------------------  ---------  ----------------  ------------  -----  ----------  -----------------------------------------------------------  --------------
    5d56c452-XXXX-XXXX-XXXX-e4434b76d442          0              true  X.X.X.X       12321              XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX  xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    614c531d-XXXX-XXXX-XXXX-0050569d1702          1              true  Y.Y.Y.Y       12321              YY:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:YY  xxxxxxxx-yyyy-yyyy-yyyy-xxxxxxxxxxxx

    • The Witness appears with the value 1 in the IsWitness column. In the example above, the second row is the vSAN Witness, showing details such as its vSAN UUID and IP address.
       
  • If the Witness does not appear in the unicastagent list, add it with the following steps:
     
    • On the Witness node, run esxcli vsan cluster get and note the local node UUID.
    • On the ESXi hosts, run esxcli vsan cluster unicastagent add -t witness -u <local_UUID> -U true -a <vSAN IP address> -p 12321
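
The unicast table check above can be automated. The Python sketch below parses the text printed by esxcli vsan cluster unicastagent list, reports whether a Witness row is present, and builds the add command from this article when it is missing. The function names are hypothetical. The parser relies only on the first five whitespace-separated columns, because the Iface Name column can be empty, as in the example above.

```python
# Sketch only: function names are illustrative. Only the first five columns
# (NodeUuid, IsWitness, Supports Unicast, IP Address, Port) are read.


def parse_unicast_agents(output):
    """Return one dict per data row of the unicastagent table."""
    agents = []
    for line in output.splitlines():
        tokens = line.split()
        # Data rows have 0 or 1 in the IsWitness column and a numeric port;
        # the header, separator, and prompt lines fail this check.
        if len(tokens) < 5 or tokens[1] not in ("0", "1") or not tokens[4].isdigit():
            continue
        agents.append({
            "uuid": tokens[0],
            "is_witness": tokens[1] == "1",
            "ip": tokens[3],
            "port": int(tokens[4]),
        })
    return agents


def has_witness(output):
    """True when at least one row is flagged as the Witness."""
    return any(agent["is_witness"] for agent in parse_unicast_agents(output))


def witness_add_command(witness_uuid, witness_ip, port=12321):
    """Build the unicastagent add command shown in this article."""
    return ("esxcli vsan cluster unicastagent add -t witness "
            f"-u {witness_uuid} -U true -a {witness_ip} -p {port}")
```

A typical use would be to call has_witness on the captured output from each data host and print witness_add_command only for hosts where it returns False.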
  • To verify the traffic, capture the packets going out of and coming into the Witness and the data node:

    In the example below, the IP address of the Witness VMK is XX.XX.XX.XX and the IP address of the ESXi data node is YY.YY.YY.YY.

    The data node sends packets over port 12321 but receives no response from the Witness node, so the traffic is one-way.

    ESXI_DATA_NODE#   pktcap-uw --vmk vmk0 --dir 2 -o - | tcpdump-uw -ner - |  grep XX.XX.XX.XX  
    The name of the vmk is vmk0.
    pktcap: The output file is -.
    pktcap: No server port specifed, select 18234 as the port.
    pktcap: Local CID 2.
    pktcap: Listen on port 18234.
    pktcap: Main thread: 129712012096.
    pktcap: Dump Thread: 129712547584.
    pktcap: Recv Thread: 129713075968.
    pktcap: Accept...
    pktcap: Vsock connection from port 1026 cid 2.
    reading from file -, link-type EN10MB (Ethernet)
    XX:yy:zz.123456 e4:43:4b:76:d4:22 > aa:bb:cc:dd:dd:ee, ethertype IPv4 (0x0800), length 482: YY.YY.YY.YY.12321 > XX.XX.XX.XX.12321: UDP, length 440        >> Traffic is outgoing from Data node and not getting any response from witness node
    XX:yy:zz.123456 e4:43:4b:76:d4:22 > aa:bb:cc:dd:dd:ee, ethertype IPv4 (0x0800), length 482: YY.YY.YY.YY.12321 > XX.XX.XX.XX.12321: UDP, length 440
    XX:yy:zz.123456 e4:43:4b:76:d4:22 > aa:bb:cc:dd:dd:ee, ethertype IPv4 (0x0800), length 482: YY.YY.YY.YY.12321 > XX.XX.XX.XX.12321: UDP, length 440  

    The same command can be run on the Witness node to check its outgoing and incoming traffic by selecting the appropriate IP address and VMkernel port.
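
When a capture is long, counting packets per direction makes one-way traffic easier to spot. The Python sketch below reads tcpdump-uw lines in the format shown above and counts UDP packets on port 12321 in each direction. The function names and the 192.0.2.x addresses are illustrative assumptions, and the regular expression assumes the "src.port > dst.port: UDP" layout from the example capture.

```python
# Sketch only: names and addresses are illustrative placeholders.
import re

# Matches "<src-ip>.<src-port> > <dst-ip>.<dst-port>: UDP" in a tcpdump line.
FLOW = re.compile(
    r"(?P<src>\S+)\.(?P<sport>\d+) > (?P<dst>\S+)\.(?P<dport>\d+): UDP"
)


def count_directions(lines, data_ip, witness_ip, port=12321):
    """Count UDP packets to the given port in each direction."""
    counts = {"data_to_witness": 0, "witness_to_data": 0}
    for line in lines:
        match = FLOW.search(line)
        if not match or int(match["dport"]) != port:
            continue
        if match["src"] == data_ip and match["dst"] == witness_ip:
            counts["data_to_witness"] += 1
        elif match["src"] == witness_ip and match["dst"] == data_ip:
            counts["witness_to_data"] += 1
    return counts


def is_one_way(counts):
    """True when packets were seen in exactly one direction."""
    sent, received = counts["data_to_witness"], counts["witness_to_data"]
    return (sent > 0) != (received > 0)
```

With a capture like the one above, count_directions would report packets only under data_to_witness, and is_one_way would return True.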
     
  • Static routes must be created on all the ESXi hosts.
    Additional information on how to add static routes to ESXi hosts can be found in the Network Design for Stretched Clusters
     
  • Tagging can be used instead of static routes in vSAN 6.6 and later. See Configure Network Interface for Witness Traffic for more information.
  • Verify the vSAN tags by running the following command on the data node and the Witness node:
     
    • esxcli vsan network list

      ESXI DATA NODE

         ESXI_DATA_NODE # esxcli vsan network list
         Interface
         VmkNic Name: vmk2   >> VMK tagged for vSAN data traffic
         IP Protocol: IP
         Interface UUID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
         Agent Group Multicast Address: 224.2.3.4
         Agent Group IPv6 Multicast Address: xxxx::2:3:4
         Agent Group Multicast Port: 23451
         Master Group Multicast Address: 224.1.2.3
         Master Group IPv6 Multicast Address: xxxx::1:2:3
         Master Group Multicast Port: 12345
         Host Unicast Channel Bound Port: 12321
         Multicast TTL: 5
         Traffic Type: vsan    >>  Used for Data Traffic 

         Interface
         VmkNic Name: vmk0   >> VMK tagged for vSAN witness traffic 
         IP Protocol: IP
         Interface UUID: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
         Agent Group Multicast Address: 224.2.3.4
         Agent Group IPv6 Multicast Address: yyyy::2:3:4
         Agent Group Multicast Port: 23451
         Master Group Multicast Address: 224.1.2.3
         Master Group IPv6 Multicast Address: yyyy::1:2:3
         Master Group Multicast Port: 12345
         Host Unicast Channel Bound Port: 12321
         Multicast TTL: 5
         Traffic Type: witness  >> Used for witness traffic

      vSAN WITNESS NODE

       
       
      vSAN_WITNESS_NODE # esxcli vsan network list 
         VmkNic Name: vmk1  >> VMK tagged for vSAN witness traffic 
         IP Protocol: IP
         Interface UUID: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
         Agent Group Multicast Address: 224.2.3.4
         Agent Group IPv6 Multicast Address: yyyy::2:3:4
         Agent Group Multicast Port: 23451
         Master Group Multicast Address: 224.1.2.3
         Master Group IPv6 Multicast Address: yyyy::1:2:3
         Master Group Multicast Port: 12345
         Host Unicast Channel Bound Port: 12321
         Multicast TTL: 5
         Traffic Type: witness/vSAN  >> Used for witness traffic
       

    • For details on vSAN traffic tagging, refer to the following document: Understanding the vSAN Witness Host – Traffic Tagging
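
The tag check above can be condensed into a per-interface summary. The Python sketch below parses the text printed by esxcli vsan network list and maps each VMkernel interface to its Traffic Type. The function names are hypothetical, and the sketch assumes real command output, without the ">>" annotations that this article adds to its examples.

```python
# Sketch only: function names are illustrative. A new interface record starts
# at each "VmkNic Name" line; the remaining "Key: Value" lines attach to it.


def parse_vsan_network_list(output):
    """Return one dict of fields per interface block."""
    interfaces = []
    for line in output.splitlines():
        # Split on the first colon only, so IPv6 values keep their colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # e.g. the bare "Interface" heading
        key, value = key.strip(), value.strip()
        if key == "VmkNic Name":
            interfaces.append({})
        if interfaces:
            interfaces[-1][key] = value
    return interfaces


def traffic_types_by_vmk(output):
    """Map each VMkernel interface name to its tagged traffic type."""
    return {
        iface["VmkNic Name"]: iface.get("Traffic Type", "")
        for iface in parse_vsan_network_list(output)
    }
```

On a data node configured like the example above, the result would show vmk2 tagged for vsan and vmk0 tagged for witness.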
       
  • Verify the ESXi version of the Witness is the same build as the rest of the cluster, as version mismatch will prevent the Witness node from joining the cluster.

Recommendation

  • The Management (vmk0) and WitnessPg (vmk1) VMkernel interfaces on the vSAN Witness node must not be configured with addresses on the same subnet.
  • Using vmk0 on the Witness for both management and vSAN witness traffic is supported.
    • If only a single subnet is available for the vSAN Witness node, it is recommended to untag vSAN traffic on vmk1 and tag vSAN traffic on vmk0 on the vSAN Witness node.
    • Management traffic and witness traffic then share vmk0, which avoids placing two VMkernel interfaces on the same subnet (a multihoming situation).
  • For more information on multihoming, see Article 318546.



Additional Information

For a detailed understanding of vSAN cluster architecture, refer to the following documents:
  • vSAN 2-Node Cluster Guide
    https://www.vmware.com/docs/vmw-vsan-2-node-cluster-guide

  • vSAN Stretched Cluster Guide
    https://www.vmware.com/docs/vsan-stretched-cluster-guide