vSAN Witness appliance partitioned from the stretched cluster

Article ID: 326958

Products

VMware vSAN

Issue/Introduction

The witness appliance experiences a failure or network partition and then does not rejoin the cluster, even after the stretched cluster is disabled and re-enabled.

Impact/Risks:
Before disabling the stretched cluster, always confirm that all other fault domains are up and accessible.

Environment

VMware vSAN

Resolution

To resolve this issue:
 
  • First, verify that the witness is on exactly the same version of ESXi as the data nodes. If it is not, update the witness using your standard practices for upgrading nodes so that it matches the data nodes (see Upgrading ESXi Hosts). If necessary, redeploy a new witness on the exact build of the data nodes.


  • Check for a duplicate IP address on the witness appliance:
    • Shut down the witness appliance, then ping its IP address to confirm that no other device or system in the environment is using the same IP. With the witness powered off, the ping should get no reply.
      • vmkping -I vmkX x.x.x.x

 

  • Then verify that there are no communication issues between the data nodes and the witness appliance before proceeding. (See Testing VMkernel network connectivity with the vmkping command (1003728) for more details on using vmkping.)
    Note: Only the Primary Node and the Backup Node reach out to the witness over port 12321.

    Verify connectivity by running the vmkping tests below between the vSAN vmks. Neither the witness node nor any other node can join the cluster if the packets are fragmented.
    To test 1500 MTU, run the command: vmkping -I vmkX x.x.x.x -d -s 1472
    To test 9000 MTU, run the command: vmkping -I vmkX x.x.x.x -d -s 8972
        

    In addition, check the connectivity between the witness appliance and the data nodes over port 12321.

    On the witness: tcpdump-uw -i vmkX
    (vmkX is the vmk port that is used for witness traffic)

    If there is connectivity, you will see incoming requests from, and responses to, both the Primary Node and the Backup Node over port 12321. Also verify that the witness is reaching out to the correct IP/FQDN.

    On the Primary/Backup node: tcpdump-uw -i vmkX | grep <witness IP/FQDN>
    (vmkX is the vmk port that is used for witness traffic)

    If this is working correctly, you will see the node reaching out over port 12321 and the response coming back.

    If there is no connectivity over port 12321, resolve that first. If there is connectivity, proceed.

    Traffic on port 12321 must be allowed in both directions so that the vSAN Cluster Monitoring, Membership, and Directory Service (CMMDS) can exchange its heartbeats. This applies to all types of vSAN cluster architecture.
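The payload sizes used in the vmkping tests above (1472 and 8972) follow from the MTU minus the packet headers. The sketch below shows that arithmetic, assuming IPv4 with a 20-byte header (no options) and the 8-byte ICMP echo header. The function names are hypothetical helpers, not part of ESXi, and the second function only prints the command rather than running it.

```shell
#!/bin/sh
# ICMP payload size that exactly fills one unfragmented IPv4 packet.
# Assumption: 20-byte IPv4 header (no options) + 8-byte ICMP echo header.
icmp_payload_for_mtu() {
    mtu=$1
    echo $((mtu - 20 - 8))
}

# Print (do not run) the vmkping command for a given vmk, target, and MTU.
# -d sets the don't-fragment bit, -s sets the payload size, as in the article.
print_mtu_test() {
    vmk=$1
    target=$2
    mtu=$3
    echo "vmkping -I $vmk $target -d -s $(icmp_payload_for_mtu "$mtu")"
}

print_mtu_test vmkX x.x.x.x 1500
print_mtu_test vmkX x.x.x.x 9000
```

If a test with the full payload size fails while a smaller size succeeds, some device in the path probably has a lower MTU than the vmk interface.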



     
  • Once this is verified, follow these steps:

    1. Put the witness appliance in maintenance mode with Ensure Accessibility.

    2. Disable the stretched cluster in the GUI: Configure > vSAN > Fault Domains and Stretched Clusters.

    3. SSH into the witness appliance and manually dismantle the disk group. (See How to manually remove and recreate a vSAN disk group using esxcli (2150567) for details on this process.)

       IMPORTANT: Before dismantling any disk group, ensure that you are on the correct host and targeting the correct disk group. Maintenance Mode with Ensure Accessibility is recommended.

       esxcli vsan storage remove -u <VSAN Disk Group UUID>
       or
       esxcli vsan storage remove -s <VSAN Disk Group Cache Identifier>

    4. Re-enable the stretched cluster, follow the wizard, and have it create new disks to house the witness components.

    This should re-form the cluster and allow the witness components to rebuild on the newly created virtual disks. If this fails, you may need to redeploy the witness appliance.
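To find the value for the disk group removal in step 3, one option is to filter the output of esxcli vsan storage list. The sketch below assumes that output contains lines of the form "VSAN Disk Group UUID: <uuid>" and that awk and sort are available in the ESXi shell. The function name is a hypothetical helper.

```shell
#!/bin/sh
# Read `esxcli vsan storage list` output on stdin and print each distinct
# disk group UUID once.
# Assumption: the output has lines like "   VSAN Disk Group UUID: <uuid>".
list_disk_group_uuids() {
    awk -F': *' '/VSAN Disk Group UUID:/ { print $2 }' | sort -u
}

# Intended use on the witness appliance (not executed here):
#   esxcli vsan storage list | list_disk_group_uuids
```

A witness appliance normally reports a single disk group, so more than one UUID in the output is a reason to double-check that you are on the correct host before removing anything.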

  • A user might accidentally configure the 'witness' traffic type on a vmkernel adapter of the witness appliance itself. In a Witness Traffic Separation (WTS) setup, this tag is to be used on the data nodes only. If the 'witness' traffic type is present on a vmkernel adapter (vmk) of the witness appliance, remove it.

  • If the networking issue is still not fixed after the steps above, deploy a new vSAN witness appliance and replace the existing one.
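The check for a stray 'witness' traffic type on the witness appliance can be sketched as a filter over esxcli vsan network list. This assumes the output lists each interface with a "VmkNic Name:" line followed by a "Traffic Type:" line. The function name is a hypothetical helper, and the sketch only reports interfaces without changing any configuration.

```shell
#!/bin/sh
# Read `esxcli vsan network list` output on stdin and print the name of every
# vmk whose traffic type includes "witness".
# Assumption: each interface block has "VmkNic Name: vmkN" and, later,
# "Traffic Type: <type>".
witness_tagged_vmks() {
    awk -F': *' '
        /VmkNic Name:/                     { nic = $2 }
        /Traffic Type:/ && $2 ~ /witness/  { print nic }
    '
}

# Intended use on the witness appliance (not executed here):
#   esxcli vsan network list | witness_tagged_vmks
```

On the witness appliance this should print nothing. Any vmk it does print is a candidate for the cleanup described above.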

Refer to the documents below for more details.

Additional Information