vSAN Witness appliance partitioned from stretch cluster

Article ID: 326958

Updated On:

Products

VMware vSAN

Issue/Introduction

This article provides proven troubleshooting steps for a vSAN Witness appliance that is partitioned from its stretched cluster. The steps do not require redeploying the entire witness appliance.

Symptoms:
The witness appliance experiences a failure or network partition and then does not rejoin the cluster, even after the cluster is disabled and re-enabled.

Environment

VMware vSAN

Resolution

To resolve this issue:
 
First, verify that the witness is running exactly the same version of ESXi as the data nodes. If it is not, upgrade the witness to match the data nodes using your standard host upgrade practices (see Upgrading ESXi Hosts).
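The version check can be sketched as a small comparison helper. This is a minimal sketch, not an official procedure: it assumes you have collected the output of `vmware -v` from the witness and from a data node, and that each line has the form "VMware ESXi <version> build-<number>". The function names esxi_build and same_esxi_version are hypothetical.

```shell
# Extract the build number from a line in the form
# "VMware ESXi <version> build-<number>" (assumed `vmware -v` output).
# Prints nothing if no build number is found.
esxi_build() {
    printf '%s\n' "$1" | sed -n 's/.*build-\([0-9][0-9]*\).*/\1/p'
}

# Succeed only when both version strings are identical,
# meaning the same release and the same build.
same_esxi_version() {
    [ "$1" = "$2" ]
}
```

For example, store the line from each host in a variable and run: same_esxi_version "$witness_version" "$data_version" || echo "version mismatch". The variable names here are placeholders.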
 
Then verify that there are no communication issues between the data nodes and the witness appliance before proceeding. See Testing VMkernel network connectivity with the vmkping command (1003728) for more details on using vmkping.

To verify connectivity, run vmkping tests between the vSAN VMkernel interfaces (vmks).

    To test 1500 MTU, run the command: vmkping -I vmkX x.x.x.x -d -s 1472
    To test 9000 MTU, run the command: vmkping -I vmkX x.x.x.x -d -s 8972
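The -s values above follow from the MTU: the ICMP payload is the MTU minus 20 bytes for the IPv4 header and 8 bytes for the ICMP header. The sketch below shows that arithmetic and prints the matching vmkping command line for a given MTU. The function names are hypothetical, and the interface and address passed in are placeholders you supply.

```shell
# ICMP payload size for a given MTU:
# MTU - 20 bytes (IPv4 header) - 8 bytes (ICMP header).
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Print (do not run) the vmkping command for an interface, target IP, and MTU.
# -d sets the don't-fragment bit, so an undersized path MTU shows up as a failure.
vmkping_cmd() {
    echo "vmkping -I $1 $2 -d -s $(icmp_payload "$3")"
}
```

Calling vmkping_cmd with vmkX, x.x.x.x, and 9000 prints the same 9000 MTU command shown above. The same helper gives the payload size for any other MTU configured on the witness network.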

In addition, check the connectivity between the witness appliance and the data nodes on port 12321.

On the witness: tcpdump-uw -i vmkX (vmkX is the VMkernel port used for witness traffic)

If there is connectivity, you will see incoming requests and outgoing responses on port 12321.


On the data node: tcpdump-uw -i vmkX | grep <witness IP> (vmkX is the VMkernel port used for witness traffic)

If it is working correctly, you will see the data node reaching out to the witness over port 12321, and the response coming back.


If there is no connectivity over port 12321, resolve that first. If there is connectivity, proceed.

Once this is verified, follow these steps:
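To cut down the capture output, the traffic can be narrowed with a capture filter instead of grep. The sketch below only builds the filter string. It assumes tcpdump-uw accepts standard tcpdump filter expressions, and the function name vsan_capture_filter is hypothetical.

```shell
# Port used between the data nodes and the witness, per the steps above.
VSAN_PORT=12321

# Build a tcpdump filter expression that matches traffic
# to or from one peer IP on the vSAN port.
vsan_capture_filter() {
    printf 'host %s and port %s\n' "$1" "$VSAN_PORT"
}
```

A possible use on the data node, under the same assumption: tcpdump-uw -i vmkX "$(vsan_capture_filter <witness IP>)". On the witness, pass a data node IP instead.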

1. Put the witness appliance in maintenance mode with Ensure Accessibility

2. Disable the stretched cluster in the GUI: Configure > vSAN > Fault Domains and Stretched Clusters

3. SSH into the witness appliance and manually dismantle the disk group. See How to manually remove and recreate a vSAN disk group using esxcli (2150567) for details on this process.
 
IMPORTANT: Before dismantling any disk group, ensure you are on the correct host and targeting the correct disk group. Maintenance Mode with Ensure Accessibility is recommended. 
    
    esxcli vsan storage remove -u <VSAN Disk Group UUID>
    or
    esxcli vsan storage remove -s <VSAN Disk Group Cache Identifier>

   
4. Re-enable the stretched cluster, follow the wizard, and have it create new disks to house the witness components.

This should re-form the cluster and allow the witness components to rebuild on the newly created virtual disks. If this fails, you may need to redeploy the witness appliance.
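For step 3, the disk group UUID to pass to esxcli vsan storage remove -u can be read from the output of esxcli vsan storage list on the witness. The helper below is a minimal sketch that assumes each device entry in that output carries a "VSAN Disk Group UUID:" field. The function name disk_group_uuids is hypothetical.

```shell
# Read `esxcli vsan storage list` output on stdin and print each
# distinct disk group UUID once. Every disk in a group reports the
# same "VSAN Disk Group UUID" value, so duplicates are collapsed.
disk_group_uuids() {
    awk -F': *' '/VSAN Disk Group UUID:/ { print $2 }' | sort -u
}
```

Run it on the witness as: esxcli vsan storage list | disk_group_uuids. Confirm that the host and the UUID are the ones you intend before removing anything.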

Additional Information

Impact/Risks:
Before disabling the stretched cluster, always confirm that all other fault domains are up and accessible.
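One way to check cluster membership from the command line, both before disabling the stretched cluster and after re-enabling it, is to read the member count that esxcli vsan cluster get reports. The sketch below assumes that output contains a "Sub-Cluster Member Count:" line. The function name member_count is hypothetical.

```shell
# Read `esxcli vsan cluster get` output on stdin and print the
# number of hosts the local node currently sees in its cluster.
member_count() {
    awk -F': *' '/Sub-Cluster Member Count:/ {
        gsub(/[ \t\r]/, "", $2)   # drop stray whitespace
        print $2
    }'
}
```

Run it as: esxcli vsan cluster get | member_count. While the witness is partitioned, a data node is expected to report only the data nodes. After the cluster re-forms, the count should include the witness as well.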