3 node vSAN cluster, 1 node down with most VMs in an invalid state following reinstalling ESXi on 1 host

Article ID: 395527

Updated On:

Products

VMware vSAN

Issue/Introduction

In a 3-host vSAN cluster, one host is partitioned after ESXi was reinstalled on it and will not join the cluster. vmkping succeeds between the vSAN vmkernel ports of all nodes, and tcpdump-uw shows bidirectional traffic on port 12321 between the vSAN vmkernel ports of all nodes.
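
The connectivity check described above can be scripted. The following is a minimal sketch, assuming the vSAN vmkernel interface is vmk1 and using placeholder peer addresses; substitute the interface name and vSAN IP addresses from your own environment. The function name check_vsan_peers is hypothetical.

```shell
# Ping each peer vSAN IP from the given vmkernel interface and report the result.
# $1 = vSAN vmkernel interface (assumed here to be vmk1), remaining args = peer IPs.
check_vsan_peers() {
    vmk="$1"
    shift
    for ip in "$@"; do
        if vmkping -I "$vmk" -c 3 "$ip" >/dev/null 2>&1; then
            echo "$ip reachable"
        else
            echo "$ip UNREACHABLE"
        fi
    done
}

# Example (placeholder addresses):
#   check_vsan_peers vmk1 192.0.2.11 192.0.2.12
# To watch the clustering traffic mentioned above on the same interface:
#   tcpdump-uw -i vmk1 port 12321
```

If every peer is reachable and traffic on port 12321 flows in both directions, as in this case, the partition is unlikely to be a network problem.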

The cluster partition is verified by running the following on each host:
esxcli vsan cluster get

Result on a host clustered with others:

Cluster Information
   ...
   Sub-Cluster UUID: ########-####-####-####-############   <--- Matches between partitioned host and rest of cluster
   Sub-Cluster Membership Entry Revision: 3
   Sub-Cluster Member Count: 2

On the partitioned host:

Cluster Information
   ...
   Sub-Cluster UUID: ########-####-####-####-############   <--- Matches between partitioned host and rest of cluster
   Sub-Cluster Membership Entry Revision: 3
   Sub-Cluster Member Count: 1
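
The member count comparison above can be automated. The following is a minimal sketch, assuming the output of esxcli vsan cluster get contains a "Sub-Cluster Member Count" line formatted as shown above; the function names are hypothetical.

```shell
# Extract the "Sub-Cluster Member Count" value from `esxcli vsan cluster get`
# output supplied on stdin.
vsan_member_count() {
    awk -F': *' '/Sub-Cluster Member Count/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Compare the observed member count with the expected number of hosts.
# $1 = expected host count; the command output is read from stdin.
# Returns non-zero when the counts differ.
vsan_partition_check() {
    expected="$1"
    actual="$(vsan_member_count)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual of $expected hosts in this sub-cluster"
    else
        echo "PARTITIONED: ${actual:-unknown} of $expected hosts in this sub-cluster"
        return 1
    fi
}

# Example, run on each host of a 3-host cluster:
#   esxcli vsan cluster get | vsan_partition_check 3
```

In the scenario described here, the two clustered hosts would report 2 of 3 and the partitioned host would report 1 of 3.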

Many virtual machines show an Invalid status in the host client UI.

vSAN data objects show an inaccessible status.

 

Environment

vSAN - all versions

Cause

  • An incorrect shutdown procedure left one host with more recent updates to data components than the other hosts.
    • As a result, the available data components were marked STALE on the clustered hosts and ABSENT for the host partitioned from the cluster.
    • With components both ABSENT and STALE, the data objects lost quorum and were marked inaccessible.

  • Incompatible CMMDS versions between hosts prevented the host with the most recently updated components from joining the cluster, resulting in a cluster partition.
    • The host with the most recent data components had ESXi reinstalled at a lower version than the other hosts, which were unchanged (example: installed at 6.7 GA while the others were on 6.7 Update 2).
    • This version difference made the CMMDS versions incompatible, which prevented the lower-version host from clustering with the higher-version hosts and caused the cluster partition. This is confirmed by the following message in vmkernel.log:
      • WARNING: CMMDS: CMMDSPromoteFormatVersion:423: Failed to promote the node to a format version X beyond its software version Y
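
To look for this message on a host, a search along the following lines can be used. This is a minimal sketch, assuming the log is at /var/log/vmkernel.log; the function name cmmds_promote_failures is hypothetical, and rotated or archived logs are not searched.

```shell
# Print CMMDS format-version promotion failures found in a vmkernel log.
# $1 = log path (assumed default: /var/log/vmkernel.log).
# Exit status is non-zero when no matching line is found.
cmmds_promote_failures() {
    log="${1:-/var/log/vmkernel.log}"
    grep 'CMMDSPromoteFormatVersion' "$log" | grep 'Failed to promote the node'
}

# Example, run on the partitioned host:
#   cmmds_promote_failures
```

Any output on the partitioned host points to the version mismatch described above.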

Resolution

Setting all hosts to identical or compatible CMMDS versions allows the hosts to cluster together.

Update the host at the lower build to a build equal to or higher than that of the other hosts in the cluster, then validate that the cluster has formed.
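
Before and after the update, the build of each host can be compared. The following is a minimal sketch, assuming esxcli system version get prints a "Build:" line such as "Build: Releasebuild-NNNNNNNN"; the function names are hypothetical, and the build numbers in the example are illustrative placeholders. A plain numeric comparison is only a rough guide, so confirm the exact release level of each host as well.

```shell
# Extract the numeric build from `esxcli system version get` output on stdin.
esxi_build() {
    awk -F': *' '/Build:/ { sub(/^[^0-9]*/, "", $2); sub(/[^0-9].*$/, "", $2); print $2; exit }'
}

# Succeeds when build $1 is equal to or higher than build $2.
build_at_least() {
    [ "$1" -ge "$2" ]
}

# Example: compare the reinstalled host with a build noted from another host.
#   this_build="$(esxcli system version get | esxi_build)"
#   other_build=13006603   # placeholder value taken from another cluster host
#   build_at_least "$this_build" "$other_build" || echo "update this host"
```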

Once the cluster has formed, the data is accessible.

Additional Information

See KB 326888 for more information about the "Failed to promote" CMMDS message and the resulting cluster partition.
See KB 327049 for more information about data risk following cluster shutdown and restart without the proper preparations.