vSAN cluster may partition during upgrade if promotion of CMMDS version fails

Article ID: 326888

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • During an upgrade of a vSAN cluster, one or more nodes become partitioned from the rest of the cluster, forming one or more cluster partitions.

  • There is no indication of network communication issues (for example, vmkping between nodes on the vSAN network succeeds).

  • vmkernel.log on the partitioned ESXi hosts shows messages similar to the following, where X is the format version being promoted to and Y is the node's current software version:

    WARNING: CMMDS: CMMDSPromoteFormatVersion:423: Failed to promote the node to a format version X beyond its software version Y

  • A cluster partition may also happen due to minNodeMajorVersion mismatch across the hosts in the vSAN cluster.
  • This issue can also be observed when a new host is added to the vSAN cluster which is on a higher ESXi version than the existing nodes, either with a disk group still present or when creating new disk groups.
  • This issue can also be observed on stretched vSAN clusters using a shared witness environment, if the on-disk format (ODF) version is upgraded on the witness node before the ESXi hosts.
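
The warning quoted in the symptoms can be searched for mechanically. The following is a minimal sketch, not an official tool: it assumes the message text matches the example above, and the function name find_promotion_failures is hypothetical. The line number (423) is matched loosely because it may differ between builds.

```python
import re

# Pattern derived from the warning text quoted in the symptoms above.
# Assumption: only the numeric line number after the function name varies.
PROMOTE_FAILURE = re.compile(
    r"CMMDSPromoteFormatVersion:\d+: Failed to promote the node to a format "
    r"version (?P<target>\S+) beyond its software version (?P<software>\S+)"
)


def find_promotion_failures(log_lines):
    """Return a (target_format_version, software_version) pair per matching line."""
    failures = []
    for line in log_lines:
        match = PROMOTE_FAILURE.search(line)
        if match:
            failures.append((match.group("target"), match.group("software")))
    return failures
```

The function accepts any iterable of strings, so a copy of vmkernel.log opened as a text file can be passed in directly. An empty result only means the warning was not found in the lines supplied; it does not rule out the other causes listed above.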

 

Environment

vSAN 7.x, 8.x, 9.x

 

Cause

Recreating or adding disk groups that use an on-disk format (ODF) version higher than the rest of the cluster causes the CMMDS version on those nodes to be updated. These nodes are then incompatible with the nodes that have not yet been upgraded, because the older nodes cannot use later versions of CMMDS.

Removing the higher ODF disk groups will not resolve the issue as this will not revert the CMMDS version in use.
Setting virsto version to legacy format will not resolve the issue as this will not revert the CMMDS version in use.

 

Whenever a new node is added to the cluster, or an existing node is moved out of the cluster and re-added, minNodeMajorVersion must be the same on all ESXi hosts. If it is not, the mismatch triggers this cluster partition issue and could cause VMs to become inaccessible.

minNodeMajorVersion can be verified on each host from the CLI using the following command:

/usr/lib/vmware/vsan/bin/clom-tool stats | grep "minNodeMajorVersion"
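
Comparing the value across hosts by eye is error-prone on larger clusters. The sketch below assumes the command output has already been collected from each host and that each output contains the key followed by an integer (the exact output format is an assumption, so adjust the pattern if it differs). The host names and both function names are hypothetical.

```python
import re

# Assumption: the grep output contains "minNodeMajorVersion" followed by an
# integer, separated by any non-digit characters such as ':', '=' or quotes.
VERSION_PATTERN = re.compile(r"minNodeMajorVersion\D*(\d+)")


def parse_min_node_major_version(output):
    """Extract the integer value from one host's command output, or None."""
    match = VERSION_PATTERN.search(output)
    return int(match.group(1)) if match else None


def find_mismatched_hosts(outputs_by_host):
    """Group hosts by reported value; return {} when every host agrees."""
    groups = {}
    for host, output in outputs_by_host.items():
        value = parse_min_node_major_version(output)
        groups.setdefault(value, []).append(host)
    return groups if len(groups) > 1 else {}
```

A non-empty result maps each distinct value to the hosts reporting it, which shows which hosts differ from the rest. A key of None means the value could not be parsed from that host's output.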

Resolution

This issue occurs when nodes in the cluster have incompatible CMMDS versions.

To avoid it, do not add, create, or re-create disk groups at a higher format version until all hosts have been upgraded to the same ESXi build. If disk groups have to be recreated during the upgrade, temporarily set the virsto version to the ODF version that all other hosts in the cluster are on, and revert this change once all hosts have been upgraded. See the following articles:

How to format vSAN Disk Groups with a legacy format version

Understanding vSAN on-disk format versions and compatibility

Workaround:

If the problem has already occurred, there are two options:

  • Move forward and update the remaining nodes in the cluster. Note: This may cause further temporary data inaccessibility. Depending on how the cluster partitioned, an updated node may join a partition that does not have the majority of the data accessible, and after the update it can no longer communicate with the lower-version nodes it was clustered with before.

OR

  • Roll back to, or re-install, the previous version of ESXi on the nodes with the higher version of CMMDS. Before considering the rollback option, validate that the lower build version is still available by checking the contents of /altbootbank/boot.cfg.

         If this option is chosen, the disk groups created on the higher ODF version must be removed before the rollback or re-install so that they do not cause further issues.
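
The boot.cfg check for the rollback option can be sketched as follows. This is an illustration only: it assumes boot.cfg is a plain key=value file containing a build= line, and the function name read_boot_cfg_build is hypothetical. It reports what the file says and does not confirm that a rollback will succeed.

```python
def read_boot_cfg_build(boot_cfg_text):
    """Return the value of the build= line in boot.cfg text, or None if absent."""
    for line in boot_cfg_text.splitlines():
        # Split on the first '=' only, since values may themselves contain '='.
        key, sep, value = line.partition("=")
        if sep and key.strip() == "build":
            return value.strip() or None
    return None
```

On a host, the text would come from reading /altbootbank/boot.cfg. A result of None means no usable build entry was found, in which case re-installing the previous version is the remaining choice within this option.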