vSAN Cluster Partition -- One or more Hosts partitioned -- Not shown on Leader Host via "esxcli vsan cluster get"
Article ID: 387623


Products

VMware vSAN 7.x
VMware vSAN 8.x

Issue/Introduction

Key Symptoms:

  • On an existing Cluster, one or more Host(s) are partitioned from the Cluster. This is detected by running "esxcli vsan cluster get" on all vSAN Hosts
  • "esxcli vsan cluster get" on the Leader Host does not confirm a Cluster Partition (= the Member Count accounts for all existing vSAN Hosts, incl. any Witness Hosts)
  • Running the ls command against the vSAN datastore returns empty output, even on Hosts that do not report as partitioned
  • The Preferred Fault Domain entry on the Leader Host is empty (= check with the following command, substituting the "Sub-Cluster UUID" reported by "esxcli vsan cluster get" on the Leader Host):
vsish -e get /vmkModules/cmmds/subClusters/SubClusterUUID/preferredFD
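The preferredFD check above can be scripted. The sketch below is illustrative only: it parses a hard-coded sample of "esxcli vsan cluster get" output (the UUID and member count are placeholders, not real host data) to build the vsish path.

```shell
#!/bin/sh
# Illustrative sketch: extract the Sub-Cluster UUID from sample
# "esxcli vsan cluster get" output and build the vsish preferredFD path.
# The sample text and UUID below are placeholders, not real host output.
sample='   Sub-Cluster UUID: e40f0888-aaaa-bbbb-cccc-ddddeeeeffff
   Sub-Cluster Member Count: 6'

uuid=$(printf '%s\n' "$sample" | sed -n 's/^ *Sub-Cluster UUID: //p')
echo "/vmkModules/cmmds/subClusters/$uuid/preferredFD"
```

On a live Leader Host, the sample would instead come from running "esxcli vsan cluster get" directly; an empty value returned by the vsish get indicates the symptom described above.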

 

 

Potential additional Symptoms (= one or more of the following): 

  • Web Client and/or CLI commands are returning the error "Too many outstanding requests"
  • Web Client might present wrong information. Examples: Datastore size is zero, no Disk Groups claimed
  • Not possible to browse Datastore (= An error occurred. Please try again)
  • vSAN related commands are hanging or returning a Timeout error
  • vSAN Host(s) showing high CPU usage
  • esxcli vsan debug disk overview: No disk issues
  • esxcli vsan debug: All Objects healthy. No Resync
  • Unicast agent list: No missing Hosts
  • IgnoreClusterMemberListUpdates is set to default (= "0")
  • vmkping: No Packet loss
  • One or more vSAN Host(s) are stuck booting up. vmkernel.log shows a flood of repeating messages about joining the Cluster and losing the Leader Node

 

 

Examples From the Logs:

/var/run/log/vmkernel.log
Heartbeat timeouts related to the Host(s) partitioned from the Cluster:
cpu47:2099098)CMMDS: LeaderCheckNode:9432: e40f0888-###-###-###-###: Lost contact with 61f48823-###-###-###-###
cpu47:2099098)CMMDS: CMMDSHeartbeatCheckHBLogWork:786: e40f0888-###-###-###-###: Check node returned Failure for node 61f48823-###-###-###-### count 10 unhealthy 0
cpu47:2099098)CMMDS: CMMDSStateDestroyNode:706: e40f0888-###-###-###-###: Destroying node 61f48823-###-###-###-###: Heartbeat timeout
cpu47:2099098)CMMDS: LeaderRemoveNodeFromMembership:7965: e40f0888-###-###-###-###: Removing node 61f48823-###-###-###-### (vsanNodeType: data) from the cluster membership
cpu47:2099098)CMMDS: CMMDSClusterDestroyNodeImpl:264: Destroying node 61f48823-###-###-###-### from the cluster db. Last HB received from node
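When skimming large vmkernel.log files, entries like the above can be isolated with a short filter. The sketch below runs against an embedded sample line (UUIDs are trimmed placeholders) rather than a live log:

```shell
#!/bin/sh
# Illustrative sketch: extract the UUID of nodes the Leader destroyed for
# heartbeat timeout. The embedded log line is a trimmed placeholder sample.
log='cpu47:2099098)CMMDS: CMMDSStateDestroyNode:706: e40f0888-aaaa: Destroying node 61f48823-bbbb: Heartbeat timeout'

destroyed=$(printf '%s\n' "$log" \
  | grep 'Heartbeat timeout' \
  | sed -n 's/.*Destroying node \([^:]*\):.*/\1/p')
echo "$destroyed"
```

Against a real host, the embedded sample would be replaced by reading /var/run/log/vmkernel.log directly.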
 
 
The partitioned Host(s) then try to rejoin the Cluster but fail during the Leader's sanity check.
The Leader Host initiates the Node Destruction procedure after identifying an existing CMMDS_TYPE_NODE entry for the Host.
(= That entry should actually have been deleted during this procedure, but that process was never completed)
cpu47:2099098)CMMDSNet: CMMDSNetGrpMsgFilter:2379: e40f0888-###-###-###-###: Creating node 61f48823-###-###-###-### from host unicast channel.
cpu47:2099098)CMMDSNet: CMMDSNetGrpMsgFilter:2399: e40f0888-###-###-###-###: Recv first HB msg (seq num = 9) from node 61f48823-###-###-###-### at time 26370891511245470
cpu47:2099098)CMMDS: LeaderAddNodeToMembership:7883: e40f0888-###-###-###-###: Added node 61f48823-###-###-###-### (vsanNodeType: data) to the cluster membership
cpu47:2099098)CMMDS: LeaderSanityCheckRejoinNode:3759: e40f0888-###-###-###-###: Destroy node (61f48823-###-###-###-###) because its CMMDS_TYPE_NODE entry (uuid: 61f48823-###-###-###-###, owne$
cpu47:2099098)CMMDS: CMMDSStateDestroyNode:706: e40f0888-###-###-###-###: Destroying node 61f48823-###-###-###-###: Protocol violation
cpu17:2097842)CMMDS: CMMDSLogStateTransition:1917: e40f0888-###-###-###-###: Transitioning(61f48823###-###-###-###) from Invalid to Discovery: (Reason: State machine initialization)
cpu50:2099036)CMMDS: CMMDSLogStateTransition:1917: e40f0888-###-###-###-###: Transitioning(61f48823-###-###-###-###) from Discovery to Rejoin: (Reason: Found a leader node)
cpu50:2099036)CMMDS: CMMDSLogStateTransition:1917: e40f0888-###-###-###-###: Transitioning(61f48823-###-###-###-###) from Rejoin to Discovery: (Reason: Failed to receive from node)
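The rejoin flap can be quantified by counting how often the affected node bounced from Rejoin back to Discovery. This sketch counts over an embedded placeholder sample, not a live log:

```shell
#!/bin/sh
# Illustrative sketch: count how often a node fell back from Rejoin to
# Discovery. The embedded log lines are placeholder samples.
log='cpu50:2099036)CMMDS: CMMDSLogStateTransition:1917: Transitioning(61f48823-bbbb) from Discovery to Rejoin: (Reason: Found a leader node)
cpu50:2099036)CMMDS: CMMDSLogStateTransition:1917: Transitioning(61f48823-bbbb) from Rejoin to Discovery: (Reason: Failed to receive from node)
cpu50:2099036)CMMDS: CMMDSLogStateTransition:1917: Transitioning(61f48823-bbbb) from Rejoin to Discovery: (Reason: Failed to receive from node)'

flaps=$(printf '%s\n' "$log" | grep -c 'from Rejoin to Discovery')
echo "$flaps"
```

A steadily climbing count for the same node UUID in vmkernel.log matches the rejoin loop described above.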


vsantraces on Leader Host:
Reveals that the sanity check failure causes an instant node destruction, making it impossible for the partitioned Host(s) to properly rejoin the Cluster.
The Host(s)' continuous failure to reintegrate into the Cluster is seen via:
[205775720] [cpu9, node1] [] CMMDSTraceDestroyNode:704: {'uuid': '61f48823-###-###-###-###', 'reason': 'Protocol violation', 'node': 0x43275d80b2e0, 'subClusterUuid': 'e40f0888-###-###-###-###'}
[205775721] [cpu9, node1] [] CMMDSTraceLeaderRemovedMember:7962: {'uuid': '61f48823-###-###-###-###', 'node': 0x43275d80b2e0, 'subClusterUuid': 'e40f0888-###-###-###-###'}
 
 
/var/run/log/vmkernel.log
Indicates that the Leader Host is rejecting takeover requests from the Backup Host.
Reason: Leader Host's preferred fault domain entry is set to NULL
cpu21:2099098)CMMDS: CMMDSLeaderlikeBackupShouldTakeOverCluster:1594: e40f0888###-###-###-###: Backup takeover invalid. PreferredFD is NULL
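A quick way to confirm this specific failure mode in a log bundle is to search for the rejection string. The sketch below checks an embedded placeholder line; on a live host the input would be /var/run/log/vmkernel.log:

```shell
#!/bin/sh
# Illustrative sketch: flag the backup-takeover rejection caused by a NULL
# preferredFD. The log line is a placeholder sample, not real output.
log='cpu21:2099098)CMMDS: CMMDSLeaderlikeBackupShouldTakeOverCluster:1594: e40f0888-aaaa: Backup takeover invalid. PreferredFD is NULL'

if printf '%s\n' "$log" | grep -q 'Backup takeover invalid. PreferredFD is NULL'; then
  result="preferredFD-null"
else
  result="not-found"
fi
echo "$result"
```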
 
 
 

Environment

VMware vSAN 7.x
VMware vSAN 8.x

Cause

The following sequence of events causes the issue:

The Leader Host's preferred fault domain entry is set to NULL.
The Backup Leader Host repeatedly attempts to take over, but the Leader rejects the requests as invalid due to the missing preferredFD.
 
A CMMDS race condition during takeover stalls the Leader Host, preventing the checkpoint queue from draining.
Updates from the Leader Host to all other Hosts halt, though heartbeats continue to maintain cluster communication.
 
The CMMDS workblock race condition prevents the sendCheckpoint workblock from reactivating,
leading to takeover request failures, updates not being checkpointed, and stale node entries persisting in the CMMDS database.
 
LeaderSanityCheckRejoinNode rejects rejoining nodes due to these persistent stale CMMDS_TYPE_NODE entries.
The Leader Host misidentifies them as active, causing Cluster inconsistencies, blocked rejoins, and service disruptions.

 

Resolution

 
1.) Place the Leader Host (= CMMDS Master) into Maintenance Mode
2.) Forcefully abdicate the Leader role by running the following on the Leader Host:
6.7U3 P04 - 7.0U1:  # vsish -e set /vmkModules/cmmds/forceTransition abdicateMaster
7.0U2 & higher:     # vsish -e set /vmkModules/cmmds/forceTransition abdicateLeader
 
3.) /var/run/log/vmkernel.log confirms that the Backup Leader assumes the Leader role and that any previously partitioned Host(s) now join the Cluster:
cpu47:2099098)CMMDS: CMMDSLogStateTransition:1917: e40f0888-###-###-###-###: Transitioning(61f48823-###-###-###-###) from Discovery to Rejoin: (Reason: Found a leader node)
cpu22:2099098)CMMDS: CMMDSLogStateTransition:1917: e40f0888-###-###-###-###: Transitioning(61f48823-###-###-###-###) from Rejoin to Agent: (Reason: The local node has finished rejoining)

4.) "esxcli vsan cluster get" confirms that the Cluster is fully formed (= all Hosts have joined the Cluster)
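Step 4 can be verified mechanically by comparing the reported member count against the expected number of Hosts (data nodes plus any Witness). The sketch parses a hard-coded sample; EXPECTED and the sample line are placeholders:

```shell
#!/bin/sh
# Illustrative sketch: compare the Sub-Cluster Member Count from sample
# "esxcli vsan cluster get" output with the expected Host count.
# EXPECTED and the sample line are placeholders, not real data.
EXPECTED=6
sample='   Sub-Cluster Member Count: 6'

count=$(printf '%s\n' "$sample" | sed -n 's/^ *Sub-Cluster Member Count: //p')
if [ "$count" -eq "$EXPECTED" ]; then
  verdict="cluster fully formed"
else
  verdict="partition persists: $count of $EXPECTED members"
fi
echo "$verdict"
```

On a live host, the sample would come from running the command on the new Leader after the abdication.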

        

A bugfix is available via upgrade:
Affected Cluster is on 8.x:     Upgrade to vSAN 8.0U2b or later
Affected Cluster is on 7.x:     Upgrade to vSAN 7.0U3 P10