/var/run/log/vmkernel.log
Heartbeat timeouts for the Host(s) partitioned from the Cluster:
cpu47:2099098)CMMDS: LeaderCheckNode:9432: e40f0888-###-###-###-###: Lost contact with 61f48823-###-###-###-###
cpu47:2099098)CMMDS: CMMDSHeartbeatCheckHBLogWork:786: e40f0888-###-###-###-###: Check node returned Failure for node 61f48823-###-###-###-### count 10 unhealthy 0
cpu47:2099098)CMMDS: CMMDSStateDestroyNode:706: e40f0888-###-###-###-###: Destroying node 61f48823-###-###-###-###: Heartbeat timeout
cpu47:2099098)CMMDS: LeaderRemoveNodeFromMembership:7965: e40f0888-###-###-###-###: Removing node 61f48823-###-###-###-### (vsanNodeType: data) from the cluster membership
cpu47:2099098)CMMDS: CMMDSClusterDestroyNodeImpl:264: Destroying node 61f48823-###-###-###-### from the cluster db. Last HB received from node
The partitioned Host(s) then try to rejoin the Cluster but fail the Leader's sanity check.
The Leader Host initiates the Node Destruction procedure after identifying an existing CMMDS_TYPE_NODE entry for the Host.
(That entry should actually be deleted during this procedure, but the deletion has not completed.)
cpu47:2099098)CMMDSNet: CMMDSNetGrpMsgFilter:2379: e40f0888-###-###-###-###: Creating node 61f48823-###-###-###-### from host unicast channel.
cpu47:2099098)CMMDSNet: CMMDSNetGrpMsgFilter:2399: e40f0888-###-###-###-###: Recv first HB msg (seq num = 9) from node 61f48823-###-###-###-### at time 26370891511245470
cpu47:2099098)CMMDS: LeaderAddNodeToMembership:7883: e40f0888-###-###-###-###: Added node 61f48823-###-###-###-### (vsanNodeType: data) to the cluster membership
cpu47:2099098)CMMDS: LeaderSanityCheckRejoinNode:3759: e40f0888-###-###-###-###: Destroy node (61f48823-###-###-###-###) because its CMMDS_TYPE_NODE entry (uuid: 61f48823-###-###-###-###, owne$
cpu47:2099098)CMMDS: CMMDSStateDestroyNode:706: e40f0888-###-###-###-###: Destroying node 61f48823-###-###-###-###: Protocol violation
cpu17:2097842)CMMDS: CMMDSLogStateTransition:1917: e40f0888-###-###-###-###: Transitioning(61f48823###-###-###-###) from Invalid to Discovery: (Reason: State machine initialization)
cpu50:2099036)CMMDS: CMMDSLogStateTransition:1917: e40f0888-###-###-###-###: Transitioning(61f48823-###-###-###-###) from Discovery to Rejoin: (Reason: Found a leader node)
cpu50:2099036)CMMDS: CMMDSLogStateTransition:1917: e40f0888-###-###-###-###: Transitioning(61f48823-###-###-###-###) from Rejoin to Discovery: (Reason: Failed to receive from node)
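The pattern above (one "Heartbeat timeout" destruction followed by "Protocol violation" destructions for the same node) can be counted with a short script. The following is a minimal sketch, not a supported tool: it assumes the vmkernel.log lines have the same layout as the excerpts above, and the function name count_destroy_reasons is hypothetical.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpts above, for example:
#   ...CMMDS: CMMDSStateDestroyNode:706: <cluster uuid>: Destroying node <node uuid>: <reason>
# Assumption: the log layout matches the excerpts in this article.
DESTROY_RE = re.compile(
    r"CMMDSStateDestroyNode:\d+: \S+: Destroying node (?P<node>\S+): (?P<reason>.+)$"
)


def count_destroy_reasons(lines):
    """Return a Counter keyed by (node UUID, destroy reason)."""
    counts = Counter()
    for line in lines:
        match = DESTROY_RE.search(line.rstrip())
        if match:
            counts[(match["node"], match["reason"])] += 1
    return counts


# Sample lines copied from the excerpts above (UUIDs are redacted in the article).
sample = [
    "cpu47:2099098)CMMDS: CMMDSStateDestroyNode:706: e40f0888-###-###-###-###: "
    "Destroying node 61f48823-###-###-###-###: Heartbeat timeout",
    "cpu47:2099098)CMMDS: LeaderAddNodeToMembership:7883: e40f0888-###-###-###-###: "
    "Added node 61f48823-###-###-###-### (vsanNodeType: data) to the cluster membership",
    "cpu47:2099098)CMMDS: CMMDSStateDestroyNode:706: e40f0888-###-###-###-###: "
    "Destroying node 61f48823-###-###-###-###: Protocol violation",
]

result = count_destroy_reasons(sample)
for (node, reason), count in sorted(result.items()):
    print(f"{node}  {reason}  x{count}")
```

To run it against a real log, pass an open file object for /var/run/log/vmkernel.log (or a copy of it) instead of the sample list. A node that shows a growing "Protocol violation" count after a "Heartbeat timeout" is consistent with the rejoin loop described here.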
vsantraces on Leader Host:
Reveals that the sanity check failure causes immediate node destruction, making it impossible for the partitioned Host(s) to rejoin the Cluster properly.
The repeated failure of the Host(s) to rejoin the Cluster is seen via:
[205775720] [cpu9, node1] [] CMMDSTraceDestroyNode:704: {'uuid': '61f48823-###-###-###-###', 'reason': 'Protocol violation', 'node': 0x43275d80b2e0, 'subClusterUuid': 'e40f0888-###-###-###-###'}
[205775721] [cpu9, node1] [] CMMDSTraceLeaderRemovedMember:7962: {'uuid': '61f48823-###-###-###-###', 'node': 0x43275d80b2e0, 'subClusterUuid': 'e40f0888-###-###-###-###'}
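The decoded trace records above carry their fields in a dictionary-like payload, so the destroy reason can be extracted programmatically. The following is a rough sketch under the assumption that every decoded vsantraces line has the same layout as the two records shown; the function name parse_trace is hypothetical, and other trace types may not parse this way.

```python
import ast
import re

# Assumed layout, taken from the two records above:
#   [<sequence>] [<cpu, node>] [] <function>:<line>: {<payload>}
TRACE_RE = re.compile(
    r"^\[(?P<seq>\d+)\] \[(?P<cpu>[^\]]*)\] \[[^\]]*\] "
    r"(?P<func>\w+):(?P<line>\d+): (?P<payload>\{.*\})$"
)


def parse_trace(line):
    """Parse one decoded trace line; return None if it does not match."""
    match = TRACE_RE.match(line.strip())
    if match is None:
        return None
    # The payload in the records above is a valid Python literal
    # (quoted strings plus a hexadecimal integer), so literal_eval can read it.
    payload = ast.literal_eval(match["payload"])
    return {"seq": int(match["seq"]), "func": match["func"], "payload": payload}


# Record copied from the excerpt above (UUIDs are redacted in the article).
sample = (
    "[205775720] [cpu9, node1] [] CMMDSTraceDestroyNode:704: "
    "{'uuid': '61f48823-###-###-###-###', 'reason': 'Protocol violation', "
    "'node': 0x43275d80b2e0, 'subClusterUuid': 'e40f0888-###-###-###-###'}"
)

record = parse_trace(sample)
print(record["func"], record["payload"]["uuid"], record["payload"]["reason"])
```

Filtering the parsed records for func equal to CMMDSTraceDestroyNode and reason equal to 'Protocol violation' gives the sequence of rejected rejoin attempts for a given node UUID.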
/var/run/log/vmkernel.log
Indicates that the Leader Host is rejecting takeover requests from the Backup Host.
Reason: The Leader Host's preferred fault domain entry is set to NULL.
cpu21:2099098)CMMDS: CMMDSLeaderlikeBackupShouldTakeOverCluster:1594: e40f0888###-###-###-###: Backup takeover invalid. PreferredFD is NULL