Virtual machines experience performance degradation or become unresponsive when the CMMDS leader ESXi host is rebooted

Article ID: 407826

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • VMs are slow or unresponsive to commands.
  • A LUN shows 100% utilization with no active read/write operations.
  • The cluster enters a vSAN network partition state during the leader node reboot.
  • The vSAN iSCSI Target Service (daemon process vitd) runs out of available threadpool resources.
  • VMs report iSCSI connection errors in their guest logs:

mmm dd hh:mm:ss vm_name kernel: [1761594.974483] connection5:0: detected conn error (1020)
mmm dd hh:mm:ss vm_name kernel: [1761594.974482] connection6:0: detected conn error (1020)
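
As a quick cross-check inside an affected guest, the same errors can usually be found in the guest's kernel log. The commands below are an illustrative sketch assuming a Linux guest; they are not part of the original diagnostic output:

      # dmesg | grep "detected conn error"
      # journalctl -k | grep "detected conn error"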

Environment

  • VMware vSAN 8.x

Cause

This is caused by a known CMMDS (Cluster Monitoring, Membership, and Directory Service) issue in which a leader host that is rebooting or shutting down continues to transmit leader heartbeats even though it can no longer receive traffic.

Because the other cluster nodes still receive these outgoing heartbeats, they continue to follow the rebooting leader instead of failing over to the backup node. This prevents a clean leadership transition and triggers a full cluster partition until the leader’s networking stack is completely stopped.
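
To see which host currently holds the CMMDS leader (master) role and which is the backup, the cluster view can be checked from the ESXi shell of any cluster member. The command below is the standard esxcli vsan namespace; in its output, the Sub-Cluster Master UUID and Sub-Cluster Backup UUID fields identify the leader and backup nodes (field names may vary slightly by release):

      # esxcli vsan cluster get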

 

Log Validation:

The issue can be confirmed through the following log observations:

1. Initial Leader State 
  • Leader of the vSAN cluster before maintenance mode entry: (/var/run/log/vmkernel.log)
2025-08-02T01:42:55.501Z In(182) vmkernel: cpu73:2098937)CMMDS: LeaderBuildHeartbeatMessage:2120: 52cfc3d8-####-####-ebd1-#########: [318070950]:Current membership uuid 33c41967-####-####-23a9-######### has 14 members
2025-08-02T01:42:55.501Z In(182) vmkernel: cpu73:2098937)CMMDS: LeaderBuildHeartbeatMessage:2131: 52cfc3d8-####-####-ebd1-#########: [318070950]:Member[0]:6718c2b8-####-####-fbf3-#########(leader)
2025-08-02T01:42:55.501Z In(182) vmkernel: cpu73:2098937)CMMDS: LeaderBuildHeartbeatMessage:2126: 52cfc3d8-####-####-ebd1-#########: [318070950]:Member[1]:6719af12-####-####-ff6c-#########(backup)
2. Leader Enters Maintenance Mode: (/var/run/log/vobd.log)
2025-08-02T02:26:35.617Z In(14) vobd[2098025]:  [UserLevelCorrelator] 24423759940313us: [esx.audit.maintenancemode.entered] The host has entered maintenance mode.
3. Leader Reboot Initiated: (/var/run/log/vmksummary.log)
2025-08-02T02:26:36.568Z No(13) bootstop[245573871]: Host is rebooting
4. Backup Node Becomes Leader
  • Takes over leadership: (/var/run/log/vmkernel.log)
2025-08-02T02:26:54.480Z In(182) vmkernel: cpu86:2099008)CMMDSNet: CMMDSNet_SetLeader:1315: 52cfc3d8-####-####-ebd1-#########: Updating leader node: old=6718c2b8-####-####-fbf3-######### new=none
2025-08-02T02:26:54.480Z In(182) vmkernel: cpu86:2099008)CMMDSNet: CMMDSNet_SetLeader:1315: 52cfc3d8-####-####-ebd1-#########: Updating leader node: old=none new=6719af12-####-####-ff6c-#########
2025-08-02T02:26:54.480Z In(182) vmkernel: cpu86:2099008)CMMDS: CMMDSLogStateTransition:1824: 52cfc3d8-####-####-ebd1-#########: Transitioning(6719af12-####-####-ff6c-#########) from Backup to Leader: (Reason: Backup is taking over the cluster leader)
5. Agent Node Behavior
  • Instead of transitioning directly to the new leader, the agent node drops to the discovery state: (/var/run/log/vmkernel.log)
2025-08-02T02:26:59.486Z In(182) vmkernel: cpu50:2099008)CMMDS: CMMDSLogStateTransition:1824: 52cfc3d8-####-####-ebd1-#########: Transitioning(6719b151-####-####-1fc1-#########) from Agent to Discovery: (Reason: Failed to receive from node)
2025-08-02T02:27:00.064Z In(182) vmkernel: cpu57:2099008)CMMDS: CMMDSLogStateTransition:1824: 52cfc3d8-####-####-ebd1-#########: Transitioning(6719b151-####-####-1fc1-#########) from Discovery to Rejoin: (Reason: Found a leader node)
2025-08-02T02:27:00.329Z In(182) vmkernel: cpu57:2099008)CMMDS: CMMDSLogStateTransition:1824: 52cfc3d8-####-####-ebd1-#########: Transitioning(6719b151-####-####-1fc1-#########) from Rejoin to Discovery: (Reason: Failed to receive from node)
2025-08-02T02:27:01.814Z In(182) vmkernel: cpu57:2099008)CMMDS: CMMDSLogStateTransition:1824: 52cfc3d8-####-####-ebd1-#########: Transitioning(6719b151-####-####-1fc1-#########) from Discovery to Rejoin: (Reason: Found a leader node)
2025-08-02T02:27:03.610Z In(182) vmkernel: cpu57:2099008)CMMDS: CMMDSLogStateTransition:1824: 52cfc3d8-####-####-ebd1-#########: Transitioning(6719b151-####-####-1fc1-#########) from Rejoin to Agent: (Reason: The local node has finished rejoining)
  • The cluster membership count decreases: (/var/run/log/vsansystem.log)
2025-08-02T02:26:54.480Z In(166) vsansystem[2532172]: [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-beb4] Complete, nodeCount: 14, runtime info:(vim.vsan.host.VsanRuntimeInfo) {
2025-08-02T02:26:59.480Z In(166) vsansystem[2532157]: [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-bf0d] Complete, nodeCount: 13, runtime info:(vim.vsan.host.VsanRuntimeInfo) {
2025-08-02T02:26:59.489Z In(166) vsansystem[2532157]: [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-bf10] Complete, nodeCount: 10, runtime info:(vim.vsan.host.VsanRuntimeInfo) {
2025-08-02T02:26:59.495Z In(166) vsansystem[2532157]: [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-bf0d] Complete, nodeCount: 9, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
2025-08-02T02:26:59.504Z In(166) vsansystem[2532170]: [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-bf19] Complete, nodeCount: 9, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
  • The agent node detects the loss of its leader host and terminates the active RDT association with it: (/var/run/log/vmkernel.log)
2025-08-02T02:26:59.486Z In(182) vmkernel: cpu59:2099008)CMMDS: CMMDSStateMachineReceiveLoop:1640: 52cfc3d8-####-####-ebd1-#########: Error receiving from 6718c2b8-####-####-fbf3-#########: Failure 2025-08-02T02:26:59
2025-08-02T02:26:59.486Z In(182) vmkernel: cpu59:2099008)CMMDS: CMMDSStateDestroyNode:708: 52cfc3d8-####-####-ebd1-#########: Destroying node 6718c2b8-####-####-fbf3-#########: Failed to receive from node
2025-08-02T02:26:59.486Z In(182) vmkernel: cpu59:2099008)CMMDS: AgentDestroyNode:1660: 52cfc3d8-####-####-ebd1-#########: Lost leader node (6718c2b8-####-####-fbf3-#########), can't handle that and will transition to discovery
2025-08-02T02:26:59.486Z In(182) vmkernel: cpu59:2099008)CMMDSNet: CMMDSNet_SetLeader:1315: 52cfc3d8-####-####-ebd1-#########: Updating leader node: old=6718c2b8-####-####-fbf3-######### new=none
2025-08-02T02:26:59.486Z In(182) vmkernel: cpu50:2099008)CMMDS: CMMDSLogStateTransition:1824: 52cfc3d8-####-####-ebd1-#########: Transitioning(6719b151-####-####-1fc1-#########) from Agent to Discovery: (Reason: Failed to receive from node)
2025-08-02T02:26:59.486Z Wa(180) vmkwarning: cpu50:2099008)WARNING: RDT: RDTEndQueuedMessages:1390: assoc 0x43224e33e6c0 message 92888778 failure
2025-08-02T02:26:59.486Z Wa(180) vmkwarning: cpu50:2099008)WARNING: RDT: RDTEndQueuedMessages:1390: assoc 0x43224e33e6c0 message 92888779 failure
2025-08-02T02:26:59.486Z Wa(180) vmkwarning: cpu50:2099008)WARNING: RDT: RDTEndQueuedMessages:1390: assoc 0x43224e33e6c0 message 92888780 failure
2025-08-02T02:26:59.486Z Wa(180) vmkwarning: cpu50:2099008)WARNING: RDT: RDTEndQueuedMessages:1390: assoc 0x43224e33e6c0 message 92888781 failure
2025-08-02T02:27:00.329Z Wa(180) vmkwarning: cpu57:2099008)WARNING: RDT: RDTEndQueuedMessages:1390: assoc 0x43224eb755c0 message 1 failure
2025-08-02T02:27:00.329Z Wa(180) vmkwarning: cpu57:2099008)WARNING: RDT: RDTEndQueuedMessages:1390: assoc 0x43224eb755c0 message 2 failure
2025-08-02T02:27:00.329Z Wa(180) vmkwarning: cpu57:2099008)WARNING: RDT: RDTEndQueuedMessages:1390: assoc 0x43224eb755c0 message 3 failure
2025-08-02T02:27:00.329Z Wa(180) vmkwarning: cpu57:2099008)WARNING: RDT: RDTEndQueuedMessages:1390: assoc 0x43224eb755c0 message 2 failure
  • Based on the heartbeat timestamps on the agent node, the host continues receiving heartbeats from the old leader until it transitions to discovery mode: (/var/run/log/vmkernel.log)
2025-08-02T02:26:59.488Z In(182) vmkernel: cpu50:2099008)CMMDS: CMMDSClusterDestroyNodeImpl:262: Destroying node 6719af12-####-####-ff6c-######### from the cluster db. Last HB received from node - 24349239336243137
6. VM Object Impact
  • DOM loses liveness for impacted VM: (/var/run/log/vmkernel.log)
2025-08-02T02:26:59.481Z In(182) vmkernel: cpu13:2099082)DOM: DOMOwner_SetLivenessState:10887: Object 16043e67-####-####-1a51-######### lost liveness [0x45bb80a3f840]
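
To locate the entries shown in the walkthrough above on a live host or in a support bundle, searches similar to the following can be used. This is a minimal sketch based on the default log locations and the message strings from the samples above; exact strings may differ between builds.

   Leader heartbeats and CMMDS state transitions (steps 1, 4, and 5):

      # grep LeaderBuildHeartbeatMessage /var/run/log/vmkernel.log
      # grep CMMDSLogStateTransition /var/run/log/vmkernel.log

   Maintenance mode entry on the leader (step 2):

      # grep maintenancemode.entered /var/run/log/vobd.log

   Leader reboot (step 3):

      # grep "Host is rebooting" /var/run/log/vmksummary.log

   Membership count changes and RDT association teardown (step 5):

      # grep CMMDSMembershipUpdate /var/run/log/vsansystem.log
      # grep RDTEndQueuedMessages /var/run/log/vmkernel.log

   DOM object liveness loss (step 6):

      # grep "lost liveness" /var/run/log/vmkernel.log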
 

Resolution

  1. This issue is addressed in VMware ESXi 8.0.3 build-24859861 (ESXi 8.0 P06) and in ESXi 9.0.

    • It is recommended to upgrade the affected ESXi hosts to one of these fixed versions to prevent recurrence.

  2. Workaround:
    • Manually abdicate the vSAN cluster leader role before rebooting:

      To abdicate the leader role, run the following command on the current leader host:

      # vsish -e set /vmkModules/cmmds/forceTransition abdicateLeader

      OR
    • Temporarily untag the vSAN vmknic before rebooting:
      • Remove the vSAN traffic tag from the vmknic prior to host reboot.

         To untag vSAN traffic, run the following command (where vmkx is the vSAN-tagged vmkernel interface):

                 #  esxcli network ip interface tag remove -i vmkx -t vSAN

         

      • Re-apply the vSAN tag after the host has successfully rebooted.

         To re-apply the vSAN tag after the host has rebooted, run the following command:

               #  esxcli network ip interface tag add -i vmkx -t vSAN
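
After the host is back online, it can be useful to verify that the workaround was fully reverted and that the host rejoined the cluster. The commands below are standard esxcli and ESXi utilities; output formatting may vary by release.

      Confirm the vmkernel interface is tagged for vSAN traffic again:

                #  esxcli vsan network list

      Confirm the host has rejoined the cluster and check the current leader/backup roles:

                #  esxcli vsan cluster get

      If the host was also upgraded, confirm the running ESXi version and build number:

                #  vmware -vl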