When a host enters a partitioned state and tries to rejoin the cluster, an administrator can confirm from the logs that CMMDS is having issues:
# grep 'arena space' /var/log/vmkernel.log
Confirm via CLI that the cluster membership is incrementing by running this command a few times a few seconds apart:
# localcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2019-07-28T03:26:18Z
Local Node UUID: < ABC >
....... output is snipped .......
Sub-Cluster Membership Entry Revision: 5 >> Starts incrementing on impacted hosts
Sub-Cluster Member Count:
Config Generation: < XYZ > 18 2019-07-08T18:05:37.650 >> Host State change time
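The repeated check described above can be sketched as a small POSIX shell helper. This is only an illustration, not a supported tool: the function names membership_revision and watch_membership_revision are hypothetical, and the sketch assumes the revision appears on a line beginning with "Sub-Cluster Membership Entry Revision:" as in the sample output.

```shell
# Hypothetical helper: print the numeric "Sub-Cluster Membership Entry
# Revision" value from `localcli vsan cluster get` output read on stdin.
membership_revision() {
    sed -n 's/^ *Sub-Cluster Membership Entry Revision: *\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: sample the revision several times so that an
# incrementing value is easy to spot.
# Usage: watch_membership_revision [samples] [interval_seconds]
watch_membership_revision() {
    samples="${1:-5}"
    interval="${2:-5}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        rev="$(localcli vsan cluster get | membership_revision)"
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) revision=${rev:-unknown}"
        i=$((i + 1))
        # Sleep only between samples, not after the last one.
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

Running watch_membership_revision 5 5 on a suspect host prints five timestamped samples. A value that keeps climbing between samples matches the symptom described here.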
The “epd” service is not running because of a lock file problem. This can be verified in /var/log/epd.log by checking for the following traces:
2019-04-13T21:23:44.110Z 70451 -- XXX: Lock file '/scratch/epd-store.db' exists.
2019-04-13T21:23:44.110Z 70451 -- XXX: Did the host or EPD recently crash?
2019-04-13T21:23:44.110Z 70451 -- XXX: Assuming it's OK. Unlinking lock file..
2019-04-13T21:23:44.111Z 70451 Failed to delete lock file: Is a directory
2019-04-13T21:23:44.111Z 70451 SRV: failed to open db: Failure
2019-04-13T21:23:44.111Z 70451 SRV: init for store-mgmt failed: Failure
2019-04-13T21:23:44.111Z 70451 SRV: initialization failed: Failure
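A minimal sketch of how the trace above might be checked from the shell. The function names epd_lock_failure and lock_path_type are hypothetical, and the default paths are simply those that appear in this article. The second helper assumes that the "Is a directory" message means the lock path exists as a directory. It only reports what it finds and changes nothing.

```shell
# Hypothetical helper: succeed (exit 0) if the EPD log contains the
# lock-file deletion failure shown in the traces above.
epd_lock_failure() {
    log="${1:-/var/log/epd.log}"
    grep -q 'Failed to delete lock file' "$log" 2>/dev/null
}

# Hypothetical helper: report what currently exists at the lock path
# named in the log. Prints "directory", "file" or "missing".
lock_path_type() {
    path="${1:-/scratch/epd-store.db}"
    if [ -d "$path" ]; then
        echo directory
    elif [ -e "$path" ]; then
        echo file
    else
        echo missing
    fi
}
```

For example, `epd_lock_failure && lock_path_type` prints the type of the lock path only when the failure signature is present in the log.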
Error messages such as the following in ESXi /var/run/log/vmkernel.log:
2018-05-13T13:36:03.028Z cpu16:67526)CMMDS: RejoinBuildSnapshotEntry:2043: Failed to allocate arena space for snapshot
2018-05-13T13:36:03.028Z cpu16:67526)CMMDS: CMMDSLogStateTransition:1309: Transitioning(xxxxxxxx-1f8f-1364-8400-xxxxxxxxxxxx) from Rejoin to Discovery: (Reason: Arena memory exhausted)
2018-05-13T13:59:20.463Z cpu46:67519)CMMDS: RejoinBuildSnapshotEntry:2043: Failed to allocate arena space for snapshot
2018-05-13T13:59:20.463Z cpu46:67519)CMMDS: CMMDSLogStateTransition:1309: Transitioning(xxxxxxxx-f728-bd70-67aa-xxxxxxxxxxxx) from Rejoin to Discovery: (Reason: Arena memory exhausted)
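To gauge how often the arena exhaustion recurs, the messages can be counted and their timestamps listed. The following is a rough sketch. The function names arena_failure_count and arena_exhausted_times are hypothetical, and it assumes each log entry starts with a timestamp followed by a space, as in the excerpts in this article.

```shell
# Hypothetical helper: count "Failed to allocate arena space" messages
# in a vmkernel log (the default path is the one used in this article).
arena_failure_count() {
    log="${1:-/var/log/vmkernel.log}"
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 1; }
    # grep -c prints 0 but exits non-zero when nothing matches;
    # mask that status so a clean log is not treated as an error.
    grep -c 'Failed to allocate arena space' "$log" || true
}

# Hypothetical helper: list the timestamps of state transitions whose
# reason is "Arena memory exhausted".
arena_exhausted_times() {
    log="${1:-/var/log/vmkernel.log}"
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 1; }
    grep 'Reason: Arena memory exhausted' "$log" | cut -d' ' -f1
}
```

A count that grows between runs, together with recurring transition timestamps, suggests the host is still cycling between Rejoin and Discovery.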
VMware vSAN 6.x
VMware vSAN 7.x
The EPD service is not running and therefore does not clean up discarded components, which in turn exhausts CMMDS arena resources and makes CMMDS unstable.
The cause of the initial EPD service issue relates to how this service starts on boot and where it writes on startup.
Drivers with known issues, such as Emulex elxiscsi 11.2.1152.0-1OEM.650.0.0.4240417, can break the configured scratch location and, as a result, affect how this service starts.
The following misconfigurations have been found to cause the issue when the vSAN service requires scratch to execute:
The scratch partition/location is not available for logging.
H730P controllers with an incorrect disk-access mode, and the scratch partition is unavailable or only partially available for logging.
Unsupported mixed disk-access modes (e.g. RAID1 boot/log devices + passthrough vSAN devices), and the scratch partition is unavailable or only partially available for logging.
ESXi 7.0 Update 3f (Build 20036589) and later versions include enhancements to the handling of discarded components.
If the issue is observed on the latest builds and the symptoms above match, please contact Broadcom Support to investigate the issue and provide a workaround.