A host that is part of a vSAN cluster may experience network partitions or HA events when it has physical memory (DIMM) errors.
The log snippets below show CPU heartbeat issues and the host being removed from and re-added to the cluster membership.
/var/run/log/vmkernel.log
<timestamp> cpu1:85357306)WARNING: Heartbeat: 827: PCPU 35 didn't have a heartbeat for 21 seconds, timeout is 14, 2 IPIs sent; *may* be locked up.
<timestamp> cpu50:2099267)WARNING: Heartbeat: 827: PCPU 84 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
<timestamp> cpu15:85384071)WARNING: Heartbeat: 827: PCPU 55 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up
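To see which PCPUs actually exceeded the heartbeat timeout, the warnings above can be scanned with a short script. This is a minimal sketch, assuming the message format matches the snippet shown; the function name `parse_heartbeat_warnings` is illustrative and not part of any VMware tool.

```python
import re

# Pattern derived from the vmkernel.log lines shown above.
# Assumption: other ESXi builds use the same wording.
HEARTBEAT_RE = re.compile(
    r"PCPU (\d+) didn't have a heartbeat for (\d+) seconds, "
    r"timeout is (\d+), (\d+) IPIs sent"
)


def parse_heartbeat_warnings(lines):
    """Return one dict per heartbeat warning found in the given log lines."""
    results = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match is None:
            continue
        pcpu, seconds, timeout, ipis = (int(g) for g in match.groups())
        results.append({
            "pcpu": pcpu,
            "seconds": seconds,
            "timeout": timeout,
            "ipis": ipis,
            # True when the missed-heartbeat time is past the timeout.
            "exceeded": seconds > timeout,
        })
    return results


if __name__ == "__main__":
    import sys

    # Usage: python heartbeat.py /var/run/log/vmkernel.log
    with open(sys.argv[1], errors="replace") as handle:
        for entry in parse_heartbeat_warnings(handle):
            if entry["exceeded"]:
                print(entry)
```

Of the three sample lines, only the PCPU 35 entry (21 seconds against a timeout of 14) is past the timeout; the other two are early warnings.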
Due to network latency, vSAN may have experienced a cluster partition. To verify a network partition, check for the following events in clomd.log:
/var/run/log/clomd.log
<timestamp> info clomdb[2099229] [Originator@6876] CdbHandleRemoveEntry: Removing 63bc1299-518c-ee46-XXXX-############ of type CdbObjectNode from CLOMDB.
<timestamp> info clomdb[2099229] [Originator@6876] CdbAddTableEntry: Added 63bc1299-518c-ee46-XXXX-#### of type CdbObjectNode to CLOMDB, FD:00000000-0000-0000-0000-000000000000.
<timestamp> info clomdb[2099229] [Originator@6876] CdbHandleRemoveEntry: Removing 63bc1237-1fc2-560e-XXXX-#### of type CdbObjectNode from CLOMDB.
<timestamp> info clomdb[2099229] [Originator@6876] CdbAddTableEntry: Added 63bc1237-1fc2-560e-XXXX-############ of type CdbObjectNode to CLOMDB, FD:63bc1237-1fc2-560e-XXXX-############.
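Repeated remove/add pairs for the same node are easier to spot when counted per UUID. The following is a minimal sketch, assuming the clomd.log message format matches the lines above; the function names are hypothetical and the script is not a VMware utility.

```python
import re
from collections import Counter

# Matches the CdbObjectNode remove/add events shown above.
# Assumption: the message wording is the same on other vSAN builds.
NODE_EVENT_RE = re.compile(
    r"(CdbHandleRemoveEntry: Removing|CdbAddTableEntry: Added) "
    r"(\S+) of type CdbObjectNode"
)


def count_node_membership_changes(lines):
    """Map each node UUID to a (removed_count, added_count) tuple."""
    removed = Counter()
    added = Counter()
    for line in lines:
        match = NODE_EVENT_RE.search(line)
        if match is None:
            continue
        action, uuid = match.groups()
        if action.endswith("Removing"):
            removed[uuid] += 1
        else:
            added[uuid] += 1
    return {
        uuid: (removed[uuid], added[uuid])
        for uuid in removed.keys() | added.keys()
    }


def flapping_nodes(lines):
    """Return the sorted UUIDs that were both removed and added."""
    changes = count_node_membership_changes(lines)
    return sorted(u for u, (rem, add) in changes.items() if rem and add)


if __name__ == "__main__":
    import sys

    # Usage: python clom_flap.py /var/run/log/clomd.log
    with open(sys.argv[1], errors="replace") as handle:
        for uuid in flapping_nodes(handle):
            print(uuid)
```

A node that appears in the output left and rejoined the CLOM database within the scanned log, which is consistent with a transient partition.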
In /var/log/vobd.log, you can see a Machine Check Exception (MCE) reported for a memory module:
<timestamp> [cpuCorrelator] 16814427116585us: [vob.cpu.mce.log4] MCE bank 8: status:0x9c00004001010090 misc:0x200005c280001086 addr:0x56e0615200 cpu:25 physAddr:0x56e0615200 physSize:0x40 ceCount:0x1
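The status value in the MCE line can be decoded to confirm that it points at memory. The sketch below assumes an Intel CPU and the IA32_MCi_STATUS bit layout described in the Intel Software Developer's Manual; other vendors lay the register out differently, and the corrected-error count field is only meaningful on processors that support it. The function name `decode_mce_status` is illustrative.

```python
# Memory-controller transaction types (MMM field of the MCA error code).
MEMORY_TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mce_status(status):
    """Decode an IA32_MCi_STATUS value (Intel layout assumed)."""
    code = status & 0xFFFF  # bits 15:0, architectural MCA error code
    # Memory-controller errors use the pattern 0000 0000 1MMM CCCC.
    is_memory = (code & 0xFF80) == 0x0080
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        # Bits 52:38, valid only where the CPU reports corrected counts.
        "corrected_error_count": (status >> 38) & 0x7FFF,
        "mca_error_code": code,
        "memory_controller_error": is_memory,
    }
    if is_memory:
        mmm = (code >> 4) & 0x7
        info["memory_transaction"] = MEMORY_TRANSACTIONS.get(mmm, "reserved")
        info["channel"] = code & 0xF  # 0xF means channel not specified
    return info


if __name__ == "__main__":
    # Status value taken from the vobd.log line above.
    for key, value in decode_mce_status(0x9C00004001010090).items():
        print(f"{key}: {value}")
```

Applied to the status value from the log line above, this sketch reports a valid, corrected (not uncorrected) memory-controller read error with a corrected-error count of 1, which is consistent with ceCount:0x1 in the same line.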
The vSAN health view in the vCenter UI may not show any cluster partition. However, the cluster may experience resync and data health issues.
VMs may experience performance issues, and VM migrations may fail whether they are started manually or by DRS.
This can be caused by hardware issues on the physical server (memory/DIMM or CPU).
Resolve the hardware issues on the server to address the problem.