vSAN host experiences network partition or HA events when there is a memory (DIMM) issue

Article ID: 388421

Products

VMware vSAN

Issue/Introduction

Symptoms:

A host that is part of a vSAN cluster experiences network partition or HA events when there are physical memory (DIMM) errors.

The log snippets below show CPU heartbeat warnings and the host being removed from and re-added to the cluster membership.

/var/run/log/vmkernel.log

<timestamp> cpu1:85357306)WARNING: Heartbeat: 827: PCPU 35 didn't have a heartbeat for 21 seconds, timeout is 14, 2 IPIs sent; *may* be locked up.
<timestamp> cpu50:2099267)WARNING: Heartbeat: 827: PCPU 84 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
<timestamp> cpu15:85384071)WARNING: Heartbeat: 827: PCPU 55 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up
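
To scan a copy of vmkernel.log for these warnings, a short script can help. The following is a minimal sketch, assuming the warning text always follows the format quoted above. The function name find_heartbeat_warnings is illustrative and not part of any VMware tooling.

```python
import re

# Pattern modeled on the Heartbeat warnings quoted above (assumed to be stable).
HEARTBEAT_RE = re.compile(
    r"PCPU (?P<pcpu>\d+) didn't have a heartbeat for (?P<seconds>\d+) seconds, "
    r"timeout is (?P<timeout>\d+), (?P<ipis>\d+) IPIs sent"
)


def find_heartbeat_warnings(lines):
    """Return one (pcpu, seconds_without_heartbeat, timeout) tuple per matching line."""
    hits = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            hits.append(
                (int(match["pcpu"]), int(match["seconds"]), int(match["timeout"]))
            )
    return hits


if __name__ == "__main__":
    # Path taken from this article; adjust it when reading an extracted log bundle.
    with open("/var/run/log/vmkernel.log", errors="replace") as log:
        for pcpu, seconds, timeout in find_heartbeat_warnings(log):
            print(f"PCPU {pcpu}: no heartbeat for {seconds}s (timeout {timeout}s)")
```

Entries where the reported seconds exceed the timeout, as in the first sample line (21 seconds against a timeout of 14), are the ones most worth correlating with the cluster membership changes.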

Due to network latency, vSAN may have experienced a cluster partition. To verify a network partition, check for the events below in clomd.log:

/var/run/log/clomd.log

<timestamp> info clomdb[2099229] [Originator@6876] CdbHandleRemoveEntry: Removing 63bc1299-518c-ee46-XXXX-############ of type CdbObjectNode from CLOMDB.
<timestamp> info clomdb[2099229] [Originator@6876] CdbAddTableEntry: Added 63bc1299-518c-ee46-XXXX-#### of type CdbObjectNode to CLOMDB, FD:00000000-0000-0000-0000-000000000000.
<timestamp> info clomdb[2099229] [Originator@6876] CdbHandleRemoveEntry: Removing 63bc1237-1fc2-560e-XXXX-#### of type CdbObjectNode from CLOMDB.
<timestamp> info clomdb[2099229] [Originator@6876] CdbAddTableEntry: Added 63bc1237-1fc2-560e-XXXX-############ of type CdbObjectNode to CLOMDB, FD:63bc1237-1fc2-560e-XXXX-############.
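
Repeated removal and re-addition of the same node UUID is the pattern to look for. The following is a rough sketch for counting those events per node, assuming the clomd.log message format shown above. The helper names are hypothetical.

```python
import re
from collections import Counter

# Matches the CdbObjectNode remove/add messages quoted above (assumed format).
NODE_EVENT_RE = re.compile(
    r"(?:CdbHandleRemoveEntry|CdbAddTableEntry): "
    r"(?P<action>Removing|Added) (?P<uuid>\S+) of type CdbObjectNode"
)


def count_node_membership_events(lines):
    """Count removals and additions per node UUID."""
    counts = Counter()
    for line in lines:
        match = NODE_EVENT_RE.search(line)
        if match:
            action = "removed" if match["action"] == "Removing" else "added"
            counts[(match["uuid"], action)] += 1
    return counts


def flapping_nodes(counts):
    """Return the UUIDs that were both removed from and added back to CLOMDB."""
    removed = {uuid for (uuid, action) in counts if action == "removed"}
    added = {uuid for (uuid, action) in counts if action == "added"}
    return sorted(removed & added)


if __name__ == "__main__":
    # Path taken from this article; adjust it when reading an extracted log bundle.
    with open("/var/run/log/clomd.log", errors="replace") as log:
        counts = count_node_membership_events(log)
    for uuid in flapping_nodes(counts):
        print(uuid, "removed:", counts[(uuid, "removed")], "added:", counts[(uuid, "added")])
```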

 

In /var/log/vobd.log, you can see a Machine Check Exception from a memory module. 

<timestamp> [cpuCorrelator] 16814427116585us: [vob.cpu.mce.log4] MCE bank 8: status:0x9c00004001010090 misc:0x200005c280001086 addr:0x56e0615200 cpu:25 physAddr:0x56e0615200 physSize:0x40 ceCount:0x1
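
The status value in this entry can be decoded to confirm that it describes a corrected memory error. The sketch below assumes the host has an Intel CPU and that the status field is the raw IA32_MCi_STATUS register as laid out in the Intel Software Developer's Manual. The bit layout differs on other vendors, and the function names are illustrative.

```python
import re

STATUS_RE = re.compile(r"status:(0x[0-9a-fA-F]+)")


def extract_mce_status(line):
    """Pull the status value out of a vob.cpu.mce.log4 entry, or return None."""
    match = STATUS_RE.search(line)
    return int(match.group(1), 16) if match else None


def decode_mce_status(status):
    """Decode selected fields, assuming the Intel IA32_MCi_STATUS layout."""
    mca_code = status & 0xFFFF  # bits 15:0, MCA error code
    return {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_error_code": mca_code,
        # Memory controller errors use the form 000F 0000 1MMM CCCC
        # (F is a filtering bit, so it is masked out of the comparison).
        "memory_controller_error": (mca_code & 0xEF80) == 0x0080,
    }


if __name__ == "__main__":
    # Status value copied from the vobd.log entry quoted above.
    for name, value in decode_mce_status(0x9C00004001010090).items():
        print(name, value)
```

Under these assumptions, the sample value decodes as a valid, corrected (not uncorrected) memory controller error with a corrected error count of 1, which agrees with the ceCount:0x1 field in the same log line.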

 

The vSAN health view in the vCenter UI may not show any cluster partition. However, the cluster may experience resync and data health issues.

VMs may experience performance issues, and migrating VMs manually or through DRS may fail.

Environment

VMware vSAN 7.X
VMware vSAN 8.X

Cause

This can be caused by hardware issues on the physical server, such as faulty memory (DIMM) or CPUs.

Resolution

Fix the hardware issues on the server to resolve this problem.

Additional Information