Uncorrectable memory errors in VMware ESXi can cause virtual machine crashes
VMware ESXi
An uncorrectable memory error in the host hardware (for example, an SRAR exception in the Level 0 data cache) can cause ESXi to terminate the affected virtual machine (VM) to prevent system instability, even if VM Monitoring is disabled in the High Availability (HA) settings. When Host Failure monitoring is enabled, HA may restart the affected VM on another host in the cluster to maintain availability. This occurs because HA prioritizes host-level failure recovery, which can include restarting VMs impacted by hardware failures such as those reported as Machine Check Architecture (MCA) errors in the vmkernel logs.
Log Snippet (vmkernel dump):
xxxx-xx-xxTxx:xx:xx.247Z cpu44:2115517)ALERT: MCA: 190: SRAR Excp Gf B1 Sbd80000000100134 A86c1a0be00 M86 P86c1a0be00/40 Cache Hierarchy: Level 0 Data Cache DataRead Error. IDT: 1804: Uncorrectable/recoverable memory error in virtual machine vmm3:VMname; recovering by killing the VM.
xxxx-xx-xxTxx:xx:xx.247Z cpu44:2116016)MCAIntel: 1287: Force retiring MPN 0x86c1a0b to recover from MCA error detected by cpu44 in bank1.
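The hex fields in the MCA line can be read by hand, which helps when passing details to the hardware vendor. The following is a minimal Python sketch, not a VMware tool. It assumes that the value after "S" is the raw IA32_MCi_STATUS register, that the value after "A" is the failing physical address, and that pages are 4 KiB. The function names are hypothetical, and the bit layout follows the Intel Software Developer's Manual.

```python
# Assumption: "S<hex>" in the MCA log line is the raw IA32_MCi_STATUS value
# and "A<hex>" is the physical address. Bit positions follow the Intel SDM.
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60, "MISCV": 59,
             "ADDRV": 58, "PCC": 57, "S": 56, "AR": 55}

# Sub-fields of a cache hierarchy error code: 000F 0001 RRRR TTLL.
LEVELS = ("Level 0", "Level 1", "Level 2", "Generic")
TYPES = ("Instruction", "Data", "Generic")
REQUESTS = ("Generic Error", "Generic Read", "Generic Write", "Data Read",
            "Data Write", "Instruction Fetch", "Prefetch", "Eviction", "Snoop")


def _lookup(table, index):
    """Return the table entry, or 'Reserved' for encodings not listed."""
    return table[index] if index < len(table) else "Reserved"


def decode_mci_status(status):
    """Decode the status flags and, if present, the cache hierarchy error code."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}

    # Software-recoverable errors: uncorrected, context not corrupted, signaled.
    # AR=1 means action required (SRAR); AR=0 means action optional (SRAO).
    if flags["UC"] and not flags["PCC"] and flags["S"]:
        error_class = "SRAR" if flags["AR"] else "SRAO"
    else:
        error_class = "other"

    code = status & 0xFFFF  # MCA error code, bits 15:0
    cache_error = None
    if (code & 0xEF00) == 0x0100:  # matches 000F 0001 xxxx xxxx
        cache_error = (_lookup(LEVELS, code & 0x3),
                       _lookup(TYPES, (code >> 2) & 0x3),
                       _lookup(REQUESTS, (code >> 4) & 0xF))
    return {"flags": flags, "class": error_class, "cache_error": cache_error}


def machine_page_number(address, page_shift=12):
    """Assumption: 4 KiB pages, so the page number is the address >> 12."""
    return address >> page_shift


if __name__ == "__main__":
    result = decode_mci_status(0xBD80000000100134)
    print(result["class"], result["cache_error"])
    print(hex(machine_page_number(0x86C1A0BE00)))
```

For the sample above, the status value decodes to an SRAR data read error in the Level 0 data cache, matching the text of the log message, and the address shifted right by 12 bits equals the MPN that the second log line reports as retired.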
Another example of a failure event (memory) that resulted in a VM being rebooted:
xxxx-xx-xxTxx:xx:xx.548Z info hostd[24540622] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 3658 : Issue detected on <esxi_host> in ha-datacenter: MCA: 200: SRAR Excp G7 B1 Sbd80000000100134 A1315da19980 M86 P1315da19980/40 Cache Hierarchy: Level 0 Data Cache DataRea (2025-04-07T11:23:36.546Z cpu75:27422700)
xxxx-xx-xxTxx:xx:xx.548Z info hostd[24540622] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 3659 : Issue detected on <esxi_host> in ha-datacenter: IDT: 1560: Uncorrectable/recoverable memory error in virtual machine vmm2:<virtual machine>; recovering by killing the VM
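To check whether a host has logged these events, the log files can be searched for the two message patterns shown in the excerpts. The following Python sketch is illustrative only. The regular expressions are derived from the sample lines in this article and may need adjusting for other ESXi versions, and the function name `find_memory_error_events` is hypothetical.

```python
import re

# Patterns derived from the sample log lines in this article (assumption:
# other ESXi builds may word these messages differently).
VM_KILLED = re.compile(
    r"Uncorrectable/recoverable memory error in virtual machine "
    r"(?P<world>vmm\d+):(?P<vm>.+?); recovering by killing the VM"
)
PAGE_RETIRED = re.compile(
    r"Force retiring MPN (?P<mpn>0x[0-9a-fA-F]+) to recover from MCA error "
    r"detected by (?P<cpu>cpu\d+) in bank(?P<bank>\d+)"
)


def find_memory_error_events(lines):
    """Return a list of dicts describing VM-kill and page-retire events."""
    events = []
    for line in lines:
        match = VM_KILLED.search(line)
        if match:
            events.append({"event": "vm_killed",
                           "world": match.group("world"),
                           "vm": match.group("vm")})
        match = PAGE_RETIRED.search(line)
        if match:
            events.append({"event": "page_retired",
                           "mpn": match.group("mpn"),
                           "cpu": match.group("cpu"),
                           "bank": int(match.group("bank"))})
    return events


if __name__ == "__main__":
    sample = [
        "cpu44:2115517)ALERT: MCA: 190: SRAR Excp Gf B1 Sbd80000000100134 "
        "A86c1a0be00 M86 P86c1a0be00/40 Cache Hierarchy: Level 0 Data Cache "
        "DataRead Error. IDT: 1804: Uncorrectable/recoverable memory error in "
        "virtual machine vmm3:VMname; recovering by killing the VM.",
        "cpu44:2116016)MCAIntel: 1287: Force retiring MPN 0x86c1a0b to recover "
        "from MCA error detected by cpu44 in bank1.",
    ]
    for event in find_memory_error_events(sample):
        print(event)
```

The function accepts any iterable of strings, so an open log file can be passed directly. Matching on the message text rather than the line prefix lets the same patterns work for both the vmkernel and hostd excerpts.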
Reach out to the hardware vendor to review and perform diagnostics on the failing component.
Note: To avoid further incidents, do not bring the host back online until the issue is resolved.