Machine Check Exception
VMware ESXi 7.#.# [Releasebuild-22348816 x86_64]
Machine Check Exception on PCPU## in world ######:idle51
System has encountered a Hardware Error - Please contact the hardware vendor
Uncorrectable/recoverable memory error in world ####; unable to recover in kernel context
Data Cache DataRead Error
In /var/run/log/vmkernel.log, you may see entries similar to:
YYYY-MM-DDTHH:MM:SS.114Z cpu##:40848027)ALERT: MCA: 200: SRAR Excp G7 B1 ###### Cache Hierarchy: Level 0 Data Cache DataRead Error.
YYYY-MM-DDTHH:MM:SS.114Z cpu##:40848027)MCAIntel: 1120: Force retiring MPN ###### to recover from MCA error detected by cpu## in bank1.
YYYY-MM-DDTHH:MM:SS.252Z cpu##:40848027)ALERT: MCA: 200: SRAR Excp G7 B1 ###### Cache Hierarchy: Level 0 Data Cache DataRead Error.
YYYY-MM-DDTHH:MM:SS.252Z cpu##:40848027)MCAIntel: 1120: Force retiring MPN ###### to recover from MCA error detected by cpu## in bank1.
YYYY-MM-DDTHH:MM:SS.114Z cpu##:40848027)IDT: 1895: Uncorrectable/unrecoverable machine check error
YYYY-MM-DDTHH:MM:SS.114Z cpu##:40848027)MCA: 208: UC Excp G4 86 Sbb00002000000e0b AB M180008 P8/8 I/O error reported by PCI 0000:00:03.0.
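When triaging a log bundle, it can help to pull only the MCA entries out of vmkernel.log and see which error type each one reports. The following is a minimal sketch, not a VMware tool: the regular expression is modeled only on the sample entries above, the function name `parse_mca_line` and the field names are hypothetical, and the real log format may differ between builds.

```python
import re

# Pattern inferred from the sample entries in this article (an assumption,
# not a documented format): "MCA: <id>: <TYPE> Excp G<..> <bank token> <rest>".
MCA_RE = re.compile(
    r"MCA: \d+: (?P<kind>[A-Z]+) Excp (?P<g>G\S+) (?P<bank>\S+) (?P<rest>.*)"
)


def parse_mca_line(line):
    """Return a dict describing an MCA entry, or None for any other line."""
    match = MCA_RE.search(line)
    if match is None:
        return None
    return {
        "alert": ")ALERT:" in line,        # entry was logged at ALERT level
        "kind": match.group("kind"),       # e.g. SRAR or UC
        "bank_token": match.group("bank"), # raw token, e.g. B1
        "detail": match.group("rest").strip(),
    }


if __name__ == "__main__":
    sample = (
        "YYYY-MM-DDTHH:MM:SS.114Z cpu##:40848027)ALERT: MCA: 200: SRAR Excp "
        "G7 B1 ###### Cache Hierarchy: Level 0 Data Cache DataRead Error."
    )
    print(parse_mca_line(sample))
```

Lines such as the "MCAIntel: ... Force retiring MPN" entries do not match this pattern and return None, so they would need their own pattern if page-retirement events are also of interest.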
The Machine Check Architecture (MCA) is a CPU feature designed to detect and report hardware anomalies. When the hardware detects a critical or fatal condition, it raises a Machine Check Exception (MCE). Fatal MCEs are severe and unrecoverable, so the ESXi host is expected to crash, typically with a Purple Screen of Death (PSOD).
In this scenario, the MCE was categorized as an SRAR (Software Recoverable Action Required) error, which denotes:
Uncorrectable: The error cannot be automatically corrected by hardware.
Recoverable: A system-level action could theoretically mitigate the issue.
Action Required: Specific corrective steps, such as terminating the thread accessing the affected Memory Page Number (MPN), are needed.
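The classification above corresponds to flag bits in the bank's IA32_MCi_STATUS register on Intel processors. The sketch below shows how a raw status value could be mapped to an error class. The bit positions follow the Intel SDM machine-check chapter as best understood here; the function name `classify_mci_status` is hypothetical, and the sketch assumes the processor supports software error recovery, since the S and AR bits are only defined in that case.

```python
# Flag bits in IA32_MCi_STATUS (Intel SDM, Machine-Check Architecture chapter).
VAL = 1 << 63  # register contains valid error information
UC = 1 << 61   # error was not corrected by hardware
PCC = 1 << 57  # processor context corrupt
S = 1 << 56    # error was signaled with a machine check exception
AR = 1 << 55   # software action is required before execution can continue


def classify_mci_status(status):
    """Map a raw 64-bit status value to a coarse error class (illustrative)."""
    if not status & VAL:
        return "invalid"
    if not status & UC:
        return "corrected"
    if status & PCC:
        return "UC"    # uncorrected, context corrupt: fatal
    if status & S and status & AR:
        return "SRAR"  # uncorrectable, recoverable, action required
    if status & S:
        return "SRAO"  # recoverable, action optional
    if not status & AR:
        return "UCNA"  # uncorrected, no action required
    return "unknown"


if __name__ == "__main__":
    print(classify_mci_status(VAL | UC | S | AR))
```

An SRAR value has UC, S, and AR set with PCC clear, which matches the three properties listed above: the hardware could not correct the error, the processor context is still intact, and software must act before the affected thread continues.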
Because the faulting thread was executing within the vmkernel context, the ESXi host was unable to isolate or terminate it. As a result, the MCE was escalated to a fatal system error, leading to a crash.