Host halts with a purple diagnostic screen (PSOD) referencing Machine Check Exception (MCE)

Article ID: 372284

Products

VMware vSphere ESXi

Issue/Introduction

  • When a purple diagnostic screen (PSOD) occurs on an ESXi host, it may reference a "Machine Check Exception".

  • At the ESXi console, the purple diagnostic screen will have entries similar to:

    VMware ESXi 7.#.# [Releasebuild-22348816 x86_64]
    Machine Check Exception on PCPU## in world ######:idle51
    System has encountered a Hardware Error - Please contact the hardware vendor


    Uncorrectable/recoverable memory error in world ####; unable to recover in kernel context
    Data Cache DataRead Error

  • In the /var/run/log/vmkernel.log, you may see entries similar to:

    YYYY-MM-DDTHH:MM:SS.114Z cpu##:40848027)ALERT: MCA: 200: SRAR Excp G7 B1 ###### Cache Hierarchy: Level 0 Data Cache DataRead Error.

    YYYY-MM-DDTHH:MM:SS.114Z cpu##:40848027)MCAIntel: 1120: Force retiring MPN ###### to recover from MCA error detected by cpu## in bank1.
    YYYY-MM-DDTHH:MM:SS.252Z cpu##:40848027)ALERT: MCA: 200: SRAR Excp G7 B1 ###### Cache Hierarchy: Level 0 Data Cache DataRead Error.
    YYYY-MM-DDTHH:MM:SS.252Z cpu##:40848027)MCAIntel: 1120: Force retiring MPN ###### to recover from MCA error detected by cpu## in bank1.

  • The error can also be caused by a failing hardware device. In that case, the PSOD may report an error similar to:

    YYYY-MM-DDTHH:MM:SS.114Z cpu##:40848027)IDT: 1895: Uncorrectable/unrecoverable machine check error
    YYYY-MM-DDTHH:MM:SS.114Z cpu##:40848027)MCA: 208: UC Excp G4 86 Sbb00002000000e0b AB M180008 P8/8 I/O error reported by PCI 0000:00:03.0.
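When reviewing a copy of vmkernel.log, it can help to pull out only the machine-check lines before sending them to the hardware vendor. The following is a minimal Python sketch, not a VMware tool. The function name `summarize_mce_lines` is hypothetical, and the markers and patterns are assumptions derived only from the sample lines above, so real logs may differ by ESXi build.

```python
import re
from collections import Counter

# Substrings taken from the sample log lines above (assumption: real logs
# use the same prefixes; adjust for your ESXi build).
MCE_MARKERS = ("mca:", "mcaintel:", "machine check")

# PCI address in domain:bus:device.function form, e.g. 0000:00:03.0
PCI_ADDRESS = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")

# Error class printed before "Excp", e.g. "MCA: 200: SRAR Excp"
ERROR_CLASS = re.compile(r"\bMCA: \d+: (\w+) Excp\b")


def summarize_mce_lines(lines):
    """Return (matching lines, error-class counts, PCI addresses mentioned)."""
    matches, classes, devices = [], Counter(), set()
    for line in lines:
        lowered = line.lower()
        if not any(marker in lowered for marker in MCE_MARKERS):
            continue  # not a machine-check related line
        matches.append(line.rstrip("\n"))
        found = ERROR_CLASS.search(line)
        if found:
            classes[found.group(1)] += 1
        devices.update(PCI_ADDRESS.findall(line))
    return matches, classes, devices
```

The function accepts any iterable of lines, so an open file object for a copy of the log can be passed in directly. The class counts and any PCI addresses it returns are only a starting point for the vendor investigation, not a diagnosis.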

Cause

The Machine Check Architecture (MCA) is a CPU feature designed to detect and report hardware errors. When the hardware detects a critical or fatal condition, it raises a Machine Check Exception (MCE). An MCE that the host cannot recover from is treated as severe, and ESXi halts by design with a purple diagnostic screen, often called a Purple Screen of Death (PSOD).

In this scenario, the MCE was categorized as SRAR (Software Recoverable Action Required), which denotes:

Uncorrectable: The error cannot be automatically corrected by hardware.

Recoverable: A system-level action could theoretically mitigate the issue.

Action Required: Specific corrective steps, such as terminating the thread accessing the affected Machine Page Number (MPN), are needed.


Because the faulting thread was executing in vmkernel context, the ESXi host was unable to isolate or terminate it. The MCE was therefore escalated to a fatal system error, which crashed the host.
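The classification described above can be illustrated by decoding the flag bits of a machine-check bank status value. The sketch below assumes the bit layout of the IA32_MCi_STATUS register as documented in the Intel Software Developer's Manual (VAL 63, UC 61, PCC 57, S 56, AR 55). The function names are hypothetical, and this is not how ESXi itself is implemented.

```python
# Flag bit positions in IA32_MCi_STATUS (assumption: Intel SDM layout).
VAL, UC, PCC, S, AR = 63, 61, 57, 56, 55


def bit(status, position):
    """Return the value (0 or 1) of a single bit in the status word."""
    return (status >> position) & 1


def classify_mci_status(status):
    """Map a 64-bit bank status value to an error class label."""
    if not bit(status, VAL):
        return "invalid"  # register does not hold a valid error
    if not bit(status, UC):
        return "CE"       # corrected error
    if bit(status, PCC):
        return "UC"       # processor context corrupt: not recoverable
    if bit(status, S) and bit(status, AR):
        return "SRAR"     # software recoverable, action required
    if bit(status, S):
        return "SRAO"     # software recoverable, action optional
    return "UCNA"         # uncorrected, no action required
```

For example, a value with VAL, UC, S, and AR set and PCC clear is classified as SRAR, which matches the case in this article: the hardware could not correct the error, and software was expected to act on the affected page.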

Resolution

  • Reboot the Host: Restarting the ESXi host may temporarily restore functionality if the issue was caused by a transient hardware event.

  • Engage Hardware Vendor: Contact your server hardware vendor with the captured MCE/PSOD data. A thorough hardware-level investigation is required to identify the root cause of the MCE and assess if hardware replacement or firmware updates are necessary.

Additional Information