Host halts with a purple diagnostic screen (PSOD - purple screen of death) referencing Machine Check Exception (MCE)

Article ID: 372284

Products

VMware vSphere ESXi

Issue/Introduction

  • When a purple diagnostic screen (PSOD) occurs on an ESXi host, it may reference a "Machine Check Exception".

  • At the ESXi console, the purple diagnostic screen will have entries similar to:

    VMware ESXi #.#.# [Releasebuild-######## x86_64]
    Machine Check Exception on PCPU## in world ######:idle51
    System has encountered a Hardware Error - Please contact the hardware vendor


    Uncorrectable/recoverable memory error in world ####; unable to recover in kernel context
    Data Cache DataRead Error

  • In /var/run/log/vmkernel.log, you may see entries similar to:

    YYYY-MM-DDTHH:MM:SS.FFFZ cpu##:40848027)ALERT: MCA: 200: SRAR Excp G7 B1 ###### Cache Hierarchy: Level 0 Data Cache DataRead Error.

    YYYY-MM-DDTHH:MM:SS.FFFZ cpu##:40848027)MCAIntel: 1120: Force retiring MPN ###### to recover from MCA error detected by cpu## in bank1.
    YYYY-MM-DDTHH:MM:SS.FFFZ cpu##:40848027)ALERT: MCA: 200: SRAR Excp G7 B1 ###### Cache Hierarchy: Level 0 Data Cache DataRead Error.
    YYYY-MM-DDTHH:MM:SS.FFFZ cpu##:40848027)MCAIntel: 1120: Force retiring MPN ###### to recover from MCA error detected by cpu## in bank1.

  • The error can also be caused by a failing hardware device. In that case, the PSOD may report errors similar to:

    YYYY-MM-DDTHH:MM:SS.FFFZ cpu##:40848027)IDT: 1895: Uncorrectable/unrecoverable machine check error
    YYYY-MM-DDTHH:MM:SS.FFFZ cpu##:40848027)MCA: 208: UC Excp G4 86 Sbb00002000000e0b AB M180008 P8/8 I/O error reported by PCI 0000:00:03.0.
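The vmkernel.log entries above are regular enough to be filtered mechanically before the data is passed to the hardware vendor. The following Python sketch is an illustration only and not a VMware tool. The regular expressions are inferred from the sample lines in this article rather than from a documented log format, the function name parse_mca_line is hypothetical, and the values substituted for the # placeholders in the sample are made up.

```python
import re

# Pattern inferred from the sample lines in this article (an assumption,
# not a documented log format): "cpu<N>:<world>)", an optional "ALERT: ",
# then "MCA: <code>: <type> Excp <details>".
MCA_LINE = re.compile(
    r"cpu(?P<cpu>\d+):(?P<world>\d+)\)"
    r"(?:ALERT: )?MCA: \d+: (?P<kind>\w+) Excp (?P<detail>.*)"
)
# PCI address in segment:bus:device.function form, e.g. 0000:00:03.0
PCI_ADDR = re.compile(
    r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b"
)


def parse_mca_line(line):
    """Return a dict describing an MCA log line, or None if it does not match."""
    match = MCA_LINE.search(line)
    if match is None:
        return None
    pci = PCI_ADDR.search(match.group("detail"))
    return {
        "cpu": int(match.group("cpu")),
        "kind": match.group("kind"),  # e.g. "SRAR" or "UC"
        "detail": match.group("detail").strip(),
        "pci": pci.group(0) if pci else None,
    }


if __name__ == "__main__":
    # Hypothetical line: placeholder fields filled with made-up values.
    sample = (
        "2024-01-01T00:00:00.000Z cpu4:40848027)MCA: 208: UC Excp G4 86 "
        "Sbb00002000000e0b AB M180008 P8/8 "
        "I/O error reported by PCI 0000:00:03.0."
    )
    print(parse_mca_line(sample))
```

The "MCAIntel" lines are deliberately not matched by this pattern. If the format on a given ESXi build differs from the samples shown here, the expressions would need to be adjusted.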

Cause

The Machine Check Architecture (MCA) is a CPU feature designed to detect and report hardware errors. When the hardware detects a critical or fatal condition, it raises a Machine Check Exception (MCE). These exceptions are treated as severe: when one cannot be recovered, the ESXi host halts by design with a purple diagnostic screen (PSOD).

In this scenario, the MCE was categorized as an SRAR (Software Recoverable Action Required), which denotes:

  • Uncorrectable: The error cannot be automatically corrected by hardware.

  • Recoverable: A system-level action could theoretically mitigate the issue.

  • Action Required: Specific corrective steps, such as terminating the thread accessing the affected Memory Page Number (MPN), are needed.


Because the faulting thread was executing in the vmkernel context, the ESXi host was unable to isolate or terminate it. The MCE was therefore escalated to a fatal system error, causing the host to crash.
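The escalation reasoning in this section can be summarized as a small decision function. This is a toy model of the explanation given in this article, not ESXi's actual machine check handler. The names Context, Outcome, and handle_srar are hypothetical, and the non-kernel branch is an assumption based on the statement that terminating the affected thread could mitigate the error.

```python
from enum import Enum


class Context(Enum):
    USER_WORLD = "user world"  # assumption: a thread that can be terminated
    VMKERNEL = "vmkernel"      # kernel context, cannot be isolated


class Outcome(Enum):
    RECOVERED = "affected thread terminated"
    PSOD = "escalated to fatal error (purple diagnostic screen)"


def handle_srar(context):
    """Model the outcome of an SRAR machine check as described in this article.

    SRAR errors cannot be corrected by hardware, but they are recoverable
    if software can stop the thread that is accessing the affected
    Memory Page Number (MPN).
    """
    if context is Context.VMKERNEL:
        # The faulting thread cannot be isolated or terminated,
        # so the error is treated as fatal.
        return Outcome.PSOD
    return Outcome.RECOVERED


if __name__ == "__main__":
    print(handle_srar(Context.VMKERNEL).value)
```

The case described in this article corresponds to handle_srar(Context.VMKERNEL), which yields the fatal outcome.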

Resolution

  • Reboot the Host: Restarting the ESXi host may temporarily restore functionality if the issue was caused by a transient hardware event.

  • Engage Hardware Vendor: Contact your server hardware vendor with the captured MCE/PSOD data. A thorough hardware-level investigation is required to identify the root cause of the MCE and assess if hardware replacement or firmware updates are necessary.
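When preparing data for the hardware vendor, it may help to summarize which CPU and MCA bank the log names as the detector of the error. The sketch below is one possible way to do this in Python. The pattern is inferred from the "Force retiring MPN" sample line in this article, and the function name count_mca_sources is hypothetical.

```python
import re
from collections import Counter

# Inferred from the "Force retiring MPN ..." sample line in this article;
# this is an assumption, not a documented log format.
RETIRE_LINE = re.compile(
    r"MCA error detected by cpu(?P<cpu>\d+) in bank(?P<bank>\d+)"
)


def count_mca_sources(lines):
    """Count how often each (cpu, bank) pair is named as the error detector."""
    counts = Counter()
    for line in lines:
        match = RETIRE_LINE.search(line)
        if match:
            key = (int(match.group("cpu")), int(match.group("bank")))
            counts[key] += 1
    return counts
```

The function accepts any iterable of lines, so it could be given an open file handle for a copy of /var/run/log/vmkernel.log. Whether repeated reports from the same CPU and bank indicate a persistent fault is a judgment for the hardware vendor.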

Additional Information

Decoding Machine Check Error (MCE) output after an ESXi panic (Purple Screen)


Japanese Version:
ホストがマシンチェック例外 (MCE) に関する紫色の診断画面 (PSOD - purple screen of death) で停止する