Decoding Machine Check Error (MCE) output after an ESXi panic (Purple Screen)

Article ID: 367928

Updated On: 03-04-2025

Products

VMware vSphere ESXi

Issue/Introduction

An ESXi host halts with a purple diagnostic screen
The purple diagnostic screen shows a message similar to:

Machine Check Exception on PCPU42 in world 10021342342
System has encountered a Hardware Error - Please contact the hardware vendor

When extracting the logs from the core dump, and possibly also on the purple diagnostic screen, you see a message similar to:
- ESXi 6.5.x or later:
- cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 Aaf9e74900 M86 Paf9e74900/4
- ESXi 6.0.x:
- MC:PCPU42 B:4 S:0xbe00000000800400 M:0x41800d55315c A:0x41800d55315c 5

Note: If you experience a purple diagnostic screen which does not mention MC, Machine Check Exception, or Hardware (Machine) Error, see Interpreting an ESXi host purple diagnostic screen (343033).

Environment

6.7
7.0
8.0

Cause

The machine check architecture is a mechanism within a CPU to detect and report hardware issues. When a problem is detected, a Machine Check Exception (MCE) is thrown. If an MCE is thrown and a purple diagnostic screen displays, a hardware problem has caused it. There is no other way to generate an MCE.

Resolution

When the system faults with a purple screen:

Capture the screen output.
Reboot the server.
Contact your hardware vendor.

In the meantime, the information regarding the fault itself can be decoded to get a better idea of what may be happening.

Recent CPUs from Intel and AMD implement a machine-check architecture that detects and reports hardware issues, including system bus errors, RAM (ECC and parity) errors, and other CPU errors. A set of model-specific registers (MSRs) is used to report errors. When a hardware error occurs, the global and bank-specific machine-check architecture status registers are populated with information regarding the cause, and whether the CPU can safely continue execution. In the case of a correctable error, ESXi reports the incident and register contents in the VMkernel logs. If an error is uncorrectable, and the CPU cannot continue safely, ESXi halts with a purple diagnostic screen.

During an MCE, the contents of the machine-check architecture registers are logged. The messages appear on the purple diagnostic screen itself and are recorded in the log file within the VMkernel zdump file. For more information, see Extracting the log file after an ESXi host fails with a purple screen error. If serial-line logging is configured, the same messages are emitted on the serial port.

Machine-Check Architecture Registers:

The global MCA register (MCG_STATUS) reports whether an MCE is in progress, and whether the instruction pointer pushed onto the stack can be used to reliably restart program execution or is directly associated with the error. The global capabilities (MCG_CAP) register identifies the capabilities of the machine-check architecture of the processor. The lower 8 bits specify the number of hardware-unit error-reporting banks present in a particular processor. A bank of error-reporting registers is associated with a specific hardware unit (or group of hardware units), though the association is vendor- and model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.
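As a minimal sketch of the bank-count field described above, the following Python snippet masks the lower 8 bits of an MCG_CAP value. The function name and the sample register value are hypothetical, chosen only for illustration; the value is not taken from any real processor.

```python
def mcg_cap_bank_count(mcg_cap: int) -> int:
    """Return the number of error-reporting banks (MCG_CAP bits 7:0)."""
    return mcg_cap & 0xFF


# Hypothetical MCG_CAP value, used only to illustrate the mask.
sample_mcg_cap = 0x0F000C14
print(mcg_cap_bank_count(sample_mcg_cap))
```

The remaining MCG_CAP bits describe other capabilities and are deliberately ignored here; consult the vendor documentation for their meaning.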

Each error-reporting bank comprises several registers. Of primary interest during a machine check exception is the status register (MCi_STATUS) of the bank, which contains detailed information regarding the machine check exception, and the address (MCi_ADDR) and miscellaneous (MCi_MISC) registers, which may provide additional information.

Identifying register contents

Different versions of ESXi log the machine-check architecture register contents using different formats. For more information, see Determining VMware Software Version and Build Number.

Regardless of the version of ESXi, these items of information should be available:

Physical CPU number
Global status register
Bank number
Bank status register
Bank address register
Bank miscellaneous register

ESXi 6.5 and later:

The log message consists of one line for each bank of interest, including the physical CPU number, the text "MCA:", the error class, how the error was reported, the MCG_STATUS register (G), the bank number (B), the MCi_STATUS register (S), the MCi_ADDR register (A), the MCi_MISC register (M), the decoded system physical address and size (P) in 6.7 and later, and a human-readable interpretation of the error.

cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 Aaf9e74900 M86 Paf9e74900/40 Cache Hierarchy: Level 0 Data Cache Read Error.

The error class may be one of the following:

UC: Uncorrected, unrecoverable
SRAR: Uncorrected, recoverable, action required (Intel)
SRAO: Uncorrected, recoverable, action optional (Intel)
UCNA: Uncorrected, no action required (Intel)
UCR: Uncorrected, recoverable (AMD)
CE: Corrected
DE: Deferred (AMD)

How the error was reported may be one of the following:

Init: Found during boot-time initialization (possibly from prior to the reboot)
Poll: Periodic polling of the MCA banks
Excp: Machine Check Exception handler
Intr: Corrected Machine Check Interrupt handler
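The field layout described above can be sketched as a small parser. This is an illustration built only from the single sample line in this article, not a definitive grammar for ESXi logs: the regular expression, the function name, and the assumptions that the bank number is decimal and that the P field is optional are all hypothetical.

```python
import re

# Pattern derived from the one sample line in this article; real logs may differ.
MCA_LINE = re.compile(
    r"cpu(?P<pcpu>\d+):.*?MCA:.*?"
    r"\b(?P<error_class>UC|SRAR|SRAO|UCNA|UCR|CE|DE)\s+"
    r"(?P<source>Init|Poll|Excp|Intr)\s+"
    r"G(?P<mcg_status>[0-9a-fA-F]+)\s+"
    r"B(?P<bank>\d+)\s+"
    r"S(?P<status>[0-9a-fA-F]+)\s+"
    r"A(?P<addr>[0-9a-fA-F]+)\s+"
    r"M(?P<misc>[0-9a-fA-F]+)"
    r"(?:\s+P(?P<phys>[0-9a-fA-F]+)/(?P<size>[0-9a-fA-F]+))?"
    r"\s*(?P<text>.*)"
)


def parse_mca_line(line):
    """Split an ESXi 6.5+ style MCA line into fields, or return None."""
    match = MCA_LINE.search(line)
    if match is None:
        return None
    rec = match.groupdict()
    rec["pcpu"] = int(rec["pcpu"])
    rec["bank"] = int(rec["bank"])  # assumption: bank is printed in decimal
    for key in ("mcg_status", "status", "addr", "misc"):
        rec[key] = int(rec[key], 16)  # register values are printed in hex
    if rec["phys"] is not None:
        rec["phys"] = int(rec["phys"], 16)
    rec["text"] = rec["text"].strip()  # "size" is kept as the raw string
    return rec


SAMPLE = ("cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 "
          "Aaf9e74900 M86 Paf9e74900/40 Cache Hierarchy: Level 0 Data Cache Read Error.")
record = parse_mca_line(SAMPLE)
print(record["error_class"], record["source"], hex(record["status"]))
```

Keeping the raw register values as integers makes it straightforward to apply the bit-level decoding described later in this article.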

ESXi 6.0:

The log message consists of one line for each bank of interest, including the text "MC:", the physical CPU number (PCPU), the bank number (B), the MCi_STATUS register (S), the MCi_MISC register (M), the MCi_ADDR register (A), and the MCG_STATUS register.

MC:PCPU42 B:4 S:0xbe00000000800400 M:0x41800d55315c A:0x41800d55315c 5
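The ESXi 6.0 layout can be sketched in the same way. The pattern below is inferred from the one sample line in this article, so treat it as an assumption rather than a specification; in particular, it assumes the bank number is decimal and that the trailing MCG_STATUS value is hexadecimal without a prefix. The function name is hypothetical.

```python
import re

# Pattern inferred from the single ESXi 6.0 sample line in this article.
MC_LINE = re.compile(
    r"MC:PCPU(?P<pcpu>\d+)\s+"
    r"B:(?P<bank>\d+)\s+"
    r"S:0x(?P<status>[0-9a-fA-F]+)\s+"
    r"M:0x(?P<misc>[0-9a-fA-F]+)\s+"
    r"A:0x(?P<addr>[0-9a-fA-F]+)\s+"
    r"(?P<mcg_status>[0-9a-fA-F]+)"
)


def parse_mc_line(line):
    """Split an ESXi 6.0 style MC line into integer fields, or return None."""
    match = MC_LINE.search(line)
    if match is None:
        return None
    return {
        "pcpu": int(match["pcpu"]),
        "bank": int(match["bank"]),  # assumption: decimal
        "status": int(match["status"], 16),
        "misc": int(match["misc"], 16),
        "addr": int(match["addr"], 16),
        "mcg_status": int(match["mcg_status"], 16),  # assumption: hex
    }


SAMPLE_60 = "MC:PCPU42 B:4 S:0xbe00000000800400 M:0x41800d55315c A:0x41800d55315c 5"
print(parse_mc_line(SAMPLE_60))
```

Note that this format lists MCi_MISC before MCi_ADDR, the reverse of the ESXi 6.5 and later format.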

Automatic Interpretation:

VMware ESXi attempts to interpret the contents of the status register(s) for display in the log and on the purple diagnostic screen.

For example:

Cache Hierarchy: Level 0 Data Cache Read Error.
Bus error, node originated, read, memory access

Note: Where the automatic interpretation and vendor interpretation disagree, the interpretation of the vendor should be taken as correct. The raw contents of the status registers are also available, so they can be manually reviewed.

Decoding the global MCA status (MCG_STATUS) register

The global status register is 64 bits wide, but only the low 3 bits have meaning. The high 61 bits are reserved. The global status register value can be converted to binary and compared against this layout:

Bit(s)   63:3       2      1      0
Field    Reserved   MCIP   EIPV   RIPV

Bit 2: Machine Check In Progress. Identifies whether a machine check is in progress, and whether further fields should be consulted.
Bit 1: Error IP Valid. Identifies whether the instruction pointer pushed on to the stack is directly related to the error.
Bit 0: Restart IP Valid. Identifies whether the program execution can be reliably restarted at the instruction pointer pushed on to the stack.

For example, the global status register value "5" is equal to 0101 in binary. This translates to MCIP=1, EIPV=0, RIPV=1, which indicates that there is a machine check in progress, and the Restart IP is valid.
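The same decoding can be sketched in Python. The function name is hypothetical; the bit positions are those listed above.

```python
def decode_mcg_status(value: int) -> dict:
    """Decode the three defined bits of MCG_STATUS; bits 63:3 are reserved."""
    return {
        "MCIP": bool(value & 0x4),  # bit 2: machine check in progress
        "EIPV": bool(value & 0x2),  # bit 1: error IP valid
        "RIPV": bool(value & 0x1),  # bit 0: restart IP valid
    }


# The value "5" from the example above.
print(decode_mcg_status(0x5))
```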

Overview of the bank status (MCi_STATUS) register

Each bank's MCi_STATUS register contains information related to a machine-check error. This information is only meaningful and logged if the Valid flag (bit 63) is set. This register is 64 bits wide.

Bit(s)   63    62     61   60   59      58      57    56:32               31:16                 15:0
Field    VAL   OVER   UC   EN   MISCV   ADDRV   PCC   Other Information   Extended Error Code   MCA Error Code

The high 7 bits (63:57) provide an overview of the processor state, and which of the other registers are meaningful:

Bit 63: VAL. Indicates (when set) that this bank's status (MCi_STATUS) register is valid, and that further fields should be consulted.
Bit 62: OVER. Indicates (when set) that a machine-check error occurred while the results of a previous error were still in the error-reporting register bank. May indicate that ESXi has not processed the MCE promptly, or that multiple MCEs occurred very close together.
Bit 61: UC. Indicates (when set) that the processor did not, or was not able to, correct the error condition. An ESXi host always generates a purple diagnostic screen when the processor indicates that the error condition was uncorrectable.
Bit 60: EN. Indicates (when set) that the error was enabled by the associated EEj bit of the MCi_CTL register. Will generally be 1.
Bit 59: MISCV. Indicates (when set) that the associated miscellaneous register (MCi_MISC) for this bank is valid, and contains additional information regarding the error.
Bit 58: ADDRV. Indicates (when set) that the associated address register (MCi_ADDR) for this bank is valid, and contains the memory address where the error occurred. The memory address may be physical or virtual, depending on the type of error encountered.
Bit 57: PCC. Indicates (when set) that the state of the processor may have been corrupted by the error condition, and that it may not be possible to reliably resume software execution.

Note: For more information, see the vendor documentation listed in the Additional Information section of this article.

Bits 56:32 contain other information, which may be reserved, used for counters, or hold other information that is model-specific.

Bits 31:16 contain a model-specific extended error code.

Bits 15:0 contain the machine-check architecture-defined error code for the machine-check error condition detected. These error codes are the same for all processors which implement the machine-check architecture, though individual processor models may define additional nuance.
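The register layout above can be sketched as a field extractor. The names used here are hypothetical; the bit positions follow the overview in this section, and the model-specific fields are returned as raw integers because their meaning depends on the processor.

```python
# Flag positions from the MCi_STATUS overview in this section.
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its flags and code fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["other_info"] = (status >> 32) & 0x1FFFFFF  # bits 56:32 (25 bits)
    fields["extended_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    fields["mca_code"] = status & 0xFFFF               # bits 15:0
    return fields


# Status value from the ESXi 6.5 and later sample line in this article.
decoded = decode_mci_status(0xbf80000000000114)
print(decoded)
print(format(decoded["mca_code"], "016b"))
```

If VAL is not set, the remaining fields should not be interpreted.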

Machine-check architecture-defined error codes in the bank status (MCi_STATUS) register

The machine-check architecture defines several errors which may be present in any bank's status register, grouped into Simple and Compound error codes. Identify the pattern which matches the contents of the status register.

Simple Error Codes reflect a specific fault, exactly matching the contents of the status register:

0000 0000 0000 0000– No error has been reported to this bank.
0000 0000 0000 0001– Unclassified. This error has not been classified into the MCA error classes. The additional information section may have meaning.
0000 0000 0000 0010– Parity error in internal microcode ROM
0000 0000 0000 0011– The BINT# from another processor caused this processor to enter machine-check.
0000 0000 0000 0100– Functional redundancy check (FRC) master/slave error.
0000 0000 0000 0101– Internal parity error.
0000 0100 0000 0000– Internal timer error.
0000 01xx xxxx xxxx– Internal unclassified error. At least one x equals 1
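A minimal sketch of matching the simple error codes listed above, assuming the 16-bit MCA error code has already been extracted from bits 15:0 of the status register. The table and function names are hypothetical.

```python
# Exact-match simple error codes, as listed in this article.
SIMPLE_CODES = {
    0x0000: "No error has been reported to this bank",
    0x0001: "Unclassified",
    0x0002: "Parity error in internal microcode ROM",
    0x0003: "BINT# from another processor caused machine-check entry",
    0x0004: "Functional redundancy check (FRC) master/slave error",
    0x0005: "Internal parity error",
    0x0400: "Internal timer error",
}


def decode_simple_code(mca_code: int):
    """Return a description for a simple MCA error code, or None."""
    code = mca_code & 0xFFFF
    if code in SIMPLE_CODES:
        return SIMPLE_CODES[code]
    # 0000 01xx xxxx xxxx with at least one x equal to 1
    if (code & 0xFC00) == 0x0400 and (code & 0x03FF) != 0:
        return "Internal unclassified error"
    return None  # not a simple code; try the compound patterns


# MCA error code from the ESXi 6.0 sample line (S:0xbe00000000800400).
print(decode_simple_code(0xbe00000000800400 & 0xFFFF))
```

A return value of None means the code should be checked against the compound error code patterns instead.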

Compound Error Codes follow a pattern, and define multiple aspects of the error with a single error number:

000F 0000 0000 11LL– Generic cache hierarchy errors.
000F 0000 0001 TTLL– TLB errors.
000F 0000 1MMM CCCC– Memory controller errors (Intel-only).
000F 0001 RRRR TTLL– Memory errors in the cache hierarchy.
000F 1PPT RRRR IILL– Bus and interconnect errors.

Compound Error Code sub-fields define sections of a compound error code. Use these to populate the template defined by the compound error code:

Encoding of Transaction Type (TT) sub-field:

00– Instruction
01– Data
10– Generic
11– Reserved

Encoding of Memory Hierarchy Level (LL) sub-field:

00– Level 0
01– Level 1
10– Level 2
11– Generic

Encoding of memory transaction type (MMM) sub-field:

000– Generic undefined request
001– Memory read error
010– Memory write error.
011– Address or command error.
100– Memory scrubbing error.
101-111– Reserved.

Encoding of channel number (CCCC) sub-field:

0000-1110– Channel number.
1111– Channel not specified.

Encoding of Request (RRRR) sub-field:

0000– Generic error
0001– Generic read
0010– Generic write
0011– Data read
0100– Data write
0101– Instruction fetch
0110– Prefetch
0111– Evict
1000– Snoop (probe)

Encoding of Participation Processor (PP) sub-field:

00– Local node originated the request.
01– Local node responded to the request.
10– Local node observed error as third-party.
11– Generic

Encoding of Timeout (T) sub-field:

0– Request did not time out.
1– Request timed out.

Encoding of Memory/IO (II) sub-field:

00– Memory access
01– Reserved
10– I/O
11– Other
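The compound templates and sub-field tables above can be combined into one decoder, sketched below. All names are hypothetical. The F position (bit 12) is not described in this article, so the sketch simply ignores it; encodings that the tables above do not list are reported as "Unlisted encoding".

```python
TT = {0: "Instruction", 1: "Data", 2: "Generic", 3: "Reserved"}
LL = {0: "Level 0", 1: "Level 1", 2: "Level 2", 3: "Generic"}
MMM = {0: "Generic undefined request", 1: "Memory read error",
       2: "Memory write error", 3: "Address or command error",
       4: "Memory scrubbing error"}
RRRR = {0: "Generic error", 1: "Generic read", 2: "Generic write",
        3: "Data read", 4: "Data write", 5: "Instruction fetch",
        6: "Prefetch", 7: "Evict", 8: "Snoop (probe)"}
PP = {0: "Local node originated the request",
      1: "Local node responded to the request",
      2: "Local node observed error as third-party", 3: "Generic"}
II = {0: "Memory access", 1: "Reserved", 2: "I/O", 3: "Other"}


def decode_compound_code(mca_code: int):
    """Match a 16-bit MCA error code against the compound templates."""
    code = mca_code & 0xFFFF
    if code & 0xE000:            # top three bits must be zero
        return None
    low = code & 0x0FFF          # ignore the F position (bit 12)
    ll = LL[low & 0x3]
    tt = TT[(low >> 2) & 0x3]
    rrrr = RRRR.get((low >> 4) & 0xF, "Unlisted encoding")
    if low & 0x800:              # 000F 1PPT RRRR IILL
        return {"type": "Bus and interconnect error",
                "participation": PP[(low >> 9) & 0x3],
                "timeout": bool(low & 0x100), "request": rrrr,
                "memory_io": II[(low >> 2) & 0x3], "level": ll}
    if (low & 0xF00) == 0x100:   # 000F 0001 RRRR TTLL
        return {"type": "Memory error in the cache hierarchy",
                "request": rrrr, "transaction": tt, "level": ll}
    if (low & 0xF80) == 0x080:   # 000F 0000 1MMM CCCC
        channel = low & 0xF
        return {"type": "Memory controller error",
                "memory_transaction": MMM.get((low >> 4) & 0x7, "Reserved"),
                "channel": None if channel == 0xF else channel}
    if (low & 0xFF0) == 0x010:   # 000F 0000 0001 TTLL
        return {"type": "TLB error", "transaction": tt, "level": ll}
    if (low & 0xFFC) == 0x00C:   # 000F 0000 0000 11LL
        return {"type": "Generic cache hierarchy error", "level": ll}
    return None                  # not a compound code


# MCA error code 0x0114 from the ESXi 6.5 and later sample line.
print(decode_compound_code(0x0114))
```

For the sample status value Sbf80000000000114, the low 16 bits are 0000 0001 0001 0100, which fits the cache hierarchy template with RRRR=0001, TT=01, and LL=00, consistent with the logged interpretation "Level 0 Data Cache Read Error".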

Model-specific error codes in the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers:

The machine-check architecture allows for bits or groups of bits within the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers to take on additional meaning based on the processor model and the bank number. Listing the field meanings for all processor families is outside the scope of this article.

To interpret the additional contents of the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers, review the documentation for the specific processor model or contact the hardware vendor.

Other considerations:

Information reported by the machine-check architecture provides aid in troubleshooting a hardware issue. However, the information available from the MCA error code may be insufficient to root-cause the issue. If more information is required, refer to the processor documentation from the manufacturer.
Information reported by the machine-check architecture should be considered in context of other errors when attempting to determine a pattern of outages.
If the machine-check architecture reports invalid information, but an MCE has occurred, this is still reflective of a hardware fault.
Providing the full machine-check architecture register contents to the hardware vendor may assist their investigation into the cause of the hardware fault.

Certain kinds of machine check errors do not cause ESXi to panic. Some errors are completely corrected by hardware, such as memory errors that are corrected by Error Correcting Code (ECC) hardware, but the hardware may still report them to ESXi for advisory reasons.

Other errors cannot be corrected by hardware, but can still be recovered from by terminating the task that encountered the error. For example, when a memory error is too severe to be corrected by ECC hardware, it may still be possible for the system to terminate only the virtual machine or process that was using the corrupted data, while allowing other virtual machines and processes to continue running. In other cases, however, an error that is recoverable in theory cannot actually be recovered from because the ESXi kernel was using the corrupted data, so ESXi still must panic.

Both corrected errors and recoverable errors appear in the vmkernel log and can be decoded using the instructions in this article. If a virtual machine or other process had to be terminated as part of recovery, the details generally are logged as well.
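As a rough sketch of reviewing a log for such entries, the following Python snippet tallies MCA lines by error class. It assumes the ESXi 6.5 and later line format shown earlier in this article; the function name is hypothetical, and the second sample line is constructed purely for illustration and is not real log output.

```python
import re
from collections import Counter

# Error class followed by the reporting source, as in the 6.5+ format.
CLASS_RE = re.compile(
    r"MCA:.*?\b(UC|SRAR|SRAO|UCNA|UCR|CE|DE)\s+(?:Init|Poll|Excp|Intr)\b"
)


def tally_mca_classes(lines):
    """Count MCA log lines per error class in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = CLASS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE_LINES = [
    # Sample line from this article.
    "cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 Aaf9e74900 M86",
    # Hypothetical corrected-error line, made up for this sketch.
    "cpu3:...)MCA: ...: CE Poll G0 B7 S9c00000000000090 A1000 M0",
    "an unrelated log line",
]
print(tally_mca_classes(SAMPLE_LINES))
```

A steadily growing count of corrected (CE) entries for the same bank is the kind of pattern worth raising with the hardware vendor, alongside the raw register values.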

Additional Information

By default, the ESXi host VMkernel writes logs to /var/log/vmkernel.log. These logs can be redirected to an alternate local path or to a remote host. For more information, see the Basic System Administration Guide for your version of ESXi (Embedded or Installable). If you require the support logs beyond the last reboot, it may be advisable to log to both a remote disk and a remote syslogd server.

For more information, see:
Intel - Chapters 15 and 16 of the Intel 64 and IA-32 Architectures Software Developer's Manual.

AMD - Chapter 9 of the AMD64 Architecture Programmer’s Manual

Interpreting an ESXi host purple diagnostic screen
Extracting the log file after an ESXi host fails with a purple screen error

Dell: VMware: Debugging ESXi Machine Check Exception (MCE) PSOD
