Decoding Machine Check Error (MCE) output after an ESXi panic (Purple Screen)

Article ID: 367928

Products

VMware vSphere ESXi

Issue/Introduction

  • An ESXi host halts with a purple diagnostic screen
  • The purple diagnostic screen shows a message similar to:
Machine Check Exception on PCPU42 in world 10021342342
System has encountered a Hardware Error - Please contact the hardware vendor
  • When extracting the logs from the core dump, and possibly also on the purple diagnostic screen, you see a message similar to:

    • ESXi 6.5.x or later:
    • cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 Aaf9e74900 M86 Paf9e74900/4
    • ESXi 6.0.x:
    • MC:PCPU42 B:4 S:0xbe00000000800400 M:0x41800d55315c A:0x41800d55315c 5

 

The machine check architecture is a mechanism within a CPU to detect and report hardware issues. When a problem is detected, a Machine Check Exception (MCE) is thrown. If an MCE is thrown and a purple diagnostic screen displays, a hardware problem has caused it. There is no other way to generate an MCE.

When the system faults with an MCE purple diagnostic screen, capture the screen output (for example, with a screenshot), reboot the server, collect the logs, and contact your hardware vendor. In the meantime, the information about the fault can be decoded to get a better idea of what may be happening.
Note: If you experience a purple diagnostic screen which does not mention MC, Machine Check Exception, or Hardware (Machine) Error, see Interpreting an ESX host purple diagnostic screen (343033).

Resolution

Recent CPUs from Intel and AMD implement a machine-check architecture that detects and reports hardware issues, including system bus errors, RAM (ECC and parity) errors, and other CPU errors. A set of model-specific registers (MSRs) is used to report these errors.

When a hardware error occurs, global and bank-specific status machine-check architecture registers are populated with information regarding the cause, and whether the CPU can safely continue execution. In the case of a correctable error, ESXi reports the incident and register contents in the VMkernel logs. If an error is uncorrectable, and the CPU cannot continue safely, ESXi halts with a purple diagnostic screen.

During an MCE, the contents of the machine-check architecture registers are logged. The messages appear on the purple diagnostic screen itself and are recorded in the log file within the VMkernel zdump file. For more information, see Extracting the log file after an ESX or ESXi host fails with a purple screen error (310769). If serial-line logging is configured, the same messages are emitted on the serial port. For more information, see Enabling serial-line logging for an ESX and ESXi host (344469).

Machine-Check Architecture Registers:

The global MCA register (MCG_STATUS) reports whether an MCE is in progress, and if the instruction pointer pushed on to the stack can be used to reliably restart program execution or is directly associated with the error.

The global capabilities (MCG_CAP) register identifies the capabilities of the machine-check architecture of the processor. The lower 8 bits specify the number of hardware-unit error-reporting banks present in a particular processor. Each bank of error-reporting registers is associated with a specific hardware unit or group of hardware units, though the association is vendor- and model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.
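
As a minimal sketch of the bank-count field described above, the following Python helper masks off the lower 8 bits of an MCG_CAP value. The function name and the sample register value are hypothetical and chosen only for illustration. MCG_CAP does not appear in the ESXi log lines discussed in this article.

```python
def mcg_cap_bank_count(mcg_cap: int) -> int:
    """Return the number of error-reporting banks (bits 7:0 of MCG_CAP)."""
    return mcg_cap & 0xFF


# Hypothetical MCG_CAP value used only to exercise the helper.
example_mcg_cap = 0x0F000C14
print(mcg_cap_bank_count(example_mcg_cap))  # low byte 0x14
```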

Each error-reporting bank comprises several registers. Of primary interest during a machine check exception is the status register (MCi_STATUS) of the bank, which contains detailed information regarding the machine check exception. The address (MCi_ADDR) and miscellaneous (MCi_MISC) registers may provide additional information.

Identifying register contents

Different versions of ESXi log the machine-check architecture register contents using different formats. To identify the version in use, see Determining VMware Software Version and Build Number (320235).

Regardless of the version of ESXi, these items of information should be available:

  • Physical CPU number
  • Global status register
  • Bank number
  • Bank status register
  • Bank address register
  • Bank miscellaneous register

ESXi 6.5 and later:

The log message consists of one line for each bank of interest, including the physical CPU number, the text "MCA:", the error class, how the error was reported, the MCG_STATUS register (G), the bank number (B), the MCi_STATUS register (S), the MCi_ADDR register (A), the MCi_MISC register (M), the decoded system physical address and size (P) in 6.7 and later, and a human-readable interpretation of the error.
cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 Aaf9e74900 M86 Paf9e74900/40 Cache Hierarchy: Level 0 Data Cache Read Error.

The error class may be one of the following:

  • UC: Uncorrected, unrecoverable
  • SRAR: Uncorrected, recoverable, action required (Intel)
  • SRAO: Uncorrected, recoverable, action optional (Intel)
  • UCNA: Uncorrected, no action required (Intel)
  • UCR: Uncorrected, recoverable (AMD)
  • CE: Corrected
  • DE: Deferred (AMD)

How the error was reported may be one of the following:

  • Init: Found during boot-time initialization (possibly from prior to the reboot)
  • Poll: Periodic polling of the MCA banks
  • Excp: Machine Check Exception handler
  • Intr: Corrected Machine Check Interrupt handler
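
The ESXi 6.5 and later line format described above can be split into its fields with a regular expression. This is a rough sketch and not a VMware tool. The pattern is inferred from the single sample line in this article. It assumes the bank number is decimal and the register values are hexadecimal without a prefix, and the names `MCA_LINE` and `parse_mca_line` are hypothetical.

```python
import re

# Pattern inferred from the sample line; spacing and number bases are assumptions.
MCA_LINE = re.compile(
    r"cpu(?P<pcpu>\d+):.*?MCA:.*?"
    r"\b(?P<error_class>UCNA|UCR|UC|SRAR|SRAO|CE|DE)\s+"
    r"(?P<reported_by>Init|Poll|Excp|Intr)\s+"
    r"G(?P<mcg_status>[0-9a-fA-F]+)\s+"
    r"B(?P<bank>\d+)\s+"
    r"S(?P<status>[0-9a-fA-F]+)\s+"
    r"A(?P<addr>[0-9a-fA-F]+)\s+"
    r"M(?P<misc>[0-9a-fA-F]+)"
    r"(?:\s+P(?P<phys>[0-9a-fA-F]+)/(?P<size>[0-9a-fA-F]+))?"  # 6.7 and later only
    r"\s*(?P<text>.*)$"
)


def parse_mca_line(line):
    """Return the fields of one 'MCA:' log line as a dict, or None if no match."""
    match = MCA_LINE.search(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["pcpu"] = int(record["pcpu"])
    record["bank"] = int(record["bank"])  # assumed decimal
    for key in ("mcg_status", "status", "addr", "misc"):
        record[key] = int(record[key], 16)  # registers are printed in hex
    return record  # 'phys' and 'size' stay as raw strings (or None)


sample = ("cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 "
          "Aaf9e74900 M86 Paf9e74900/40 Cache Hierarchy: Level 0 Data Cache Read Error.")
print(parse_mca_line(sample))
```

Keeping the decoded physical address and size as strings avoids guessing their number base, which this article does not state.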

ESXi 6.0:

The log message consists of one line for each bank of interest, including the text "MC:", the physical CPU number (PCPU), the bank number (B), the MCi_STATUS register (S), the MCi_MISC register (M), the MCi_ADDR register (A), and, as the final unlabeled field, the MCG_STATUS register.
MC:PCPU42 B:4 S:0xbe00000000800400 M:0x41800d55315c A:0x41800d55315c 5
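
The ESXi 6.0 line can be taken apart in a similar way. The sketch below is inferred only from the sample line above. It assumes a decimal bank number and a hexadecimal trailing MCG_STATUS value, and the names `MC_LINE_60` and `parse_mc_line_60` are hypothetical.

```python
import re

# Pattern inferred from the single ESXi 6.0 sample line in this article.
MC_LINE_60 = re.compile(
    r"MC:\s*PCPU(?P<pcpu>\d+)\s+"
    r"B:(?P<bank>\d+)\s+"
    r"S:0x(?P<status>[0-9a-fA-F]+)\s+"
    r"M:0x(?P<misc>[0-9a-fA-F]+)\s+"
    r"A:0x(?P<addr>[0-9a-fA-F]+)\s+"
    r"(?P<mcg_status>[0-9a-fA-F]+)"  # final unlabeled field: MCG_STATUS
)


def parse_mc_line_60(line):
    """Return the fields of one ESXi 6.0 'MC:' line as a dict, or None."""
    match = MC_LINE_60.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["pcpu"] = int(record["pcpu"])
    record["bank"] = int(record["bank"])  # assumed decimal
    for key in ("status", "misc", "addr", "mcg_status"):
        record[key] = int(record[key], 16)
    return record


print(parse_mc_line_60(
    "MC:PCPU42 B:4 S:0xbe00000000800400 M:0x41800d55315c A:0x41800d55315c 5"))
```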

Automatic Interpretation:

VMware ESXi attempts to interpret the contents of the status register(s) for display in the log and on the purple diagnostic screen.

For example:
  • Cache Hierarchy: Level 0 Data Cache Read Error.
  • Bus error, node originated, read, memory access
Note: Where the automatic interpretation and vendor interpretation disagree, the interpretation of the vendor should be taken as correct. The raw contents of the status registers are also available, so they can be manually reviewed.

Decoding the global MCA status (MCG_STATUS) register

The global status register is 64 bits wide, but only the low 3 bits have meaning. The high 61 bits are reserved. To decode the register, convert the logged value to binary and compare it against the bit layout below.

Bits:   63:3      2     1     0
Field:  Reserved  MCIP  EIPV  RIPV
  • Bit 2: Machine Check In Progress. Identifies whether a machine check is in progress, and whether further fields should be consulted.
  • Bit 1: Error IP Valid. Identifies whether the instruction pointer pushed on to the stack is directly related to the error.
  • Bit 0: Restart IP Valid. Identifies whether the program execution can be reliably restarted at the instruction pointer pushed on to the stack.

For example, the global status register value "5" is equal to 0101 in binary. This translates to MCIP=1, EIPV=0, RIPV=1, which indicates that there is a machine check in progress, and the Restart IP is valid.
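
The bit tests above can be expressed as a short Python sketch. The function name `decode_mcg_status` is hypothetical. Only the three bits described in this article are decoded.

```python
def decode_mcg_status(value: int) -> dict:
    """Decode the three low bits of MCG_STATUS described above."""
    return {
        "RIPV": bool(value & 0x1),  # bit 0: Restart IP Valid
        "EIPV": bool(value & 0x2),  # bit 1: Error IP Valid
        "MCIP": bool(value & 0x4),  # bit 2: Machine Check In Progress
    }


# The value "5" from the example log lines.
print(decode_mcg_status(0x5))
```

For the value 5, this reports MCIP and RIPV as set and EIPV as clear, matching the worked example above.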

Overview of the bank status (MCi_STATUS) register

Each bank's MCi_STATUS register contains information related to a machine-check error. This information is only meaningful and logged if the Valid flag (bit 63) is set. This register is 64 bits wide.

Bits:   63   62    61  60  59     58     57   56:32              31:16                15:0
Field:  VAL  OVER  UC  EN  MISCV  ADDRV  PCC  Other Information  Extended Error Code  MCA Error Code
 
The high 7 bits (63:57) provide an overview of the processor state, and indicate which of the other registers are meaningful:
  • Bit 63: VAL. Indicates (when set) that this bank's status (MCi_STATUS) register is valid, and that further fields should be consulted.
  • Bit 62: OVER. Indicates (when set) that a machine-check error occurred while the results of a previous error were still in the error-reporting register bank. May indicate that ESXi has not processed the MCE promptly, or that multiple MCEs occurred very close together.
  • Bit 61: UC. Indicates (when set) that the processor did not, or was not able to, correct the error condition. An ESXi host generates a purple diagnostic screen when the processor reports an uncorrected error from which the host cannot safely recover.
  • Bit 60: EN. Indicates (when set) that the error was enabled by the associated EEj bit of the MCi_CTL register. Will generally be 1.
  • Bit 59: MISCV. Indicates (when set) that the associated miscellaneous register (MCi_MISC) for this bank is valid, and contains additional information regarding the error.
  • Bit 58: ADDRV. Indicates (when set) that the associated address register (MCi_ADDR) for this bank is valid, and contains the memory address where the error occurred. The address may be physical or virtual, depending on the type of error encountered.
  • Bit 57: PCC. Indicates (when set) that the state of the processor may have been corrupted by the error condition, and that it may not be possible to reliably resume software execution.

Bits 56:32 contain other information, which may be reserved, used for counters, or hold other information that is model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.

Bits 31:16 contain a model-specific extended error code. For more information, see the vendor documentation listed in the Additional Information section of this article.

Bits 15:0 contain the machine-check architecture-defined error code for the machine-check error condition detected. These error codes are the same for all processors which implement the machine-check architecture, though individual processor models may define additional nuance. For more information, see the vendor documentation listed in the Additional Information section of this article.
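
The field layout described above can be extracted with shifts and masks. The following Python sketch uses the hypothetical name `decode_mci_status` and is exercised with the MCi_STATUS value from the ESXi 6.5 sample line. The model-specific fields are returned as raw integers because their meaning depends on the processor.

```python
# Flag names and bit positions taken from the layout described above.
STATUS_FLAGS = (
    ("VAL", 63), ("OVER", 62), ("UC", 61), ("EN", 60),
    ("MISCV", 59), ("ADDRV", 58), ("PCC", 57),
)


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its flags and code fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS}
    fields["other_information"] = (status >> 32) & 0x1FFFFFF  # bits 56:32
    fields["extended_error_code"] = (status >> 16) & 0xFFFF   # bits 31:16
    fields["mca_error_code"] = status & 0xFFFF                # bits 15:0
    return fields


decoded = decode_mci_status(0xbf80000000000114)
for name, value in decoded.items():
    print(name, hex(value) if not isinstance(value, bool) else value)
```

If VAL is clear, the remaining fields should be ignored, as noted above.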

Machine-check architecture-defined error codes in the bank status (MCi_STATUS) register

The machine-check architecture defines several errors which may be present in any bank's status register, grouped into Simple and Compound error codes. Identify the pattern which matches the contents of the status register.

Simple Error Codes reflect a specific fault, exactly matching the contents of the status register:

  • 0000 0000 0000 0000 – No error has been reported to this bank.
  • 0000 0000 0000 0001 – Unclassified. This error has not been classified into the MCA error classes. The additional information section may have meaning.
  • 0000 0000 0000 0010 – Parity error in internal microcode ROM
  • 0000 0000 0000 0011 – The BINIT# from another processor caused this processor to enter machine-check.
  • 0000 0000 0000 0100 – Functional redundancy check (FRC) master/slave error.
  • 0000 0000 0000 0101 – Internal parity error.
  • 0000 0100 0000 0000 – Internal timer error.
  • 0000 01xx xxxx xxxx – Internal unclassified error. At least one x equals 1
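
The simple error codes listed above can be matched against bits 15:0 of MCi_STATUS with a small lookup. This is an illustrative sketch. The names `SIMPLE_ERROR_CODES` and `match_simple_error_code` are hypothetical, and only the codes listed in this article are covered.

```python
# Exact-match codes from the list above (bits 15:0 of MCi_STATUS).
SIMPLE_ERROR_CODES = {
    0x0000: "No error has been reported to this bank",
    0x0001: "Unclassified",
    0x0002: "Parity error in internal microcode ROM",
    0x0003: "BINIT# from another processor caused machine-check entry",
    0x0004: "Functional redundancy check (FRC) master/slave error",
    0x0005: "Internal parity error",
    0x0400: "Internal timer error",
}


def match_simple_error_code(status: int):
    """Return a description if bits 15:0 hold a simple error code, else None."""
    code = status & 0xFFFF
    if code in SIMPLE_ERROR_CODES:
        return SIMPLE_ERROR_CODES[code]
    # 0000 01xx xxxx xxxx with at least one x set (0x0400 itself matched above).
    if (code & 0xFC00) == 0x0400:
        return "Internal unclassified error"
    return None  # not a simple code; it may be a compound error code


# MCi_STATUS from the ESXi 6.0 sample line in this article.
print(match_simple_error_code(0xbe00000000800400))
```

Applied to the ESXi 6.0 sample value, whose low 16 bits are 0x0400, the lookup returns the internal timer error entry.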

Compound Error Codes follow a pattern, and define multiple aspects of the error with a single error number:

  • 000F 0000 0000 11LL – Generic cache hierarchy errors.
  • 000F 0000 0001 TTLL – TLB errors.
  • 000F 0000 1MMM CCCC – Memory controller errors (Intel-only).
  • 000F 0001 RRRR TTLL – Memory errors in the cache hierarchy.
  • 000F 1PPT RRRR IILL – Bus and interconnect errors.

Compound Error Code sub-fields define sections of a compound error code. Use these to populate the template defined by the compound error code:

  • Encoding of Transaction Type (TT) sub-field:
     
    • 00 – Instruction
    • 01 – Data
    • 10 – Generic
    • 11 – Reserved
  • Encoding of Memory Hierarchy Level (LL) sub-field:
     
    • 00 – Level 0
    • 01 – Level 1
    • 10 – Level 2
    • 11 – Generic
  • Encoding of memory transaction type (MMM) sub-field:
     
    • 000 – Generic undefined request
    • 001 – Memory read error
    • 010 – Memory write error.
    • 011 – Address or command error.
    • 100 – Memory scrubbing error.
    • 101-111 – Reserved.
  • Encoding of channel number (CCCC) sub-field:
     
    • 0000-1110 – Channel number.
    • 1111 – Channel not specified.
  • Encoding of Request (RRRR) sub-field:
     
    • 0000 – Generic error
    • 0001 – Generic read
    • 0010 – Generic write
    • 0011 – Data read
    • 0100 – Data write
    • 0101 – Instruction fetch
    • 0110 – Prefetch
    • 0111 – Evict
    • 1000 – Snoop (probe)
  • Encoding of Participation Processor (PP) sub-field:
     
    • 00 – Local node originated the request.
    • 01 – Local node responded to the request.
    • 10 – Local node observed error as third-party.
    • 11 – Generic
  • Encoding of Timeout (T) sub-field:
     
    • 0 – Request did not time out.
    • 1 – Request timed out.
  • Encoding of Memory/IO (II) sub-field:
    • 00 – Memory access
    • 01 – Reserved
    • 10 – I/O
    • 11 – Other
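
The compound error code templates and sub-field encodings above can be combined into a small decoder. The Python sketch below is illustrative only, and the function and table names are hypothetical. The F position (bit 12) is not defined in this article, so the sketch returns it as a raw bit. Request values not listed above are reported as "Not listed".

```python
TT = {0: "Instruction", 1: "Data", 2: "Generic", 3: "Reserved"}
LL = {0: "Level 0", 1: "Level 1", 2: "Level 2", 3: "Generic"}
MMM = {0: "Generic undefined request", 1: "Memory read error",
       2: "Memory write error", 3: "Address or command error",
       4: "Memory scrubbing error"}
RRRR = {0: "Generic error", 1: "Generic read", 2: "Generic write",
        3: "Data read", 4: "Data write", 5: "Instruction fetch",
        6: "Prefetch", 7: "Evict", 8: "Snoop (probe)"}
PP = {0: "Local node originated the request",
      1: "Local node responded to the request",
      2: "Local node observed error as third-party", 3: "Generic"}
II = {0: "Memory access", 1: "Reserved", 2: "I/O", 3: "Other"}


def decode_compound_error_code(status: int):
    """Decode bits 15:0 of MCi_STATUS as a compound error code, or return None."""
    code = status & 0xFFFF
    if code & 0xE000:                 # the top three bits are 0 in every template
        return None
    out = {"F": (code >> 12) & 1}     # F bit, returned undecoded
    body = code & 0x0FFF
    req = RRRR.get((body >> 4) & 0xF, "Not listed")
    if body & 0x800:                  # 1PPT RRRR IILL
        out.update(kind="Bus and interconnect error", PP=PP[(body >> 9) & 3],
                   timed_out=bool((body >> 8) & 1), RRRR=req,
                   II=II[(body >> 2) & 3], LL=LL[body & 3])
    elif (body & 0xF00) == 0x100:     # 0001 RRRR TTLL
        out.update(kind="Memory error in the cache hierarchy", RRRR=req,
                   TT=TT[(body >> 2) & 3], LL=LL[body & 3])
    elif (body & 0xF80) == 0x080:     # 0000 1MMM CCCC
        channel = body & 0xF
        out.update(kind="Memory controller error",
                   MMM=MMM.get((body >> 4) & 7, "Reserved"),
                   channel=None if channel == 0xF else channel)
    elif (body & 0xFF0) == 0x010:     # 0000 0001 TTLL
        out.update(kind="TLB error", TT=TT[(body >> 2) & 3], LL=LL[body & 3])
    elif (body & 0xFFC) == 0x00C:     # 0000 0000 11LL
        out.update(kind="Generic cache hierarchy error", LL=LL[body & 3])
    else:
        return None
    return out


# MCi_STATUS from the ESXi 6.5 sample line in this article.
print(decode_compound_error_code(0xbf80000000000114))
```

For the sample value, the low 16 bits are 0000 0001 0001 0100, which decodes to a generic read of data at level 0 in the cache hierarchy. This agrees with the automatic interpretation "Cache Hierarchy: Level 0 Data Cache Read Error" shown in the log line.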

Model-specific error codes in the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers:

The machine-check architecture allows for bits or groups of bits within the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers to take on additional meaning based on the processor model and the bank number. Listing the field meanings for all processor families is outside the scope of this article.

To interpret the additional contents of the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers, review the documentation for the specific processor model. For more information, see the vendor documentation listed in the Additional Information section of this article or contact the hardware vendor.

Other considerations

  • Information reported by the machine-check architecture provides aid in troubleshooting a hardware issue. However, the information available from the MCA error code may be insufficient to root-cause the issue. If more information is required, refer to the processor documentation from the manufacturer.
     
  • Information reported by the machine-check architecture should be considered in context of other errors when attempting to determine a pattern of outages.
     
  • If the machine-check architecture reports invalid information, but an MCE has occurred, this is still reflective of a hardware fault.
     
  • Providing the full machine-check architecture register contents to the hardware vendor may assist their investigation into the cause of the hardware fault.
 
Related Information
Some kinds of machine check errors do not cause ESXi to panic.

Corrected errors:

 Some errors are completely corrected by hardware, such as memory errors that are corrected by Error Correcting Code (ECC) hardware, but the hardware may still report them to ESXi for advisory reasons.

Recoverable errors:

Some errors cannot be corrected by hardware, but can still be recovered from by terminating the task that encountered the error. For example, when a memory error is too severe to be corrected by ECC hardware, it may still be possible for the system to terminate only the virtual machine or process that was using the corrupted data, while allowing other virtual machines and processes to continue running. In other cases, however, an error that is recoverable in theory cannot actually be recovered from because the ESXi kernel was using the corrupted data, so ESXi still must panic.

Both corrected errors and recoverable errors appear in the vmkernel log and can be decoded using the instructions in this article. If a virtual machine or other process had to be terminated as part of recovery, the details generally are logged as well.

For more information, see:
Enabling serial-line logging for an ESX and ESXi host (344469)
Interpreting an ESX host purple diagnostic screen (343033)
Extracting the log file after an ESX or ESXi host fails with a purple screen error (310769)

Additional Information

By default, the ESXi host vmkernel writes logs to /var/log/messages. These logs can be redirected to an alternate local path or they can be redirected to a remote host. For more information, see the Basic System Administration Guide for your version of ESXi (Embedded or Installable). If you require the support logs beyond the last reboot, it may be advisable to log to both a remote disk and a remote syslogd server.
“vm-support” command in ESX/ESXi to collect diagnostic information
Collecting diagnostic information for VMware ESX/ESXi

https://www.dell.com/support/kbdoc/en-us/000215212/ax750-os-issue-psp