When extracting the logs from the core dump, and possibly also on the purple diagnostic screen, you see a message similar to:
The machine check architecture is a mechanism within a CPU to detect and report hardware issues. When a problem is detected, a Machine Check Exception (MCE) is thrown. If an MCE is thrown and a purple diagnostic screen displays, a hardware problem has caused it. There is no other way to generate an MCE.
When the system faults with a purple screen, capture the screen output, reboot the server, and contact your hardware vendor. In the meantime, the fault information can be decoded to get a better idea of what may be happening.
Recent CPUs from Intel and AMD implement a machine-check architecture that detects and reports hardware issues, including system bus errors, RAM (ECC and parity) errors, and other CPU errors. A set of model-specific registers (MSRs) is used to report errors.
When a hardware error occurs, global and bank-specific status machine-check architecture registers are populated with information regarding the cause, and whether the CPU can safely continue execution. In the case of a correctable error, ESXi reports the incident and register contents in the VMkernel logs. If an error is uncorrectable, and the CPU cannot continue safely, ESXi halts with a purple diagnostic screen.
During an MCE, the contents of the machine-check architecture registers are logged. The messages appear on the purple diagnostic screen itself and are recorded in the log file within the VMkernel zdump file. For more information, see Extracting the log file after an ESX or ESXi host fails with a purple screen error (310769). If serial-line logging is configured, the same messages are emitted on the serial port. For more information, see Enabling serial-line logging for an ESX and ESXi host (344469).
The global MCA register (MCG_STATUS) reports whether an MCE is in progress, and whether the instruction pointer pushed onto the stack can be used to reliably restart program execution or is directly associated with the error.
The global capabilities (MCG_CAP) register identifies the capabilities of the machine-check architecture of the processor. The lower 8 bits specify the number of hardware-unit error-reporting banks present in a particular processor. Each bank of error-reporting registers is associated with a specific hardware unit (or group of hardware units), though the association is vendor- and model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.
Each error-reporting bank comprises several registers. Of primary interest during a machine check exception is the status register (MCi_STATUS) of the bank, which contains detailed information regarding the machine check exception, and the address (MCi_ADDR) and miscellaneous (MCi_MISC) registers, which may provide additional information.
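As a minimal sketch of the bank-count field described above, the following Python masks off the lower 8 bits of an MCG_CAP value. The function name and the sample register value are hypothetical and chosen only for illustration.

```python
def mcg_cap_bank_count(mcg_cap: int) -> int:
    """Return the number of error-reporting banks (bits 7:0 of MCG_CAP)."""
    return mcg_cap & 0xFF


# Hypothetical MCG_CAP value, not read from a real processor.
print(mcg_cap_bank_count(0x0000000001000C14))
```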
Different versions of ESXi log the machine-check architecture register contents using different formats. For more information, see Determining VMware Software Version and Build Number (320235).
Regardless of the version of ESXi, these items of information should be available:
The log message consists of one line for each bank of interest, including the physical CPU number, the text "MCA:", the error class, how the error was reported, the MCG_STATUS register (G), the bank number (B), the MCi_STATUS register (S), the MCi_ADDR register (A), the MCi_MISC register (M), the decoded system physical address and size (P) in 6.7 and later, and a human-readable interpretation of the error.
cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 Aaf9e74900 M86 Paf9e74900/40 Cache Hierarchy: Level 0 Data Cache Read Error.
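A line in this format can be split into its register fields with a regular expression. The sketch below is only an illustration based on the single example line above: the function name, the pattern, and the assumption that G, S, A, and M are hexadecimal without a 0x prefix are not taken from VMware documentation.

```python
import re

# Field order assumed from the example line: class, report method, G, B, S, A, M,
# then an optional P field (present in 6.7 and later).
MCA_FIELDS = re.compile(
    r"(?P<error_class>\S+)\s+(?P<reported>\S+)\s+"
    r"G(?P<G>[0-9a-fA-F]+)\s+B(?P<B>[0-9a-fA-F]+)\s+"
    r"S(?P<S>[0-9a-fA-F]+)\s+A(?P<A>[0-9a-fA-F]+)\s+M(?P<M>[0-9a-fA-F]+)"
    r"(?:\s+P(?P<P>[0-9a-fA-F]+/[0-9a-fA-F]+))?"
)


def parse_mca_line(line):
    """Return the register fields of an 'MCA:' log line, or None if no match."""
    if "MCA:" not in line:
        return None
    m = MCA_FIELDS.search(line)
    if m is None:
        return None
    return {
        "error_class": m["error_class"],
        "reported": m["reported"],
        "mcg_status": int(m["G"], 16),
        "bank": m["B"],          # kept as text; its numeric base is not stated
        "mci_status": int(m["S"], 16),
        "mci_addr": int(m["A"], 16),
        "mci_misc": int(m["M"], 16),
        "physical": m["P"],      # "address/size" text, or None when absent
    }


sample = ("cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 "
          "Aaf9e74900 M86 Paf9e74900/40 Cache Hierarchy: Level 0 Data Cache Read Error.")
print(parse_mca_line(sample))
```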
The error class may be one of the following:
How the error was reported may be one of the following:
The log message consists of one line for each bank of interest, including the text "MC:", the physical cpu number (PCPU), the bank number (B), the MCi_STATUS register (S), the MCi_MISC register (M), the MCi_ADDR register (A), and the MCG_STATUS register.
MC:PCPU42 B:4 S:0xbe00000000800400 M:0x41800d55315c A:0x41800d55315c 5
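The older "MC:" format can be parsed in a similar way. This is a sketch derived only from the example line above. The function name is hypothetical, and it assumes that the PCPU and bank numbers are decimal and that the trailing MCG_STATUS value is hexadecimal.

```python
import re

MC_LINE = re.compile(
    r"MC:PCPU(?P<pcpu>\d+)\s+B:(?P<bank>\d+)\s+"
    r"S:(?P<S>0x[0-9a-fA-F]+)\s+M:(?P<M>0x[0-9a-fA-F]+)\s+"
    r"A:(?P<A>0x[0-9a-fA-F]+)\s+(?P<G>[0-9a-fA-F]+)"
)


def parse_mc_line(line):
    """Return the register fields of an 'MC:' log line, or None if no match."""
    m = MC_LINE.search(line)
    if m is None:
        return None
    return {
        "pcpu": int(m["pcpu"]),          # assumed decimal
        "bank": int(m["bank"]),          # assumed decimal
        "mci_status": int(m["S"], 16),   # int() accepts the 0x prefix with base 16
        "mci_misc": int(m["M"], 16),
        "mci_addr": int(m["A"], 16),
        "mcg_status": int(m["G"], 16),   # assumed hexadecimal
    }


old_sample = "MC:PCPU42 B:4 S:0xbe00000000800400 M:0x41800d55315c A:0x41800d55315c 5"
print(parse_mc_line(old_sample))
```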
The global status register is 64 bits wide, but only the low 3 bits have meaning. The high 61 bits are reserved. The register value can be converted to binary and compared against this layout:
63:3 | 2 | 1 | 0 |
Reserved | MCIP | EIPV | RIPV |
For example, the global status register value "5" is equal to 0101 in binary. This translates to MCIP=1, EIPV=0, RIPV=1, which indicates that there is a machine check in progress, and the Restart IP is valid.
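The same conversion can be sketched in Python. The function name is hypothetical. The bit positions follow the layout given for the global status register (RIPV in bit 0, EIPV in bit 1, MCIP in bit 2).

```python
def decode_mcg_status(value: int) -> dict:
    """Decode the three meaningful low bits of MCG_STATUS."""
    return {
        "RIPV": bool(value & 0x1),  # bit 0: restart IP valid
        "EIPV": bool(value & 0x2),  # bit 1: error IP valid
        "MCIP": bool(value & 0x4),  # bit 2: machine check in progress
    }


print(format(0x5, "04b"))        # 0101
print(decode_mcg_status(0x5))
```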
Each bank's MCi_STATUS register is 64 bits wide and contains information related to a machine-check error. This information is meaningful, and is logged, only if the Valid flag (bit 63) is set.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56:32 | 31:16 | 15:0 |
VAL | OVER | UC | EN | MISCV | ADDRV | PCC | Other Information | Extended Error Code | MCA Error Code |
Bits 56:32 contain other information, which may be reserved, used for counters, or hold other information that is model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.
Bits 31:16 contain a model-specific extended error code. For more information, see the vendor documentation listed in the Additional Information section of this article.
Bits 15:0 contain the machine-check architecture-defined error code for the machine-check error condition detected. These error codes are the same for all processors that implement the machine-check architecture, though individual processor models may define additional nuance. For more information, see the vendor documentation listed in the Additional Information section of this article.
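The bank status layout described above can be split into its fields with shifts and masks. The following is a minimal sketch. The function and key names are hypothetical, and the two sample values are the status registers from the example log lines in this article.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into the fields of the layout above."""
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return {
        "VAL": bit(63),
        "OVER": bit(62),
        "UC": bit(61),
        "EN": bit(60),
        "MISCV": bit(59),
        "ADDRV": bit(58),
        "PCC": bit(57),
        "other_info": (status >> 32) & 0x1FFFFFF,        # bits 56:32 (25 bits)
        "extended_error_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }


for sample_status in (0xbf80000000000114, 0xbe00000000800400):
    fields = decode_mci_status(sample_status)
    print(hex(sample_status), fields["VAL"], hex(fields["mca_error_code"]))
```

Because the register contents are only meaningful when the Valid flag is set, a caller would normally check the VAL field before interpreting the rest.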
The machine-check architecture defines several errors which may be present in any bank's status register, grouped into Simple and Compound error codes. Identify the pattern which matches the contents of the status register.
Simple Error Codes reflect a specific fault, exactly matching the contents of the status register:
0000 0000 0000 0000 – No error has been reported to this bank.
0000 0000 0000 0001 – Unclassified. This error has not been classified into the MCA error classes. The additional information section may have meaning.
0000 0000 0000 0010 – Parity error in internal microcode ROM.
0000 0000 0000 0011 – The BINT# from another processor caused this processor to enter machine-check.
0000 0000 0000 0100 – Functional redundancy check (FRC) master/slave error.
0000 0000 0000 0101 – Internal parity error.
0000 0100 0000 0000 – Internal timer error.
0000 01xx xxxx xxxx – Internal unclassified error. At least one x equals 1.
Compound Error Codes follow a pattern and define multiple aspects of the error with a single error number:
000F 0000 0000 11LL – Generic cache hierarchy errors.
000F 0000 0001 TTLL – TLB errors.
000F 0000 1MMM CCCC – Memory controller errors (Intel-only).
000F 0001 RRRR TTLL – Memory errors in the cache hierarchy.
000F 1PPT RRRR IILL – Bus and interconnect errors.
Compound Error Code sub-fields define sections of a compound error code. Use these to populate the template defined by the compound error code:
TT (transaction type):
00 – Instruction.
01 – Data.
10 – Generic.
11 – Reserved.
LL (memory hierarchy level):
00 – Level 0.
01 – Level 1.
10 – Level 2.
11 – Generic.
MMM (memory transaction type):
000 – Generic undefined request.
001 – Memory read error.
010 – Memory write error.
011 – Address or command error.
100 – Memory scrubbing error.
101-111 – Reserved.
CCCC (channel number):
0000-1110 – Channel number.
1111 – Channel not specified.
RRRR (request):
0000 – Generic error.
0001 – Generic read.
0010 – Generic write.
0011 – Data read.
0100 – Data write.
0101 – Instruction fetch.
0110 – Prefetch.
0111 – Evict.
1000 – Snoop (probe).
PP (participation):
00 – Local node originated the request.
01 – Local node responded to the request.
10 – Local node observed error as third-party.
11 – Generic.
T (timeout):
0 – Request did not time out.
1 – Request timed out.
II (memory or I/O):
00 – Memory access.
01 – Reserved.
10 – I/O.
11 – Other.
The machine-check architecture allows for bits or groups of bits within the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers to take on additional meaning based on the processor model and the bank number. Listing the field meanings for all processor families is outside the scope of this article.
To interpret the additional contents of the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers, review the documentation for the specific processor model. For more information, see the vendor documentation listed in the Additional Information section of this article or contact the hardware vendor.
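The simple and compound error code patterns listed earlier can be matched mechanically against bits 15:0 of MCi_STATUS. The following is a minimal sketch, not a complete decoder. All names are hypothetical, the assignment of each sub-field table to its template letters is inferred from the bit widths in the templates, the F bit is returned raw without interpretation, and model-specific meanings are ignored.

```python
TT = ["Instruction", "Data", "Generic", "Reserved"]
LL = ["Level 0", "Level 1", "Level 2", "Generic"]
II = ["Memory access", "Reserved", "I/O", "Other"]
PP = ["Local node originated the request", "Local node responded to the request",
      "Local node observed error as third-party", "Generic"]
MMM = ["Generic undefined request", "Memory read error", "Memory write error",
       "Address or command error", "Memory scrubbing error"]
RRRR = ["Generic error", "Generic read", "Generic write", "Data read", "Data write",
        "Instruction fetch", "Prefetch", "Evict", "Snoop (probe)"]
SIMPLE = {
    0x0000: "No error has been reported to this bank",
    0x0001: "Unclassified",
    0x0002: "Parity error in internal microcode ROM",
    0x0003: "BINT# from another processor",
    0x0004: "Functional redundancy check (FRC) master/slave error",
    0x0005: "Internal parity error",
    0x0400: "Internal timer error",
}


def pick(table, index):
    """Look up a sub-field value; encodings past the end of a table are reserved."""
    return table[index] if index < len(table) else "Reserved"


def decode_mca_error_code(code: int) -> dict:
    """Match bits 15:0 of MCi_STATUS against the simple and compound patterns."""
    code &= 0xFFFF
    if code in SIMPLE:
        return {"kind": SIMPLE[code]}
    if (code & 0xFC00) == 0x0400:               # 0000 01xx xxxx xxxx
        return {"kind": "Internal unclassified error"}
    if code & 0xE000:                           # compound codes start with 000
        return {"kind": "Unrecognized pattern"}
    out = {"F": (code >> 12) & 1}               # F bit, reported without interpretation
    low = code & 0x0FFF
    level = LL[low & 0x3]
    if low & 0x800:                             # 1PPT RRRR IILL
        out.update(kind="Bus and interconnect error", PP=PP[(low >> 9) & 0x3],
                   T=bool(low & 0x100), RRRR=pick(RRRR, (low >> 4) & 0xF),
                   II=II[(low >> 2) & 0x3], LL=level)
    elif (low >> 8) == 0x1:                     # 0001 RRRR TTLL
        out.update(kind="Memory error in the cache hierarchy",
                   RRRR=pick(RRRR, (low >> 4) & 0xF), TT=TT[(low >> 2) & 0x3], LL=level)
    elif (low >> 7) == 0x1:                     # 0000 1MMM CCCC
        channel = low & 0xF
        out.update(kind="Memory controller error", MMM=pick(MMM, (low >> 4) & 0x7),
                   CCCC="Channel not specified" if channel == 0xF else channel)
    elif (low >> 4) == 0x1:                     # 0000 0001 TTLL
        out.update(kind="TLB error", TT=TT[(low >> 2) & 0x3], LL=level)
    elif (low >> 2) == 0x3:                     # 0000 0000 11LL
        out.update(kind="Generic cache hierarchy error", LL=level)
    else:
        out["kind"] = "Unrecognized pattern"
    return out


print(decode_mca_error_code(0x0114))
print(decode_mca_error_code(0x0400))
```

The two sample values are the low 16 bits of the status registers in the example log lines in this article. Extended error codes and the other-information bits still need the vendor documentation to interpret.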
By default, the ESXi host vmkernel writes logs to /var/log/messages. These logs can be redirected to an alternate local path or they can be redirected to a remote host. For more information, see the Basic System Administration Guide for your version of ESXi (Embedded or Installable). If you require the support logs beyond the last reboot, it may be advisable to log to both a remote disk and a remote syslogd server.
“vm-support” command in ESX/ESXi to collect diagnostic information
Collecting diagnostic information for VMware ESX/ESXi
https://www.dell.com/support/kbdoc/en-us/000215212/ax750-os-issue-psp