ESXi host hangs when DIMMs emit a correctable ECC error storm

Article ID: 441951

Products

VMware vSphere ESXi

Issue/Introduction

  • An ESXi host suddenly stops responding to vCenter and is reported as "Not Responding" or "Disconnected."

  • The DCUI on the local console no longer accepts keyboard input, and the display does not refresh.

  • The host cannot be rebooted from the management UI and recovers only after a full power cycle through the server's baseboard management controller (BMC).

  • A Manual NMI sent from the hardware management interface either reports success without effect on the host, or does not produce a diagnostic memory dump on the ESXi side.

  • In /var/run/log/vmkernel.log, you see no Machine Check Exception (MCE), Corrected Machine Check Interrupt (CMCI), memory error, or panic entries in the period leading up to the hang.

  • In /var/run/log/vmksummary.log, you see hourly heartbeat lines stop at a single point in time and resume only after a manual power cycle, similar to:

    <date>T23:00:00.000Z In(14) heartbeat[XXXXXXX]: up XXdXXhXXmXXs, X VMs; ...
    <later-date>T15:59:09.000Z No(13) bootstop[XXXXXXX]: Host has booted
    
  • In /var/run/log/hostd.log and other userspace service logs, the last entries from the previous boot are routine activity with no error spike preceding the silence.

  • On the boot-time ApeiHEST line in /var/run/log/vmkernel.log, you see corrected machine check delivery is not allocated to the operating system, similar to:

    ApeiHEST: 984: GESB lists: 1 NMI, 0 MCE, 0 CMC, 0 DMC, 1 SCI, 0 VMWPR
    

    The 0 CMC value is the key indicator: it shows that the platform firmware is not routing corrected memory errors to the operating system as CMCI events.

  • In the server vendor's BMC log (collected through whichever hardware support log bundle the server vendor provides), you see a burst of correctable ECC events attributed to a single DIMM in the minutes preceding the host hang. The exact format varies by vendor but the structure is consistent: a timestamp, a slot identifier, an event type indicating "correctable" or "corrected" memory error, and an error count. A representative example:

    <date> 19:54:05 | Memory <slot-identifier> | read 50 correctable ECC errors on CPU<N> DIMM <slot>
    <date> 19:57:12 | Memory <slot-identifier> | read 925 correctable ECC errors on CPU<N> DIMM <slot>
    
  • In the server vendor's per-DIMM error counter database (where the vendor provides one), you see a non-zero correctable ECC count on exactly one DIMM slot, with every other populated slot reporting zero.
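
The single-DIMM pattern described in the last two items can be checked mechanically. The following Python sketch is illustrative only: the regular expression is modeled on the representative BMC line shown above, the slot labels and dates in the sample are made-up placeholders, and real vendor logs will need their own pattern.

```python
import re
from collections import defaultdict

# Assumed line shape, modeled on the representative example in this article.
# Real vendor logs differ and need their own pattern.
ECC_LINE = re.compile(
    r"read\s+(?P<count>\d+)\s+correctable ECC errors on\s+"
    r"(?P<dimm>CPU\S+\s+DIMM\s+\S+)"
)

def correctable_counts(lines):
    """Sum correctable ECC error counts per DIMM label.

    Counts are treated as per-event values. If a vendor reports running
    totals instead, keep the last value per DIMM rather than summing.
    """
    totals = defaultdict(int)
    for line in lines:
        match = ECC_LINE.search(line)
        if match:
            totals[match.group("dimm")] += int(match.group("count"))
    return dict(totals)

def single_dimm_pattern(totals):
    """Return the DIMM label if exactly one DIMM has a non-zero count, else None."""
    nonzero = [dimm for dimm, count in totals.items() if count > 0]
    return nonzero[0] if len(nonzero) == 1 else None

# Placeholder sample: the date and slot names are hypothetical.
sample = [
    "2024-03-01 19:54:05 | Memory M1 | read 50 correctable ECC errors on CPU1 DIMM A1",
    "2024-03-01 19:57:12 | Memory M1 | read 925 correctable ECC errors on CPU1 DIMM A1",
]
totals = correctable_counts(sample)
print(totals)
print(single_dimm_pattern(totals))
```

A return value of None from single_dimm_pattern means either no correctable errors were found or more than one DIMM reported them. In the second case the single-DIMM failure pattern described in this article does not match.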

Additional symptoms reported:

  • The ESXi host appears "hung" with no input accepted on the keyboard.
  • The server's BMC shows the host as healthy or "OK" in the hardware management UI even while the ESXi host is unresponsive in vCenter.
  • Power-cycling the host brings it back to a healthy state.
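
The heartbeat gap in vmksummary.log described above can be located with a short script. This is a minimal sketch, assuming each heartbeat line starts with an ISO 8601 timestamp ending in Z, as in the example lines in this article. The one-hour interval comes from the article. The five-minute slack, the function names, and the sample dates and bracketed IDs are illustrative assumptions.

```python
import re
from datetime import datetime, timedelta

# Matches "<timestamp> <level> heartbeat[" at the start of a line.
HEARTBEAT = re.compile(r"^(\S+)\s+\S+\s+heartbeat\[")

def parse_ts(stamp):
    """Parse an ISO 8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))

def heartbeat_gaps(lines, expected=timedelta(hours=1), slack=timedelta(minutes=5)):
    """Yield (last_seen, next_seen) pairs where consecutive heartbeats
    are further apart than the expected interval plus slack."""
    previous = None
    for line in lines:
        match = HEARTBEAT.match(line.strip())
        if not match:
            continue  # bootstop and other non-heartbeat lines are ignored
        current = parse_ts(match.group(1))
        if previous is not None and current - previous > expected + slack:
            yield previous, current
        previous = current

# Placeholder sample: dates, IDs, and uptime values are hypothetical.
sample = [
    "2024-03-01T22:00:00.000Z In(14) heartbeat[1000001]: up 10d00h00m00s, 3 VMs; ...",
    "2024-03-01T23:00:00.000Z In(14) heartbeat[1000001]: up 10d01h00m00s, 3 VMs; ...",
    "2024-03-02T15:59:09.000Z No(13) bootstop[1000002]: Host has booted",
    "2024-03-02T16:00:00.000Z In(14) heartbeat[1000002]: up 0d00h00m51s, 0 VMs; ...",
]
for last_seen, next_seen in heartbeat_gaps(sample):
    print("heartbeats stopped after", last_seen, "and resumed at", next_seen)
```

The first timestamp of each reported pair is the last heartbeat before the hang. It bounds the time window to search in the BMC log for correctable ECC events.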

Environment

  • VMware ESXi 7.0.x and 8.0.x
  • ESXi hosts on server platforms where corrected memory errors are handled by the platform BIOS through System Management Interrupts (SMI) rather than surfaced to the operating system as Corrected Machine Check Interrupts (CMCI). This routing policy is indicated by 0 CMC in the boot-time ApeiHEST line in vmkernel.log.
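
To check the routing policy across several hosts, the boot-time ApeiHEST line can be parsed programmatically. The sketch below assumes the line has the same shape as the example in this article. The function names are illustrative and not part of any VMware tool.

```python
import re

def gesb_counts(line):
    """Return a dict such as {'NMI': 1, 'MCE': 0, ...} parsed from an
    ApeiHEST 'GESB lists' line, or None if the line is not one."""
    if "ApeiHEST" not in line or "GESB lists:" not in line:
        return None
    tail = line.split("GESB lists:", 1)[1]
    return {name: int(num) for num, name in re.findall(r"(\d+)\s+([A-Z]+)", tail)}

def cmci_hidden_from_os(line):
    """True when the line reports 0 CMC, meaning corrected errors are
    not delivered to the operating system as CMCI events."""
    counts = gesb_counts(line)
    return counts is not None and counts.get("CMC") == 0

line = "ApeiHEST: 984: GESB lists: 1 NMI, 0 MCE, 0 CMC, 0 DMC, 1 SCI, 0 VMWPR"
print(gesb_counts(line))
print(cmci_hidden_from_os(line))
```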

Cause

A single DIMM begins to emit single-bit memory errors at a sustained high rate. Each error is corrected by ECC at the hardware level, so no data is lost, but each correction generates a notification that the platform BIOS handles inside a System Management Interrupt (SMI). On platforms where corrected memory errors are not surfaced to the operating system as a CMCI event, the ESXi VMkernel has no visibility into the error stream and does not log it.

When the error rate is low, the SMI overhead is negligible. When the rate grows into the hundreds of errors per second, the affected CPU spends nearly all of its cycles inside the SMI handler, leaving the ESXi VMkernel and userspace services unable to make forward progress. The host appears hung to vCenter, the console freezes, and management agents stop responding.
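
The relationship between error rate and lost CPU time can be illustrated with simple arithmetic: the fraction of time spent in the handler is roughly the error rate multiplied by the time each SMI takes. The per-SMI duration below is a hypothetical value chosen only for illustration. This article does not state a measured handler latency, and actual values depend on the platform firmware.

```python
def smi_cpu_fraction(errors_per_second, smi_seconds_per_error):
    """Approximate fraction of one CPU's time spent in the SMI handler,
    capped at 1.0 (the CPU cannot spend more than all of its time there)."""
    return min(1.0, errors_per_second * smi_seconds_per_error)

# Hypothetical assumption: 3 ms of SMI handling per corrected error.
ASSUMED_SMI_SECONDS = 0.003

for rate in (1, 10, 100, 300):
    fraction = smi_cpu_fraction(rate, ASSUMED_SMI_SECONDS)
    print(f"{rate:>4} errors/s -> {fraction:.1%} of CPU time in SMI")
```

Under this assumed latency, a few errors per second cost well under one percent of CPU time, while a few hundred per second leave almost nothing for the VMkernel.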

Because the CPU is held in System Management Mode, it also cannot dispatch an NMI handler from the operating system. This is why a Manual NMI sent from the BMC does not produce a diagnostic dump even on hosts that are otherwise configured to panic on NMI.

The condition does not appear in vmkernel.log because the corrected errors never reach the kernel. It does appear in the server vendor's BMC logs, which are therefore the source of the diagnostic evidence.

Resolution

  1. Collect the server vendor's hardware support log bundle from the affected host. The BMC is able to collect this bundle even when the ESXi host operating system is hung. Common examples include Cisco Intersight server tech-support, Dell iDRAC SupportAssist Collection, HPE Active Health System (AHS) log, and Lenovo XClarity Controller First Failure Data Capture (FFDC). On platforms without a vendor-specific collector, a raw IPMI System Event Log (SEL) export through ipmitool sel list or the vendor's equivalent is acceptable.

  2. Open the SEL or memory-error log inside the vendor bundle. Look for events of type "correctable ECC error," "corrected memory error," or the vendor's equivalent phrasing, attributed to a memory device. Record the slot designation (for example P1_F1, DIMM_A1, or Proc 1 DIMM 1, depending on vendor naming convention), the timestamps, and the per-event error counts.

  3. Open the vendor's per-DIMM error counter database, if available. The principle is the same across vendors: a single-DIMM failure pattern shows a non-zero correctable error count on one slot with every other populated slot at zero. If multiple slots show non-zero error counts, this article does not apply, and a memory-controller or motherboard fault should be suspected instead.

  4. Map the failing slot to its physical part using the ESXi host's SMBIOS dump (commands/smbiosDump.txt inside the ESXi support bundle). Locate the Memory Device (Type 17) entry where Location matches the failing slot, and record the Part Number, Serial, and Asset Tag fields. These identify the physical DIMM for the hardware replacement.

  5. If the host is still hung, perform a cold power cycle through the BMC to return it to service. The cold boot re-runs DIMM training and resets the BIOS error counters, which typically allows the host to come back up cleanly even with the failing DIMM still installed.

  6. Open a hardware service case with the server vendor's support organization to replace the failing DIMM. Provide:

    • The SEL or memory-error log entries from step 2
    • The per-DIMM error counter summary from step 3, if available
    • The DIMM part number and serial number from step 4

  7. Until the DIMM has been replaced, do not place production workload on the affected host. The failing module is still installed, and the same failure mode can recur at any time.

  8. After the DIMM has been replaced, set the advanced setting Misc.NMILint1IntAction to 1 (Panic) on the affected host. With the default value of 0, ESXi takes no action on receipt of an NMI, so a future NMI from the BMC cannot produce a diagnostic dump even on a healthy host. With the value 1, ESXi panics and writes a memory dump when an NMI is received.

    On platforms where corrected errors are handled by the BIOS through SMI, the host CPU may still be unable to service an NMI handler if it is already saturated. The setting removes one of the two reasons an NMI may produce no dump (ESXi taking no action on the NMI) but does not guarantee a dump when the CPU is held in System Management Mode.

  9. Review the server vendor's automatic DIMM isolation feature with the vendor's support organization. Most current-generation server platforms expose some form of policy that allows the BMC to disable a DIMM with a high correctable error count at the next boot, which prevents recurrence on the same module without further operator intervention. The feature name varies by vendor (for example, Cisco UCS exposes "DIMM block-listing" through the service profile; other vendors use names such as "Memory Page Retire," "Bank Page Retire," "Reliable Memory," or "Advanced ECC with Page Retire"). The current policy state can be confirmed with the server vendor.
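
The slot-to-part lookup in step 4 can be scripted when several hosts need to be checked. The sketch below assumes that smbiosDump.txt lists each Type 17 entry under a "Memory Device:" header followed by indented "Key: value" lines. That layout, the function names, and all sample values are assumptions for illustration; verify them against an actual dump before relying on the output.

```python
def memory_devices(dump_text):
    """Collect the 'Key: value' fields of every 'Memory Device:' block."""
    devices, current, header_indent = [], None, 0
    for raw in dump_text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        text = raw.strip()
        if text.startswith("Memory Device:"):
            # Start a new block and remember how far its header is indented.
            current, header_indent = {}, indent
            devices.append(current)
        elif current is not None and indent > header_indent and ":" in text:
            key, value = text.split(":", 1)
            current[key.strip()] = value.strip().strip('"')
        elif indent <= header_indent:
            # A line at header level or shallower ends the current block.
            current = None
    return devices

def find_dimm(dump_text, slot):
    """Return the identifying fields of the DIMM whose Location equals slot."""
    for device in memory_devices(dump_text):
        if device.get("Location") == slot:
            return {key: device.get(key) for key in ("Part Number", "Serial", "Asset Tag")}
    return None

# Placeholder sample: every value below is hypothetical.
sample = '''
  Memory Device: #1
    Location: "DIMM_A1"
    Serial: "SN-EXAMPLE-1"
    Asset Tag: "TAG-EXAMPLE-1"
    Part Number: "PN-EXAMPLE-1"
  Memory Device: #2
    Location: "DIMM_B1"
    Serial: "SN-EXAMPLE-2"
    Asset Tag: "TAG-EXAMPLE-2"
    Part Number: "PN-EXAMPLE-2"
'''
print(find_dimm(sample, "DIMM_A1"))
```

The slot string passed to find_dimm must match the Location value exactly. BMC logs and SMBIOS may name the same slot differently, so confirm the mapping with the server vendor's documentation.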

Additional Information

  • For an ESXi host hang with the same surface symptoms but a different software-side cause, see ESXi host becomes unresponsive and unrebootable - requires cold boot. That article applies when the trigger is the vmsyslogd signal-handler race. This article applies when the trigger is a hardware memory-error storm visible in the server vendor's BMC logs.

  • For instructions on collecting an ESXi host support bundle for log review, see Collecting diagnostic information for VMware products.

  • For instructions on uploading collected logs to a Broadcom support case, see Uploading files to cases.

  • To improve diagnostic coverage in advance of a similar incident, set Misc.NMILint1IntAction = 1 on each ESXi host so that any future NMI from the BMC has the opportunity to produce a memory dump. Consider applying the setting through a host profile for fleet consistency.

  • The presence of 0 CMC in the boot-time ApeiHEST line in vmkernel.log is the platform-level signal that corrected memory errors are not visible to ESXi on the server in question. On platforms where this is the case, the server vendor's BMC logs are the authoritative source for corrected memory error data and should be reviewed in any unexplained ESXi hang investigation. On platforms where the value is non-zero (for example 1 CMC), the OS does receive corrected error notifications and vmkernel.log would normally contain MCE or CMCI entries; this article does not apply to such platforms.