Error: "NMI IPI: Panic requested by another PCPU" - ESXi host PSOD preceded by corrected memory errors
Article ID: 414711


Products

VMware vSphere ESXi

Issue/Introduction

Primary PSOD Error

  • Error message: NMI IPI: Panic requested by another PCPU. PC [address], SP [address] (Src [value], CPU[number])
  • Backtrace functions:
    • VmMemPfRangeSetBackedByLPageUnmappedWork
    • VmMemPfRangeSetBackedByLPage
    • VmMemPfLockLargePageInt
    • VmMemPfLockLargePage

Pre-PSOD Symptoms: Corrected Memory Errors

On the ESXi host, the following entries appear in /var/log/vmkernel.log:

  • Multiple corrected memory errors appear before the PSOD
  • Error format: MCA: 202: CE Poll G0 B8 [status codes] Memory Controller Read Error on Channel [X]
  • Timing: Errors occur rapidly over several seconds or minutes

CPU Heartbeat Failures

  • Multiple physical CPUs stop responding
  • Warning message: WARNING: Heartbeat: [ID]: PCPU [number] didn't have a heartbeat for [X] seconds, timeout is 10, [X] IPIs sent; *may* be locked up.

Timeline and Pattern

Three-stage pattern (distinguishing characteristic):

  1. Corrected memory errors begin
  2. CPU heartbeat failures occur
  3. PSOD follows within 4-10 minutes after the memory errors start

Additional observations:

  • Host may be idle with no virtual machines running
  • System becomes unresponsive and requires a hard reset

Additional Symptoms Reported

  • PSOD on host
  • No VMs (Virtual Machines) running at the time
  • Host previously experienced a similar issue that was resolved by memory replacement

 

Environment

ESXi 7.0 or newer

Cause

The system's memory module (DIMM) is experiencing hardware degradation. Physical memory must reliably store and retrieve data without errors. When a memory module begins to fail, the memory controller detects read/write errors.

The memory controller uses ECC (Error Correcting Code) to automatically correct these errors. These corrections are logged as CE (Corrected Error) events in vmkernel.log. The message format is `MCA: 202: CE Poll` followed by `Memory Controller Read Error on Channel [X]`.

Individual corrected errors are handled transparently. However, a high rate of corrections indicates the memory module can no longer maintain data integrity. Dozens of errors occurring within seconds show the module is failing.

When corrected errors occur rapidly, the CPUs spend significant time servicing the resulting machine-check events, which delays normal processing. Heartbeat monitoring cannot complete within the expected timeframes, and the system logs `PCPU [number] didn't have a heartbeat` warnings.

During this memory instability, the VMkernel attempts memory management operations. Specifically, large page backing operations handled by VmMemPfLockLargePage and related functions fail. The system encounters errors accessing the degraded memory regions.

The system cannot safely continue operating with unstable memory during critical tasks, so it initiates an NMI (Non-Maskable Interrupt) panic to halt operations and preserve diagnostic information. The panic message `NMI IPI: Panic requested by another PCPU` reflects this protective response: one PCPU detects the problem and requests that the others stop, preventing potential data corruption while memory reliability is compromised.

Resolution

1. Review vmkernel.log for Memory Errors

  • Review the /var/log/vmkernel.log file (a symlink to /var/run/log/vmkernel.log on ESXi)

  • Identify corrected memory error entries with format: MCA: 202: CE Poll G0 B8 [status] Memory Controller Read Error on Channel [X]

2. Identify the Failing Memory Channel

  • Determine which memory channel is reporting errors by examining the channel number in error messages

  • Note if all errors occur consistently on the same channel

3. Assess Error Frequency

  • Count the frequency of corrected errors

  • Critical threshold: If you observe 10 or more CE errors within a few minutes on the same channel, this indicates imminent memory module failure
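
Steps 2 and 3 can be sketched as sed/awk one-liners that tally errors per channel and per minute. The sample file below uses the documented message format with made-up timestamps and a placeholder status field; on a host, point the commands at /var/log/vmkernel.log instead:

```shell
# Illustrative sample: three CE errors on Channel 2 across two minutes.
cat > /tmp/vmkernel.sample <<'EOF'
2024-01-01T00:00:01Z cpu4: MCA: 202: CE Poll G0 B8 [status] Memory Controller Read Error on Channel 2.
2024-01-01T00:00:03Z cpu4: MCA: 202: CE Poll G0 B8 [status] Memory Controller Read Error on Channel 2.
2024-01-01T00:01:12Z cpu4: MCA: 202: CE Poll G0 B8 [status] Memory Controller Read Error on Channel 2.
EOF

# Step 2: which channel(s) report errors, and how consistently (count per channel)
sed -n 's/.*Read Error on Channel \([0-9][0-9]*\).*/\1/p' /tmp/vmkernel.sample | sort | uniq -c

# Step 3: error count per minute (truncate the ISO timestamp to YYYY-MM-DDTHH:MM)
awk '/Memory Controller Read Error/ { print substr($1, 1, 16) }' /tmp/vmkernel.sample | sort | uniq -c
```

If every error lands on one channel and the per-minute counts approach the 10-errors-in-a-few-minutes threshold above, treat the module on that channel as failing.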

4. Collect Diagnostic Information

Gather the following for your hardware vendor:

  • vmkernel.log file showing corrected memory error messages

  • Memory channel number identified in step 2

  • VMkernel crash dump file (vmkernel-zdump) if available

  • PSOD screen text or screenshot showing the backtrace
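
One way to stage the items above for the vendor is a simple archive. The paths and file names below are examples only; `vm-support` (run with no arguments in the ESXi shell) remains the standard way to produce a complete diagnostic bundle:

```shell
# Hypothetical staging directory; real log and zdump file names vary per host.
mkdir -p /tmp/psod-diag
cp /tmp/vmkernel.sample /tmp/psod-diag/vmkernel.log 2>/dev/null \
  || echo 'sample log' > /tmp/psod-diag/vmkernel.log   # stand-in for the real log

# Record the channel number from step 2 alongside the logs.
echo 'Failing channel: 2 (example value)' > /tmp/psod-diag/notes.txt

tar -czf /tmp/psod-diag.tgz -C /tmp psod-diag
tar -tzf /tmp/psod-diag.tgz   # list the archive contents to confirm
```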

5. Contact Hardware Vendor

  • Contact your hardware vendor support with the diagnostic information collected in step 4

  • Request memory diagnostics and replacement of the failing memory module on the identified channel

6. Replace Failing Memory Module

  • Follow your hardware vendor's guidance to identify the specific DIMM slot corresponding to the reported memory channel

  • Replace the failing memory module

7. Post-Replacement Monitoring

  • After memory replacement, monitor /var/log/vmkernel.log for 48-72 hours

  • Verify no new corrected memory error messages appear
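
The post-replacement check can be scripted as a simple count that should stay at zero. A minimal sketch, using a sample log with no CE entries to simulate a healthy host; on the host, set LOG=/var/log/vmkernel.log and run it periodically during the 48-72 hour window:

```shell
# Sample log with no CE entries, simulating a healthy post-replacement host.
LOG=/tmp/vmkernel.postswap
echo '2024-01-05T10:00:00Z cpu0: some unrelated vmkernel message' > "$LOG"

# Count corrected-error entries; any nonzero count warrants vendor follow-up.
CE_COUNT=$(grep -c 'Memory Controller Read Error' "$LOG")
if [ "$CE_COUNT" -gt 0 ]; then
  echo "WARNING: $CE_COUNT corrected memory errors logged since replacement"
else
  echo "OK: no corrected memory errors"
fi
```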

Additional Information

Understanding Memory Errors

For more information about interpreting MCA (Machine Check Architecture) error messages in ESXi logs, see Decoding Machine Check Error (MCE) output after an ESXi panic (Purple Screen).


Related Memory Error Articles

This article addresses ESXi hosts that experience corrected memory errors followed by NMI IPI panic. The backtrace includes VmMemPf functions. For related scenarios, see:

Differentiating NMI IPI Panic Scenarios

This issue is distinguished by multiple corrected memory errors immediately preceding the panic. The backtrace also contains VmMemPfLockLargePage functions. Other NMI IPI panic scenarios have different root causes:

VMFS-related panics:

Memory allocation panics:

vMotion-related panics:

Replication-related panics:

When to use this article:

  • Your backtrace includes VmMemPfLockLargePage or VmMemPfRangeSetBackedByLPage functions
  • Corrected memory errors appear in vmkernel.log before the panic

If these conditions are not met, refer to the articles above for other NMI IPI panic scenarios.