Error: "NMI IPI: Panic requested by another PCPU" - ESXi host PSOD preceded by corrected memory errors
Article ID: 414711


Products

VMware vSphere ESXi

Issue/Introduction

Primary PSOD Error

  • Error message: NMI IPI: Panic requested by another PCPU. PC [address], SP [address] (Src [value], CPU[number])
  • Backtrace functions:
    • VmMemPfRangeSetBackedByLPageUnmappedWork
    • VmMemPfRangeSetBackedByLPage
    • VmMemPfLockLargePageInt
    • VmMemPfLockLargePage

Pre-PSOD Symptoms: Corrected Memory Errors

On the ESXi host, the following entries appear in /var/log/vmkernel.log:

  • Multiple corrected memory errors appear before the PSOD
  • Error format: MCA: 202: CE Poll G0 B8 [status codes] Memory Controller Read Error on Channel [X]
  • Timing: Errors occur rapidly over several seconds or minutes

CPU Heartbeat Failures

  • Multiple physical CPUs stop responding
  • Warning message: WARNING: Heartbeat: [ID]: PCPU [number] didn't have a heartbeat for [X] seconds, timeout is 10, [X] IPIs sent; *may* be locked up.

Timeline and Pattern

Three-stage pattern (distinguishing characteristic):

  1. Corrected memory errors begin
  2. CPU heartbeat failures occur
  3. PSOD follows within 4-10 minutes after the memory errors start

Additional observations:

  • Host may be idle with no virtual machines running
  • System becomes unresponsive and requires a hard reset

Additional Symptoms Reported

  • PSOD on host
  • No VMs (Virtual Machines) running at the time
  • Host previously experienced a similar issue that was resolved by memory replacement

 

Environment

ESXi 7.0 or newer

Cause

The system's memory module (DIMM) is experiencing hardware degradation. Physical memory must reliably store and retrieve data without errors. When a memory module begins to fail, the memory controller detects read/write errors.

The memory controller uses ECC (Error Correcting Code) to automatically correct these errors. These corrections are logged as CE (Corrected Error) events in vmkernel.log. The message format is `MCA: 202: CE Poll` followed by `Memory Controller Read Error on Channel [X]`.

Individual corrected errors are handled transparently. However, a high rate of corrections indicates the memory module can no longer maintain data integrity. Dozens of errors occurring within seconds show the module is failing.

When corrected errors occur rapidly, the CPUs spend significant time servicing the resulting machine-check events, which delays normal processing. Heartbeat monitoring cannot complete within the expected timeframes, and the system logs `PCPU [number] didn't have a heartbeat` warnings.

During this memory instability, the VMkernel attempts memory management operations. Specifically, large page backing operations handled by VmMemPfLockLargePage and related functions fail. The system encounters errors accessing the degraded memory regions.

The system cannot safely continue operating with unstable memory during critical tasks, so it initiates an NMI (Non-Maskable Interrupt) panic to halt operations and preserve diagnostic information. The panic message `NMI IPI: Panic requested by another PCPU` reflects this protective response: one PCPU detects the problem and requests that the others stop, preventing potential data corruption while memory reliability is compromised.

Resolution

1. Review vmkernel.log for Memory Errors

  • Review the /var/log/vmkernel.log file (a symlink to /var/run/log/vmkernel.log on ESXi)

  • Identify corrected memory error entries with format: MCA: 202: CE Poll G0 B8 [status] Memory Controller Read Error on Channel [X]

2. Identify the Failing Memory Channel

  • Determine which memory channel is reporting errors by examining the channel number in error messages

  • Note if all errors occur consistently on the same channel

3. Assess Error Frequency

  • Count the frequency of corrected errors

  • Critical threshold: If you observe 10 or more CE errors within a few minutes on the same channel, this indicates imminent memory module failure
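
Steps 2 and 3 can be sketched as sed/awk one-liners that tally errors per channel and per minute. The sample file below uses the documented message format with made-up timestamps and a placeholder status field; on a host, point the commands at /var/log/vmkernel.log instead:

```shell
# Illustrative sample: three CE errors on Channel 2 across two minutes.
cat > /tmp/vmkernel.sample <<'EOF'
2024-01-01T00:00:01Z cpu4: MCA: 202: CE Poll G0 B8 [status] Memory Controller Read Error on Channel 2.
2024-01-01T00:00:03Z cpu4: MCA: 202: CE Poll G0 B8 [status] Memory Controller Read Error on Channel 2.
2024-01-01T00:01:12Z cpu4: MCA: 202: CE Poll G0 B8 [status] Memory Controller Read Error on Channel 2.
EOF

# Step 2: which channel(s) report errors, and how consistently (count per channel)
sed -n 's/.*Read Error on Channel \([0-9][0-9]*\).*/\1/p' /tmp/vmkernel.sample | sort | uniq -c

# Step 3: error count per minute (truncate the ISO timestamp to YYYY-MM-DDTHH:MM)
awk '/Memory Controller Read Error/ { print substr($1, 1, 16) }' /tmp/vmkernel.sample | sort | uniq -c
```

If every error lands on one channel and the per-minute counts approach the 10-errors-in-a-few-minutes threshold above, treat the module on that channel as failing.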

4. Collect Diagnostic Information

Gather the following for your hardware vendor:

  • vmkernel.log file showing corrected memory error messages

  • Memory channel number identified in step 2

  • VMkernel crash dump file (vmkernel-zdump) if available

  • PSOD screen text or screenshot showing the backtrace
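
One way to stage the items above for the vendor is a simple archive. The paths and file names below are examples only; `vm-support` (run with no arguments in the ESXi shell) remains the standard way to produce a complete diagnostic bundle:

```shell
# Hypothetical staging directory; real log and zdump file names vary per host.
mkdir -p /tmp/psod-diag
cp /tmp/vmkernel.sample /tmp/psod-diag/vmkernel.log 2>/dev/null \
  || echo 'sample log' > /tmp/psod-diag/vmkernel.log   # stand-in for the real log

# Record the channel number from step 2 alongside the logs.
echo 'Failing channel: 2 (example value)' > /tmp/psod-diag/notes.txt

tar -czf /tmp/psod-diag.tgz -C /tmp psod-diag
tar -tzf /tmp/psod-diag.tgz   # list the archive contents to confirm
```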

5. Contact Hardware Vendor

  • Contact your hardware vendor support with the diagnostic information collected in step 4

  • Request memory diagnostics and replacement of the failing memory module on the identified channel

6. Replace Failing Memory Module

  • Follow your hardware vendor's guidance to identify the specific DIMM slot corresponding to the reported memory channel

  • Replace the failing memory module

7. Post-Replacement Monitoring

  • After memory replacement, monitor /var/log/vmkernel.log for 48-72 hours

  • Verify no new corrected memory error messages appear
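
The post-replacement check can be scripted as a simple count that should stay at zero. A minimal sketch, using a sample log with no CE entries to simulate a healthy host; on the host, set LOG=/var/log/vmkernel.log and run it periodically during the 48-72 hour window:

```shell
# Sample log with no CE entries, simulating a healthy post-replacement host.
LOG=/tmp/vmkernel.postswap
echo '2024-01-05T10:00:00Z cpu0: some unrelated vmkernel message' > "$LOG"

# Count corrected-error entries; any nonzero count warrants vendor follow-up.
CE_COUNT=$(grep -c 'Memory Controller Read Error' "$LOG")
if [ "$CE_COUNT" -gt 0 ]; then
  echo "WARNING: $CE_COUNT corrected memory errors logged since replacement"
else
  echo "OK: no corrected memory errors"
fi
```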

Additional Information

Understanding Memory Errors

For more information about interpreting MCA (Machine Check Architecture) error messages in ESXi logs, see Decoding Machine Check Error (MCE) output after an ESXi panic (Purple Screen).


Related Memory Error Articles

This article addresses ESXi hosts that experience corrected memory errors followed by NMI IPI panic. The backtrace includes VmMemPf functions. For related scenarios, see:

Differentiating NMI IPI Panic Scenarios

This issue is distinguished by multiple corrected memory errors immediately preceding the panic. The backtrace also contains VmMemPfLockLargePage functions. Other NMI IPI panic scenarios have different root causes:

VMFS-related panics:

Memory allocation panics:

vMotion-related panics:

Replication-related panics:

When to use this article:

  • Your backtrace includes VmMemPfLockLargePage or VmMemPfRangeSetBackedByLPage functions
  • Corrected memory errors appear in vmkernel.log before the panic

If these conditions are not met, refer to the articles above for other NMI IPI panic scenarios.