NMI Overview
A Non-Maskable Interrupt (NMI) is a hardware interrupt that cannot be ignored by the processor. These types of interrupts are usually reserved for very important tasks and to report hardware errors to the processor.
Depending on the make and model of the system, it may be possible to deliberately send an NMI to the CPUs.
Sending an NMI forces the processor to switch CPU context to the registered non-maskable interrupt handler. The interrupt cannot be ignored (masked). The operating system handles the NMI according to its prior configuration.
An intentionally triggered NMI can help to highlight:
- Whether a CPU is capable of servicing interrupts.
- Whether an operating system process or task is continuously looping on the CPU.
Note:
Some servers have a BIOS or BMC option to automatically reboot the system whenever a Non-Maskable Interrupt occurs.
If such a reboot occurs, it implies that the hardware is operating correctly, but it does not provide enough information to troubleshoot the root cause of the issue. Disable the option.
NMIs and VMware ESXi
In some cases, the ESXi host may need to generate a purple diagnostic screen and core dump so that an issue can be troubleshot further.
The VMkernel may break out of any continuously looping process on the CPU and log the NMI.
The VMkernel can be configured to respond to an NMI by generating a purple diagnostic screen. When so configured, the VMkernel handles the NMI directly and halts with a purple diagnostic screen.
- If a purple diagnostic screen is triggered, a coredump from the VMkernel is saved.
- Ensure that the ESXi host is correctly configured to capture VMkernel coredumps.
- For more information, see:
- It is possible for third-party OEM NMI drivers to intentionally initiate halting with a purple diagnostic screen upon receipt of an NMI regardless of the configured option.
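As a minimal sketch of how the core dump configuration mentioned above might be checked, the Python snippet below only builds the read-only `esxcli system coredump ... get` queries for the three dump target types; it does not run them. Which target a given host actually uses is an assumption left to the reader, and the helper name `coredump_check_commands` is hypothetical.

```python
import shlex

# Read-only esxcli queries, one per core dump target type.
# Run the resulting strings in an ESXi shell or SSH session.
COREDUMP_CHECKS = {
    "partition": ["esxcli", "system", "coredump", "partition", "get"],
    "file": ["esxcli", "system", "coredump", "file", "get"],
    "network": ["esxcli", "system", "coredump", "network", "get"],
}


def coredump_check_commands():
    """Return one shell-ready command string per dump target."""
    return {target: shlex.join(argv) for target, argv in COREDUMP_CHECKS.items()}


if __name__ == "__main__":
    for target, command in coredump_check_commands().items():
        print(f"{target}: {command}")
```

If none of the queries reports an active dump target, a purple diagnostic screen triggered by an NMI may leave no core dump to analyze.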
Configuring the ESXi VMkernel to generate a purple diagnostic screen on NMI
The VMkernel option Misc.NMILint1IntAction has 4 possible values:
- Enter debugger on hardware NMI.
- Panic on hardware NMI, halting the VMkernel with a purple diagnostic screen.
- Log and ignore (not recommended).
- Log and ignore if undiagnosed.
Note: If an ESXi host is unresponsive very early in the boot process, use the VMkernel boot option VMkernel.Boot.nmiAction instead. The default of 0 defers to the Misc.NMILint1IntAction option later in the boot process.
To configure the VMkernel to generate a purple diagnostic screen upon receiving an NMI, set the advanced option Misc.NMILint1IntAction to 2. For more information, see Configuring advanced options for ESXi (310338).
Note: The ESXi host must be rebooted for the change to take effect.
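One way to apply the setting from the command line, rather than through the client UI, is with `esxcli system settings advanced`. The sketch below builds the set and list commands in Python without executing them. The option path `/Misc/NMILint1IntAction` is assumed to be the esxcli form of the option named above, and the function names are hypothetical.

```python
import shlex

# Assumed esxcli path for the advanced option Misc.NMILint1IntAction.
OPTION_PATH = "/Misc/NMILint1IntAction"
# Value the article gives for panicking with a purple diagnostic screen.
PANIC_ON_NMI = 2


def set_nmi_action_command(value=PANIC_ON_NMI):
    """Build the esxcli argv that sets the NMI action to an integer value."""
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        raise ValueError("NMI action must be a non-negative integer")
    return ["esxcli", "system", "settings", "advanced", "set",
            "-o", OPTION_PATH, "-i", str(value)]


def show_nmi_action_command():
    """Build the esxcli argv that lists the option's current value."""
    return ["esxcli", "system", "settings", "advanced", "list",
            "-o", OPTION_PATH]


if __name__ == "__main__":
    print(shlex.join(set_nmi_action_command()))
    print(shlex.join(show_nmi_action_command()))
```

Running the list command before and after the change is a simple way to confirm the value was stored; the reboot requirement noted above still applies.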
Preparing to reproduce the issue
If an ESXi host was not configured appropriately prior to the outage, the issue must be reproduced before information about the unresponsive state can be obtained.
- Collect performance data leading up to the outage. For more information, see Using performance collection tools to gather data for fault analysis (308926).
- Record logs externally through the serial port leading up to the outage. For more information, see Enabling serial-line logging for an ESXi host (344469).
- Press Alt+F12 on the console to display the VMkernel logs on the screen. Leave these logs scrolling; they may be useful if the keyboard becomes unresponsive when the outage reoccurs.
- Determine how to send an NMI on the particular server hardware. For examples, see the Additional Information section.
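On servers whose management controller speaks IPMI, one common way to send an NMI remotely is the `chassis power diag` command of `ipmitool`, which pulses a diagnostic interrupt. The Python sketch below only assembles that command line. Whether a particular BMC honors the request is vendor-specific, and the host name, user name, and function name here are hypothetical placeholders.

```python
import shlex


def ipmi_nmi_command(bmc_host, user, interface="lanplus"):
    """Build an ipmitool argv that requests a diagnostic interrupt (NMI).

    The -E flag makes ipmitool read the password from the IPMI_PASSWORD
    environment variable, so it never appears on the command line.
    """
    if not bmc_host or not user:
        raise ValueError("bmc_host and user are required")
    return ["ipmitool", "-I", interface, "-H", bmc_host, "-U", user, "-E",
            "chassis", "power", "diag"]


if __name__ == "__main__":
    # Hypothetical BMC address and account, for illustration only.
    print(shlex.join(ipmi_nmi_command("bmc.example.com", "admin")))
```

Preparing and testing the exact command or button sequence ahead of time avoids losing time when the outage reoccurs.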
Results and next steps
At the time of the next outage, re-check the symptoms described in Determining why an ESXi host does not respond to user interaction at the console (341047) to ensure the same symptoms are observed.
If the server is completely unresponsive to keyboard input and network traffic, take a screenshot or photograph of the VMkernel logs. Check whether the VMkernel logs are continuing to scroll on the screen or whether they have frozen. When the events have been recorded, send the NMI by pressing the NMI button on the physical server or by using the remote hardware management interface.
At this point, the server displays one of these symptoms:
- The VMware ESXi host continues to be unresponsive and nothing is logged.
The hardware is completely unresponsive and does not react in any way to the NMI despite configuring the operating system software to respond accordingly. Engage the hardware vendor and consider using vendor-suggested hardware diagnostic software to run intensive hardware diagnostics for a prolonged period of time. If the hardware vendor does not suggest software, consider using the open-source MemTest86+.
- The VMware ESXi host abruptly reboots.
The hardware was able to service the interrupt, but may have initiated the restart itself. Some servers have a BIOS option to automatically reboot the system whenever a Non-Maskable Interrupt occurs. This implies the hardware may be operating correctly but does not provide enough information to proceed. Disable the BIOS option and repeat the test.
- The VMware ESXi host logs NMI-related events but becomes unresponsive again.
The hardware is responsive and the ESXi kernel was capable of handling the interrupt and logging the event. This usually indicates a software issue as the cause. Although unlikely, this may occur if a driver or other process was stuck in an operational instruction loop. Review the VMkernel logs for Lint N or NMI events and any logs leading up to the outage. If the specific error has not been documented within the Knowledge Base, collect diagnostic information from the ESXi host and file a Support Request. For more information, see Collecting diagnostic information for VMware products (367431) and How to File a Support Request (142884).
- The VMware ESXi host logs NMI-related events and becomes responsive.
The hardware is responsive and the ESXi kernel was capable of handling the interrupt. This usually indicates a software issue as the cause. Although unlikely, this may occur if a driver or other process was stuck in an operational instruction loop. Review the VMkernel logs for Lint N or NMI events and any logs leading up to the outage. If the specific error has not been documented within the Knowledge Base, collect diagnostic information from the ESXi host and file a Support Request. For more information, see Collecting diagnostic information for VMware products (367431) and How to File a Support Request (142884).
- The VMware ESXi host displays a purple diagnostic screen on the console.
The hardware is responsive and the ESXi kernel was capable of handling the interrupt. This usually indicates a software issue as the cause. When the purple diagnostic screen displays Disk dump successful towards the bottom of its output, take a screenshot or photograph and reboot the host. If the error has not been documented within the Knowledge Base, collect diagnostic information from the ESXi host including the core dump files, and submit a Support Request.
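The five outcomes above amount to a small decision table. The Python sketch below encodes it as a lookup from the observed symptom to the likely fault domain and the suggested next step. The enum and function names are hypothetical, and the wording of each entry is a condensed paraphrase of the text, not an official classification.

```python
from enum import Enum


class Outcome(Enum):
    NO_RESPONSE = "host stays unresponsive, nothing logged"
    REBOOT = "host abruptly reboots"
    LOGGED_STILL_HUNG = "NMI logged, host unresponsive again"
    LOGGED_RECOVERED = "NMI logged, host becomes responsive"
    PURPLE_SCREEN = "purple diagnostic screen on console"


# Maps each outcome to (likely fault domain, suggested next step).
NEXT_STEP = {
    Outcome.NO_RESPONSE: (
        "hardware",
        "Engage the hardware vendor and run prolonged hardware diagnostics.",
    ),
    Outcome.REBOOT: (
        "inconclusive",
        "Disable the BIOS reboot-on-NMI option and repeat the test.",
    ),
    Outcome.LOGGED_STILL_HUNG: (
        "software",
        "Review VMkernel logs for NMI events; collect diagnostics and file a Support Request.",
    ),
    Outcome.LOGGED_RECOVERED: (
        "software",
        "Review VMkernel logs for NMI events; collect diagnostics and file a Support Request.",
    ),
    Outcome.PURPLE_SCREEN: (
        "software",
        "Photograph the screen once the disk dump succeeds, reboot, and collect the core dump.",
    ),
}


def triage(outcome):
    """Return (likely fault domain, next step) for an observed outcome."""
    return NEXT_STEP[outcome]


if __name__ == "__main__":
    for outcome in Outcome:
        domain, step = triage(outcome)
        print(f"{outcome.value}: {domain} -> {step}")
```

Keeping the table as data makes it easy to print as a checklist for whoever is standing at the console during the next outage.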