This article provides information about using Non-Maskable Interrupt (NMI) facilities to troubleshoot unresponsive VMware ESXi hosts.
Caution: The following process is designed to force the ESXi host to halt with a purple diagnostic screen. If the ESXi host is responding sufficiently to run virtual machines, triggering a purple diagnostic screen following this process abruptly powers down all the virtual machines running on this ESXi host.
A Non-Maskable Interrupt (NMI) is a hardware interrupt that cannot be ignored by the processor. These types of interrupts are usually reserved for very important tasks and to report hardware errors to the processor.
Depending on the make and model of the system, it may be possible to deliberately send an NMI to the CPUs.
Sending an NMI to the processor forces it to switch CPU context to the registered non-maskable interrupt handler. The interrupt cannot be ignored (masked). The operating system then handles the NMI according to its prior configuration.
An intentionally triggered NMI can help to highlight:
Note:
Some servers have a BIOS or BMC option to automatically reboot the system whenever a Non-Maskable Interrupt occurs.
If such a reboot occurs, it implies that the hardware is operating correctly, but it does not provide enough information to troubleshoot the root cause of the issue. Disable the option so that the NMI reaches the VMkernel instead of rebooting the system.
In some cases, it may be required for the ESXi host to generate a purple diagnostic screen and core dump to further troubleshoot an issue.
The VMkernel may break out of any continuously looping process on the CPU and log the NMI.
As each kernel receives the NMI, it can be configured to respond to an NMI by generating a purple diagnostic screen.
The VMkernel handles an NMI directly and generates a purple diagnostic screen.
The VMkernel option Misc.NMILint1IntAction has 4 possible values:
Note: If an ESXi host becomes unresponsive very early in the boot process, use the VMkernel boot option VMkernel.Boot.nmiAction instead. Its default value of 0 defers to the Misc.NMILint1IntAction option later in the boot process.
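The current value of the boot option can be inspected from the ESXi Shell before it is changed. The following is a minimal sketch, assuming that VMkernel.Boot.nmiAction is exposed to esxcli as the kernel setting nmiAction; the run helper and the DRY_RUN variable are hypothetical names introduced here so that the command is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: inspect the early-boot NMI setting from the ESXi Shell.
# Assumption: VMkernel.Boot.nmiAction maps to the esxcli kernel
# setting "nmiAction". Verify the name on the host before relying on it.

SETTING="nmiAction"

# Hypothetical helper: with DRY_RUN=1 (the default here) the command
# is printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List the kernel setting so its current values can be reviewed.
run esxcli system settings kernel list -o "$SETTING"
```

Running the script with DRY_RUN=0 on the host executes the command. The sketch deliberately does not set a value.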
To configure the VMkernel to generate a purple diagnostic screen upon receiving an NMI, set the advanced option Misc.NMILint1IntAction to 2. For more information, see Configuring advanced options for ESXi (310338).
Note: The ESXi host must be rebooted for the change to take effect.
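The setting described above can also be applied from the ESXi Shell or an SSH session. The following is a minimal sketch, assuming the advanced option is addressed in esxcli as /Misc/NMILint1IntAction; the run helper and the DRY_RUN variable are hypothetical names used here so that the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: configure the VMkernel to raise a purple diagnostic screen
# when it receives an NMI. Intended for the ESXi Shell or SSH.

OPTION="/Misc/NMILint1IntAction"   # assumed esxcli path for Misc.NMILint1IntAction
VALUE=2                            # value given in this article

# Hypothetical helper: with DRY_RUN=1 (the default here) the command
# is printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show the option before the change, set it, then show it again.
run esxcli system settings advanced list -o "$OPTION"
run esxcli system settings advanced set -o "$OPTION" -i "$VALUE"
run esxcli system settings advanced list -o "$OPTION"
```

Listing the option before and after the change makes it easy to confirm that the new value was accepted before the host is rebooted.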
If an ESXi host was not configured appropriately prior to the outage, the issue must be reproduced before information about the unresponsive state is obtained.
At the time of the next outage, re-check the symptoms described in Determining why an ESXi host does not respond to user interaction at the console (341047) to ensure the same symptoms are observed.
If the server is completely unresponsive to keyboard input and network traffic, take a screenshot or photograph of the VMkernel logs. Check whether the VMkernel logs are continuing to scroll on the screen or whether they have frozen. When the events have been recorded, press the NMI button on the physical server or through the remote hardware management interface.
At this point, the server displays one of these symptoms:
The location of the NMI button or switch varies depending on the hardware. A small set of examples is available:
ipmitool -I lan -H <RemoteServerBMCAddress> -U <Username> -a chassis power diag
For information on how to trigger the NMI for a particular server system, consult the hardware vendor.
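Where the BMC supports IPMI over LAN, the ipmitool command shown above can be wrapped in a short script that first confirms the BMC is reachable. The following is a sketch only: the BMC address and user name are placeholder values, and the run and ipmi helpers and the DRY_RUN variable are hypothetical names. By default the commands are printed, not sent.

```shell
#!/bin/sh
# Sketch: send a diagnostic interrupt (NMI) to a server through its BMC.
# WARNING: on a configured ESXi host this halts the host with a purple
# diagnostic screen and powers down all running virtual machines.

BMC_HOST="${BMC_HOST:-bmc.example.com}"   # placeholder BMC address
BMC_USER="${BMC_USER:-admin}"             # placeholder BMC user name

# Hypothetical helper: with DRY_RUN=1 (the default here) the command
# is printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the ipmitool invocation from this article.
# -a makes ipmitool prompt for the password.
ipmi() {
    run ipmitool -I lan -H "$BMC_HOST" -U "$BMC_USER" -a "$@"
}

ipmi chassis status       # confirm the BMC answers before sending the NMI
ipmi chassis power diag   # pulse the diagnostic interrupt (NMI)
```

Some BMCs accept only IPMI v2.0 sessions, in which case the interface option may need to differ from the one shown; consult the hardware vendor's documentation.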