Using hardware NMI facilities to troubleshoot unresponsive hosts

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information about using Non-Maskable Interrupt (NMI) facilities to troubleshoot unresponsive VMware ESXi hosts.

Follow this process if an ESXi host does not respond to interaction at the console or through the network, and none of the hosted virtual machines respond to remote communication.
- For more information on these scenarios, see Determining why an ESXi host does not respond to user interaction at the console (341047).
- If an ESXi host is responsive at the console but not manageable remotely, see the following:
  - ESXi hosts do not respond and is grayed out in vCenter (345273)
  - Troubleshooting an ESXi host in a "not responding" state (344682)

Caution: The following process is designed to force the ESXi host to halt with a purple diagnostic screen. If the ESXi host is still responsive enough to run virtual machines, triggering a purple diagnostic screen with this process abruptly powers off all virtual machines running on the host.

If an NMI of unknown origin occurs, see "LINT1 motherboard interrupt" error in an ESXi host (333947).

Environment

VMware vSphere ESXi 7.0.x
VMware vSphere ESXi 8.0.x

Resolution

NMI Overview

A Non-Maskable Interrupt (NMI) is a hardware interrupt that cannot be ignored by the processor. These types of interrupts are usually reserved for very important tasks and to report hardware errors to the processor.

Depending on the make and model of the system, it may be possible to deliberately send an NMI to the CPUs.
Sending an NMI forces the processor to switch CPU context to the registered non-maskable interrupt handler. The interrupt cannot be ignored (masked). The operating system then handles the NMI according to its prior configuration.

An intentionally triggered NMI can help to highlight:

- Whether a CPU is capable of servicing interrupts.
- Whether an operating system process or task is continuously looping on the CPU.

Note:
Some servers have a BIOS or BMC option to automatically reboot the system whenever a Non-Maskable Interrupt occurs.
If such a reboot occurs, it implies that the hardware is operating correctly, but it does not provide enough information to troubleshoot the root cause of the issue. Disable this option before triggering an NMI.

NMIs and VMware ESXi

In some cases, it may be necessary to have the ESXi host generate a purple diagnostic screen and core dump to further troubleshoot an issue.

When the VMkernel receives an NMI, it may break out of any continuously looping process on the CPU and log the NMI.
The VMkernel handles the NMI directly and can be configured to respond by generating a purple diagnostic screen.

If a purple diagnostic screen is triggered, a coredump from the VMkernel is saved.
Ensure that the ESXi host is correctly configured to capture VMkernel coredumps.
- For more information, see:
  - Configuring an ESXi host to capture a VMkernel coredump from a purple diagnostic screen (319635)
  - Configuring a diagnostic coredump partition on an ESXi host (319492)

Third-party OEM NMI drivers may intentionally halt the host with a purple diagnostic screen upon receipt of an NMI, regardless of the configured option.
- For more information, see Understanding the message: Panic requested by one or more 3rd party NMI handlers (310860).
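
Before relying on an NMI-triggered purple diagnostic screen, the coredump configuration described above can be spot-checked from the ESXi Shell. The following is a minimal sketch, assuming shell or SSH access to the host. It uses the esxcli system coredump namespace to print the active and configured dump targets; the exact output varies by release, and the guard only exists so the block does nothing harmful when run somewhere other than an ESXi host.

```shell
# Coredump target types to inspect: a diagnostic partition or a dump file.
TARGETS="partition file"

if command -v esxcli >/dev/null 2>&1; then
    for target in $TARGETS; do
        echo "== coredump $target =="
        # 'get' prints the active and configured target of this type.
        esxcli system coredump "$target" get
    done
else
    echo "esxcli not found: run this in the ESXi Shell or over SSH" >&2
fi
```

If neither target reports an active location, a purple diagnostic screen may not leave a usable core dump, so it is worth resolving that first using the articles listed above.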

Configuring the ESXi VMkernel to generate a purple diagnostic screen on NMI

The VMkernel option Misc.NMILint1IntAction has 4 possible values:

1. Enter debugger on hardware NMI.
2. Panic on hardware NMI, halting the VMkernel with a purple diagnostic screen.
3. Log and ignore (not recommended).
4. Log and ignore if undiagnosed.

Note: If an ESXi host becomes unresponsive very early in the boot process, use the VMkernel boot option VMkernel.Boot.nmiAction instead. Its default value of 0 defers to the Misc.NMILint1IntAction option later in the boot process.

To configure the VMkernel to generate a purple diagnostic screen upon receiving an NMI, set the advanced option Misc.NMILint1IntAction to 2. For more information, see Configuring advanced options for ESXi (310338).

Note: The ESXi host must be rebooted for the change to take effect.
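
As an alternative to the vSphere Client, the option can be set from the ESXi Shell. This is a minimal sketch, assuming that Misc.NMILint1IntAction is exposed to esxcli under the path /Misc/NMILint1IntAction (the usual mapping for advanced options) and that it takes an integer value. Verify the path on the host before relying on it.

```shell
OPTION=/Misc/NMILint1IntAction   # assumed esxcli path for Misc.NMILint1IntAction
VALUE=2                          # 2 = panic with a purple diagnostic screen, per this article

if command -v esxcli >/dev/null 2>&1; then
    # Apply the new value.
    esxcli system settings advanced set -o "$OPTION" -i "$VALUE"
    # Print the option so the stored value can be confirmed.
    esxcli system settings advanced list -o "$OPTION"
else
    echo "esxcli not found: run this in the ESXi Shell or over SSH" >&2
fi
```

The list command shows the current and default values, which is a quick way to confirm the setting before scheduling the reboot.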

Preparing to reproduce the issue

If an ESXi host was not configured appropriately prior to the outage, the issue must be reproduced before information about the unresponsive state can be obtained.

- Collect performance data leading up to the outage. For more information, see Using performance collection tools to gather data for fault analysis (308926).
- Press Alt+F12 on the console to display the VMkernel logs on the screen. Leave these logs scrolling; they may be useful if the keyboard becomes unresponsive when the outage recurs.
- Determine how to send an NMI on the specific server hardware. For examples, see the Additional Information section.

Results and next steps

At the time of the next outage, re-check the symptoms described in Determining why an ESXi host does not respond to user interaction at the console (341047) to ensure the same symptoms are observed.

If the server is completely unresponsive to keyboard input and network traffic, take a screenshot or photograph of the VMkernel logs. Check whether the VMkernel logs are continuing to scroll on the screen or whether they have frozen. When the events have been recorded, press the NMI button on the physical server or send an NMI through the remote hardware management interface.

At this point, the server displays one of these symptoms:

- The VMware ESXi host continues to be unresponsive and nothing is logged.
  The hardware is completely unresponsive and does not react in any way to the NMI, even though the operating system was configured to respond to it. Engage the hardware vendor and consider running vendor-suggested hardware diagnostic software for a prolonged period of time. If the hardware vendor does not suggest software, consider using the open-source MemTest86+.

- The VMware ESXi host abruptly reboots.
  The hardware was able to service the interrupt, but may have initiated the restart itself. Some servers have a BIOS option to automatically reboot the system whenever a Non-Maskable Interrupt occurs. This implies the hardware may be operating correctly but does not provide enough information to proceed. Disable the BIOS option and repeat the test.

- The VMware ESXi host logs NMI-related events but becomes unresponsive again.
  The hardware is responsive and the ESXi kernel was capable of handling the interrupt and logging the event. This usually indicates a software issue as the cause. Although unlikely, this may occur if a driver or other process was stuck in an operational instruction loop. Review the VMkernel logs for Lint N or NMI events and any logs leading up to the outage. If the specific error has not been documented within the Knowledge Base, collect diagnostic information from the ESXi host and file a Support Request. For more information, see Collecting diagnostic information for VMware products (367431) and How to File a Support Request (142884).

- The VMware ESXi host logs NMI-related events and becomes responsive.
  The hardware is responsive and the ESXi kernel was capable of handling the interrupt. This usually indicates a software issue as the cause. Although unlikely, this may occur if a driver or other process was stuck in an operational instruction loop. Review the VMkernel logs for Lint N or NMI events and any logs leading up to the outage. If the specific error has not been documented within the Knowledge Base, collect diagnostic information from the ESXi host and file a Support Request. For more information, see Collecting diagnostic information for VMware products (367431) and How to File a Support Request (142884).

- The VMware ESXi host displays a purple diagnostic screen on the console.
  The hardware is responsive and the ESXi kernel was capable of handling the interrupt. This usually indicates a software issue as the cause. When the purple diagnostic screen displays Disk dump successful towards the bottom of its output, take a screenshot or photograph and reboot the host. If the error has not been documented within the Knowledge Base, collect diagnostic information from the ESXi host, including the core dump files, and submit a Support Request.
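
Where the outcomes above call for reviewing the VMkernel logs, a simple text search can narrow the output once the host is reachable again. This is a minimal sketch, assuming the log is at /var/log/vmkernel.log on the host; nmi_events and VMKERNEL_LOG are hypothetical names introduced here for illustration. The pattern is deliberately broad and may match unrelated lines.

```shell
# Default log location on an ESXi host. Pass another path as the first
# argument to search a copy of the log, for example from a support bundle.
VMKERNEL_LOG=${VMKERNEL_LOG:-/var/log/vmkernel.log}

# Print log lines that mention NMI or LINT interrupts, ignoring case.
nmi_events() {
    grep -iE 'nmi|lint' "${1:-$VMKERNEL_LOG}"
}
```

For example, running nmi_events | tail -n 20 on the host shows the most recent matches. Older events may be in rotated log files, which this sketch does not search.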

Additional Information

Triggering the NMI

The location of the NMI button or switch varies depending on the hardware. A few examples follow:

- IBM x3650 M2 – The NMI button is on the diagnostic panel. There may also be a Send NMI button in the RSA. For more information, see the x3650 M2 Installation and User's Guide.
- HPE ProLiant – Review the HPE documentation on using the iLO management interface to send an NMI to the system.
- Dell R900 – The NMI button is on the front panel. For more information, see the R900 Systems Hardware Owner's Manual.
- Fujitsu PRIMERGY Servers (RX/TX) – The NMI button is on the front of the server. For more information, see the Operating Manual for the PRIMERGY Servers. The manual can be found at the Fujitsu website:
  1. Click [Industry standard servers] - [PRIMERGY Servers].
  2. Select the PRIMERGY server from the pulldown menu. For example, [PRIMERGY RX Servers] - [PRIMERGY RX300 Series] - [PRIMERGY RX300 S7].
  3. Download the Operating Manual and check for the NMI button location.
- Cisco UCS – Consult the Cisco documentation for the process to issue an NMI to the system:
  - IPMI command – ipmitool -I lan -H <RemoteServerBMCAddress> -U <Username> -a chassis power diag
  - UCSM command – diagnostic-interrupt

For information on how to trigger the NMI for a particular server system, consult the hardware vendor.
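
For servers whose BMC accepts IPMI over LAN, the ipmitool command listed above can be wrapped with a reachability check first. This is a minimal sketch, assuming ipmitool is installed on a management workstation and the BMC is reachable over the network. BMC_HOST, BMC_USER, and IPMI_IF are hypothetical environment variables introduced here. The interface defaults to lan as in the example above; some BMCs may instead require lanplus.

```shell
BMC_HOST=${BMC_HOST:-}     # hypothetical: BMC address of the unresponsive server
BMC_USER=${BMC_USER:-}     # hypothetical: BMC account name
IPMI_IF=${IPMI_IF:-lan}    # ipmitool interface; lanplus is an alternative

if [ -z "$BMC_HOST" ] || [ -z "$BMC_USER" ]; then
    echo "Set BMC_HOST and BMC_USER before running" >&2
elif ! command -v ipmitool >/dev/null 2>&1; then
    echo "ipmitool not found on this workstation" >&2
else
    # -a prompts for the password. The status query confirms that the BMC
    # answers; the NMI is sent only if that query succeeds.
    ipmitool -I "$IPMI_IF" -H "$BMC_HOST" -U "$BMC_USER" -a chassis power status &&
        ipmitool -I "$IPMI_IF" -H "$BMC_HOST" -U "$BMC_USER" -a chassis power diag
fi
```

Because -a is used for both commands, the password prompt appears twice. As noted in the Caution at the top of this article, sending the NMI can halt the host and power off its virtual machines.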