Determining why an ESXi host does not respond to user interaction at the console

Products

VMware vSphere ESXi
VMware vSphere ESXi 7.0
VMware vSphere ESXi 8.0
VMware vSphere ESXi 6.0

Issue/Introduction

An ESXi host is not reachable via the network with the vSphere Client
An ESXi host is not reachable via the network with ping
Virtual machines running on the ESXi host are not reachable via the network
vCenter Server reports the host as Not Responding
The ESXi host does not respond to local commands or input at the console
Pressing Alt + F12 at the console does not switch to the VMkernel log display

Resolution

A number of factors can cause an ESXi host to become unresponsive. For example:

Defective or unresponsive hardware
An operational busy loop in the VMkernel, driver module, or service console
A component holding a lock needed by other components
A process that is consuming a high amount of resources

Troubleshooting this type of issue after it has occurred is difficult because it is not possible to interact with the ESXi host while it is in this state.

Note: Many external influences may yield similar symptoms but have very different underlying issues. For example, a network outage can result in a situation where an ESXi host and all running virtual machines become unresponsive, console authentication using remote directory services fails, and remote BMC management fails.

These limitations further complicate troubleshooting:

If the issue has only occurred once, analysis is limited to the logs generated prior to the single occurrence.
If the issue has only occurred once, identifying patterns between multiple occurrences is not possible.
The logs generated by a single event may not be conclusive, and determining the root cause may not be possible.

If an ESXi host is currently in an unresponsive state, gather this information:

Press the NumLock key on the keyboard and observe whether the NumLock light changes state. A successful state change indicates that the BIOS is responsive.
Check if there is any active disk or network traffic using status lights or other hardware monitoring on the disk drive array, network interface cards or upstream switches. Active egress traffic indicates that the ESXi host is still functioning.
VMware HA monitors ESXi host availability in part based on response to ICMP (ping) network traffic. If the ESXi host is a member of an HA cluster, check the logs on other cluster members to determine when or if they lost access to this host.
Trigger an NMI at the hardware level and observe how ESXi responds. For more information, see Using hardware NMI facilities to troubleshoot unresponsive hosts. If a purple diagnostic screen occurs after triggering the NMI, take a screenshot.
Attempt to interact with the server via a baseboard management controller (BMC) interface, such as iLO, DRAC, or RSA. If aspects of this interface other than the console are also unresponsive, this indicates that the issue is hardware related.
Reboot the ESXi host.
Collect diagnostic information from the host for further analysis.
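
The checks above can be recorded and interpreted in a consistent way. The following Python sketch is illustrative only: the Observations class, its field names, and the interpret function are hypothetical and are not part of any VMware tool. It encodes only the conclusions stated in this article, and a value of None means the check was not performed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observations:
    """Manual checks on an unresponsive host (hypothetical structure).

    Each field is True/False for the observed result, or None if not checked.
    """
    numlock_light_toggles: Optional[bool] = None
    egress_traffic_seen: Optional[bool] = None
    purple_screen_after_nmi: Optional[bool] = None
    bmc_non_console_responsive: Optional[bool] = None


def interpret(obs: Observations) -> List[str]:
    """Map each recorded observation to the conclusion this article draws."""
    findings: List[str] = []
    if obs.numlock_light_toggles:
        findings.append("BIOS is responsive")
    if obs.egress_traffic_seen:
        findings.append("ESXi host is still functioning")
    if obs.purple_screen_after_nmi:
        findings.append("take a screenshot of the purple diagnostic screen")
    if obs.bmc_non_console_responsive is False:
        findings.append("issue appears to be hardware related")
    return findings


if __name__ == "__main__":
    example = Observations(numlock_light_toggles=True,
                           bmc_non_console_responsive=False)
    for finding in interpret(example):
        print(finding)
```

Recording the results this way makes it easier to compare occurrences later, but it does not replace collecting the diagnostic information from the host.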

If the issue is reproducible or occurs regularly, follow these steps to collect more data:

Set up top and esxtop in batch mode to collect performance data on the server leading up to the event.
Configure the system to fail with a purple screen error after receiving an NMI generated manually from the hardware. For more information on changing how an ESXi host reacts to an NMI, see Using hardware NMI facilities to troubleshoot unresponsive hosts. For more information on triggering an NMI at the hardware level, contact the hardware vendor.

Note: If the ESXi host does not respond after a hardware NMI is generated, the issue is likely due to unresponsive hardware. Contact the hardware vendor for further assistance in troubleshooting this issue.
Collect the logs and performance data for further analysis.
Ask the following questions. The answers may help determine the cause of the issue.
- How many times has the ESXi host experienced this condition?
- What were the exact times and dates that the host became unresponsive?
- Have any other hosts experienced this issue?
- What else was happening in the environment at the time of the events?
- Is there a pattern to the times when the host becomes unresponsive?
- Are there any regularly scheduled jobs running when the host becomes unresponsive?
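
For the batch mode collection step above, the following Python sketch shows one way to derive an esxtop command line from a sampling interval and a capture duration. It uses esxtop's batch options (-b for batch mode, -d for the delay between samples in seconds, -n for the number of iterations, and -a for all counters). The function names, the 10-second interval, the one-hour duration, and the output path are assumptions chosen for illustration, not recommended values.

```python
import shlex
from typing import List


def esxtop_batch_command(delay_s: int, duration_s: int,
                         all_counters: bool = False) -> List[str]:
    """Build an esxtop batch-mode argument list (hypothetical helper).

    The iteration count is the capture duration divided by the delay.
    """
    if delay_s <= 0 or duration_s < delay_s:
        raise ValueError("delay must be positive and no longer than duration")
    iterations = duration_s // delay_s
    cmd = ["esxtop", "-b"]
    if all_counters:
        cmd.append("-a")
    cmd += ["-d", str(delay_s), "-n", str(iterations)]
    return cmd


def shell_line(cmd: List[str], output_path: str) -> str:
    """Render the command with its output redirected to a CSV file."""
    quoted = " ".join(shlex.quote(part) for part in cmd)
    return quoted + " > " + shlex.quote(output_path)


if __name__ == "__main__":
    # Illustrative values only: sample every 10 seconds for one hour.
    command = esxtop_batch_command(delay_s=10, duration_s=3600)
    # The output path below is a placeholder, not a required location.
    print(shell_line(command, "/vmfs/volumes/datastore1/esxtop-capture.csv"))
```

The printed line is intended to be run in the ESXi shell. Writing the output to persistent storage means the capture is still available after the host is rebooted, and a shorter delay gives finer detail at the cost of a larger file.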