ESXi host intermittently unresponsive with no logging of events

Products

VMware vSphere ESXi

Issue/Introduction

ESXi hosts becoming unresponsive or freezing unexpectedly
Virtual machines (VMs) running on the affected hosts becoming inaccessible
Management interfaces, such as vSphere Client or SSH, becoming unresponsive
Gaps in logging data, with no entries recorded during the period of unresponsiveness
Inability to restart services or reboot the affected ESXi hosts gracefully

In some cases, ESXi hosts may become completely unresponsive, causing all processes to hang without any logging of events. This can make troubleshooting the issue challenging, as there is no record of what occurred during the incident. This article provides guidance on how to approach such situations and gather the necessary information for effective problem resolution.
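When the only evidence is a hole in the logs, it helps to pin down exactly when logging stopped and resumed. The following is a minimal sketch, assuming the log lines of interest begin with a UTC timestamp of the form 2024-05-01T12:00:00.123Z. The function name find_log_gaps and the 300-second default threshold are illustrative choices, not part of any VMware tooling.

```shell
# Sketch: report gaps between consecutive timestamped lines in a log file.
# Assumes each line of interest starts with a UTC timestamp such as
# 2024-05-01T12:00:00.123Z; lines without that prefix are ignored.
find_log_gaps() {
    # $1 = log file, $2 = smallest gap to report, in seconds (default 300)
    awk -v min_gap="${2:-300}" '
    # Convert a UTC calendar date and time to seconds since 1970-01-01.
    function to_epoch(y, m, d, hh, mi, ss,    mp, era, yoe, doy, doe) {
        if (m <= 2) y -= 1
        mp  = (m > 2) ? m - 3 : m + 9
        era = int(y / 400)
        yoe = y - era * 400
        doy = int((153 * mp + 2) / 5) + d - 1
        doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
        return (era * 146097 + doe - 719468) * 86400 + hh * 3600 + mi * 60 + ss
    }
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
        ts  = substr($0, 1, 19)
        now = to_epoch(substr(ts, 1, 4) + 0, substr(ts, 6, 2) + 0, substr(ts, 9, 2) + 0, substr(ts, 12, 2) + 0, substr(ts, 15, 2) + 0, substr(ts, 18, 2) + 0)
        if (seen && now - prev >= min_gap)
            printf "%s -> %s (%d s)\n", prev_ts, ts, now - prev
        prev = now; prev_ts = ts; seen = 1
    }
    ' "$1"
}

# Example call (the path is an assumption; use whichever log you are reviewing):
# find_log_gaps /var/log/vmkernel.log 600
```

Each output line gives the last timestamp before the gap, the first timestamp after it, and the gap length in seconds, which can be matched against the time the host was reported unresponsive.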

Environment

VMware vSphere ESXi (various versions)

Cause

The root cause of intermittent ESXi host unresponsiveness with no logging can vary, but some possible reasons include:

Hardware issues, such as faulty components or firmware incompatibilities
Driver or software conflicts
Resource exhaustion, such as memory or CPU constraints
Network connectivity problems

Resolution

To troubleshoot ESXi host unresponsiveness with no logging, follow these steps:

Enable ESXi Dump Collector:
a. Navigate to the vCenter Server Appliance Management Interface (VAMI).
b. Go to "Services" and locate the "ESXi Dump Collector" service.
c. Click "Edit Startup Type" and set it to "Automatic".
d. Click "Start" to enable the service.
Configure ESXi hosts to save core dumps:
a. Connect to each ESXi host's command line, for example over SSH.
b. Run the following command to point network core dumps at the ESXi Dump Collector:

esxcli system coredump network set --interface-name=vmk0 --server-ipv4=<vCenter_IP_Address> --server-port=6500

c. Replace <vCenter_IP_Address> with the IP address of your vCenter Server.
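The command above stores the collector address, but network core dumps also have to be switched on and are worth verifying before an incident occurs. The following is a minimal sketch that combines the command above with the esxcli subcommands for enabling, displaying, and checking the network dump configuration. The function name configure_netdump and the ESXCLI override variable are illustrative; the defaults (vmk0, port 6500) mirror the command above and may differ in your environment.

```shell
# Sketch only. Setting ESXCLI=echo previews the commands without
# changing anything on the host.
ESXCLI="${ESXCLI:-esxcli}"

configure_netdump() {
    # $1 = IPv4 address of the ESXi Dump Collector (vCenter Server)
    # $2 = VMkernel interface (default vmk0), $3 = collector port (default 6500)
    server_ip="$1"
    vmk="${2:-vmk0}"
    port="${3:-6500}"

    # Store the collector target, then turn network dumps on.
    "$ESXCLI" system coredump network set --interface-name="$vmk" --server-ipv4="$server_ip" --server-port="$port" &&
    "$ESXCLI" system coredump network set --enable true &&
    # Show the resulting configuration and test that the collector is reachable.
    "$ESXCLI" system coredump network get &&
    "$ESXCLI" system coredump network check
}

# Example call (192.0.2.10 is a placeholder address):
# configure_netdump 192.0.2.10
```

If the final check fails, a dump taken while the host is hung is unlikely to reach the collector, so resolve it before relying on this setup.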
Set up the ESXi host to capture diagnostic information on a Purple Screen of Death (PSOD):
Part of the troubleshooting process may involve forcing the host to generate a PSOD.
a. Connect to the ESXi host using SSH.
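b. Run the following command. Note that this option only suppresses the host's warning that no core dump target is configured; the dump targets themselves are set in the other steps: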

esxcli system settings advanced set -o /UserVars/SuppressCoredumpWarning -i 1

Configure ESXi hosts to capture VMkernel core dumps:
a. Connect to each ESXi host using SSH.
b. Run the following command to enable VMkernel core dumps:

esxcli system coredump file set --enable true
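Enabling a file-based dump target presupposes that a dump file exists. The following is a minimal sketch, assuming no dump file has been created yet, that creates one, lets the host select and activate it, and lists the result. The function name enable_file_coredump, the ESXCLI override variable, and the example datastore name are illustrative. Check the list output first and skip the add step if a suitable file is already present.

```shell
# Sketch only. Setting ESXCLI=echo previews the commands without
# changing anything on the host.
ESXCLI="${ESXCLI:-esxcli}"

enable_file_coredump() {
    # $1 = optional datastore name for the dump file (hypothetical example: datastore1)
    if [ -n "$1" ]; then
        "$ESXCLI" system coredump file add --datastore="$1" || return 1
    else
        # Without a datastore, the host chooses a location itself.
        "$ESXCLI" system coredump file add || return 1
    fi
    # --smart lets the host pick the dump file to activate.
    "$ESXCLI" system coredump file set --smart --enable true &&
    # Confirm that a file is listed as active and configured.
    "$ESXCLI" system coredump file list
}

# Example call:
# enable_file_coredump datastore1
```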
Monitor and collect diagnostic information:
a. If an ESXi host becomes unresponsive, connect to the server's out-of-band management controller (e.g., iDRAC, iLO) and capture a screenshot of the console.
b. Attempt to trigger a Non-Maskable Interrupt (NMI) to generate a VMkernel core dump:
• For Dell servers, use the "NMI" option in the iDRAC interface.
• For HPE servers, use the "Generate NMI" option in the iLO interface.
c. Collect the core dump files from the ESXi Dump Collector in the vCenter Server.
d. Gather a vm-support log bundle from each affected ESXi host.
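Once the host is reachable again, the log bundle can be generated from its command line with the vm-support utility. The following is a minimal sketch, assuming the bundle should be written to a directory with enough free space, such as a datastore path. The function name collect_support_bundle and the VM_SUPPORT override variable are illustrative.

```shell
# Sketch only. Setting VM_SUPPORT=echo previews the command without
# generating a bundle.
VM_SUPPORT="${VM_SUPPORT:-vm-support}"

collect_support_bundle() {
    # $1 = working directory where the bundle is written
    out_dir="$1"
    if [ ! -d "$out_dir" ]; then
        echo "no such directory: $out_dir" >&2
        return 1
    fi
    # -w sets the working directory for the generated bundle.
    "$VM_SUPPORT" -w "$out_dir"
}

# Example call (the datastore path is a placeholder):
# collect_support_bundle /vmfs/volumes/datastore1
```

Collect the bundle as soon as possible after the incident so that the logs on either side of the gap have not yet rotated out.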
Engage VMware Support:
a. Open a support case with VMware and provide the collected diagnostic information, including core dumps, vm-support log bundles, and console screenshots.
b. Work with VMware Support to analyze the data and identify the root cause of the issue.
Engage vendor support:
If the issue is suspected to be related to hardware, engage the server vendor's support team for further assistance.