ESXi host becomes unresponsive and unrebootable

ESXi host becomes unresponsive and unrebootable - requires cold boot

Article ID: 388024

Products

VMware vSphere ESXi

Issue/Introduction

During VxRail environment upgrades, particularly from ESXi 8.0.311 to 8.0.321, hosts can become unresponsive. The affected hosts typically remain in Maintenance Mode in vCenter as a result of the upgrade process, and when a reboot is attempted through vCenter Server, the host transitions to a "Not Responding" state.
Non-VxRail ESXi hosts are also affected.
Virtual machines, including vCLS VMs, may be shown as powered on even though the host cannot be accessed normally. Resolving the issue requires a cold boot (complete power cycle) of the host.

Environment

VMware ESXi 8.0 Update 3 (builds 24280767 and 24414501)
VxRail and non-VxRail environments
Servers including, but not limited to, Dell servers with iDRAC firmware version 7.00.00.173
vCenter Server managing the ESXi hosts
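
As a quick way to compare a host against the builds listed above, the following is a minimal sketch. It assumes the version string has the form "VMware ESXi 8.0.3 build-24280767" (the exact format is an assumption here), and the function names are hypothetical. A build that is not in the list is not necessarily unaffected; the check only reports whether it matches one of the two builds named in this article.

```python
import re

# Build numbers listed in the Environment section of this article.
AFFECTED_BUILDS = {24280767, 24414501}


def parse_build(version_line: str):
    """Return the build number found in a version string, or None if absent.

    Assumes the build appears as 'build-<digits>' (format is an assumption).
    """
    match = re.search(r"build-(\d+)", version_line)
    return int(match.group(1)) if match else None


def is_listed_build(version_line: str) -> bool:
    """True when the build is one of the two builds named in this article."""
    return parse_build(version_line) in AFFECTED_BUILDS
```

For example, `is_listed_build("VMware ESXi 8.0.3 build-24280767")` returns `True`, while a string with no build number returns `False`.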

Cause

The root cause of this issue is a deadlock in the ESXi vmsyslogd service. Analysis of the host's kernel dump shows that the thread responsible for reading messages from the /dev/log socket becomes stuck, preventing it from draining the socket buffer. This causes other processes that need to write log messages to become blocked when the socket buffer fills up.

When multiple critical system processes cannot write their log messages, they become unresponsive, eventually affecting the entire host. This logging issue is exacerbated during reboot operations, which explains why it's seen most frequently during VxRail upgrades that involve multiple host reboots.
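
The general mechanism described above (a reader that stops draining a socket eventually stalls every writer) can be illustrated with a small sketch. This is not vmsyslogd's implementation: it uses a generic connected socket pair as a stand-in for /dev/log and a logging client, and the function name is hypothetical. The writer is set to non-blocking mode only so the demonstration can report the point at which a normal, blocking writer would hang.

```python
import socket


def fill_until_blocked(message: bytes = b"x" * 1024, limit: int = 100_000) -> int:
    """Send to a peer that never reads; return how many send calls
    succeeded before the next one would have blocked."""
    # reader stands in for the stuck log daemon, writer for a logging process.
    reader, writer = socket.socketpair()
    writer.setblocking(False)  # raise instead of hanging once the buffer is full
    sent = 0
    try:
        while sent < limit:
            try:
                writer.send(message)
            except BlockingIOError:
                # Buffer is full: a blocking writer would stall right here.
                break
            sent += 1
    finally:
        reader.close()
        writer.close()
    return sent
```

Because the reader never calls `recv`, only a finite number of sends succeed. In the blocking case, every process that logs after that point waits indefinitely, which matches the behavior described for processes writing to /dev/log.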

Resolution

This issue is fixed in ESXi 8.0 Update 3e (Patch 05).

Immediate Recovery for Affected Hosts

If a host becomes unresponsive (and, if it is part of a VxRail upgrade, is stuck in Maintenance Mode):
1. Access the server's iDRAC/BMC interface
2. Perform a complete power cycle (cold boot) of the ESXi host
3. Allow the host to boot normally

Temporary Workaround

If you encounter this issue before updating to ESXi 8.0 Update 3e, you can try the following steps when a host becomes unresponsive:

If SSH access to the host is still available:
1. Try restarting the vmsyslogd service with:

  services.sh restart vmsyslogd
2. If the host becomes responsive after this, exit maintenance mode normally

If the host is completely unresponsive to remote commands:
1. Trigger an NMI (Non-Maskable Interrupt) via the iDRAC
2. If the host doesn't recover, a cold boot will be necessary
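
Where the BMC exposes a Redfish interface, the NMI and cold boot steps above can in principle be scripted instead of performed through the web console. The following is a hedged sketch only: the action path shown is the one commonly used by iDRAC and is an assumption here, the function name is hypothetical, and authentication and TLS settings are deliberately left out. The function builds the request without sending it, so it can be inspected before use; verify the path and supported reset types against your BMC's documentation.

```python
import json
import urllib.request

# Assumed iDRAC Redfish action path; confirm it for your BMC and firmware.
RESET_PATH = "/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset"

# Redfish reset types relevant to this article:
# "Nmi" for a non-maskable interrupt, "PowerCycle" for a cold boot.
ALLOWED_RESET_TYPES = ("Nmi", "PowerCycle")


def build_reset_request(bmc_host: str, reset_type: str) -> urllib.request.Request:
    """Build (but do not send) a Redfish ComputerSystem.Reset request."""
    if reset_type not in ALLOWED_RESET_TYPES:
        raise ValueError(f"unsupported reset type: {reset_type!r}")
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{bmc_host}{RESET_PATH}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request (for example with `urllib.request.urlopen`) would additionally require credentials and certificate handling appropriate to the environment, which this sketch does not cover.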

Additional Information

This issue is more likely to occur during operations that involve host reboots, including VxRail upgrades.
The problem is caused by a rare deadlock in the ESXi log daemon (vmsyslogd).
The issue may be triggered by syslog reloads that occur during reboot operations.
In the observed cases, 10 or more threads were blocked waiting to write to the /dev/log socket.
Multiple critical system processes, including hostd, vpxa, and dcui, can be affected.

For more information about ESXi 8.0 Update 3e, see the release notes.
