ESXi Host Becomes Unresponsive and Unrebootable - Requires Cold Boot During VxRail Upgrades

Article ID: 388024

Products

VMware vSphere ESXi

Issue/Introduction

During VxRail upgrades, particularly from VxRail 8.0.311 to 8.0.321, ESXi hosts can become unresponsive. Affected hosts typically appear in Maintenance Mode in vCenter, but when a reboot is attempted through vCenter Server, the host transitions to a "Not Responding" state. Virtual machines, including vCLS VMs, may still be shown as powered on, but the host cannot be accessed normally. Recovery requires a cold boot (complete power cycle) of the host.

Environment

  • VMware ESXi 8.0 Update 3 (builds 24280767 and 24414501)
  • VxRail environments
  • Dell servers with iDRAC firmware version 7.00.00.173
  • vCenter Server managing the ESXi hosts
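To check quickly whether a host is running one of the builds listed above, the build number can be extracted from the host's version string. The following is a minimal sketch, not an official tool. It assumes the version string has the form printed by `vmware -v` on the ESXi shell (for example "VMware ESXi 8.0.3 build-24280767"); the names `LISTED_BUILDS`, `parse_build`, and `is_listed_build` are hypothetical. A host on a build that is not listed here is not necessarily unaffected.

```python
import re
from typing import Optional

# Build numbers taken from the Environment list in this article.
LISTED_BUILDS = {24280767, 24414501}


def parse_build(version_line: str) -> Optional[int]:
    """Extract the numeric build from a string such as 'VMware ESXi 8.0.3 build-24280767'."""
    match = re.search(r"build-(\d+)", version_line)
    return int(match.group(1)) if match else None


def is_listed_build(version_line: str) -> bool:
    """Return True if the version string carries one of the builds listed in this article."""
    return parse_build(version_line) in LISTED_BUILDS


if __name__ == "__main__":
    # Assumed sample input; on a real host, paste the output of `vmware -v`.
    print(is_listed_build("VMware ESXi 8.0.3 build-24280767"))
```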

Cause

The root cause is a deadlock in the ESXi vmsyslogd service. Analysis of the host's kernel dump shows that the thread responsible for reading messages from the /dev/log socket becomes stuck and stops draining the socket buffer. Once the buffer fills, any process that tries to write a log message blocks.

When multiple critical system processes cannot write their log messages, they become unresponsive, eventually affecting the entire host. This logging issue is exacerbated during reboot operations, which explains why it's seen most frequently during VxRail upgrades that involve multiple host reboots.
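The blocking behavior described above can be reproduced in miniature on an ordinary Linux or macOS machine. The sketch below is only an illustration of the general mechanism, not ESXi code: it assumes a Unix datagram socket pair as a stand-in for /dev/log, and the function name `fill_undrained_socket` is hypothetical. One end plays the stuck reader and never calls `recv`; the other end keeps sending until the kernel refuses further data.

```python
import errno
import socket


def fill_undrained_socket(message: bytes = b"x" * 512, limit: int = 100_000) -> int:
    """Send datagrams to a peer that never reads; return how many were accepted."""
    # 'reader' stands in for the stuck log-daemon thread, 'writer' for a logging process.
    reader, writer = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    # Non-blocking mode turns "would block indefinitely" into an error we can observe.
    writer.setblocking(False)
    accepted = 0
    try:
        while accepted < limit:
            try:
                writer.send(message)
            except OSError as exc:
                if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK, errno.ENOBUFS):
                    break  # Buffer is full: a blocking writer would hang at this point.
                raise
            accepted += 1
    finally:
        reader.close()
        writer.close()
    return accepted


if __name__ == "__main__":
    count = fill_undrained_socket()
    print(f"writer was refused after {count} undelivered messages")
```

With a blocking socket, which is how most processes write log messages, the final `send` would not return at all. That is the state the affected processes end up in once the reader stops draining the buffer.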

Resolution

This issue is fixed in ESXi 8.0 Update 3e (Patch 05).

Immediate Recovery for Affected Hosts

  1. If a host becomes unresponsive and is stuck in Maintenance Mode:

    1. Access the server's iDRAC/BMC interface

    2. Perform a complete power cycle (cold boot) of the ESXi host

    3. Allow the host to boot normally

Temporary Workaround

If you encounter this issue before updating to ESXi 8.0 Update 3e, you can try the following steps when a host becomes unresponsive:

  1. If SSH access to the host is still available:

    1. Try restarting the vmsyslogd service with:

      services.sh restart vmsyslogd

    2. If the host becomes responsive after this, exit maintenance mode normally

  2. If the host is completely unresponsive to remote commands:

    1. Trigger an NMI (Non-Maskable Interrupt) from the iDRAC

    2. If the host doesn't recover, a cold boot will be necessary
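The choice between the two workaround branches above depends on whether SSH is still available. A rough way to automate that first check from a management workstation is sketched below. This is an assumption-laden illustration: the function names `ssh_reachable` and `recovery_path` are hypothetical, and a successful TCP connection to port 22 only shows that the port accepts connections. It does not prove that an SSH login will complete on a host in this state.

```python
import socket


def ssh_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the host's SSH port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def recovery_path(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Map the reachability result onto the two workaround branches in this article."""
    if ssh_reachable(host, port, timeout):
        return "ssh: try restarting vmsyslogd, then exit maintenance mode if the host recovers"
    return "idrac: trigger an NMI, then cold boot if the host does not recover"


if __name__ == "__main__":
    # Hypothetical host name; replace with the management address of the affected host.
    print(recovery_path("esxi01.example.com"))
```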

Additional Information

  • This issue is more likely to occur during operations that involve host reboots, including VxRail upgrades
  • The problem occurs because of a rare deadlock situation in the ESXi log daemon (vmsyslogd)
  • The issue may be triggered by syslog reloads that occur during reboot operations
  • In the observed cases, 10 or more threads can become blocked waiting to write to the /dev/log socket
  • Multiple critical system processes including hostd, vpxa, and dcui can be affected
  • The PR (Problem Report) numbers associated with this issue include: 3465021, 3377479, and 3501897

For more information about ESXi 8.0 Update 3e, see the release notes: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/release-notes/esxi-update-and-patch-release-notes/vsphere-esxi-80u3e-release-notes.html