During VxRail upgrades, particularly from ESXi 8.0.311 to 8.0.321, hosts can become unresponsive. The affected hosts typically appear in Maintenance Mode in vCenter, but when a reboot is attempted through vCenter Server, the host transitions to a "Not Responding" state. Virtual machines, including vCLS VMs, may be shown as powered on, but the host cannot be accessed normally. This issue requires a cold boot (complete power cycle) of the host to resolve.
The root cause of this issue is a deadlock in the ESXi vmsyslogd service. Analysis of the host's kernel dump shows that the thread responsible for reading messages from the /dev/log socket becomes stuck, preventing it from draining the socket buffer. This causes other processes that need to write log messages to become blocked when the socket buffer fills up.
When multiple critical system processes cannot write their log messages, they become unresponsive, eventually affecting the entire host. This logging issue is exacerbated during reboot operations, which explains why it's seen most frequently during VxRail upgrades that involve multiple host reboots.
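The blocking behavior described above can be reproduced in miniature. The following is only an illustrative sketch, not vmsyslogd code: it assumes a local stream socket pair as a stand-in for the /dev/log socket, and the names CHUNK, SAFETY_CAP, and bytes_until_writer_blocks are hypothetical. The reader end is never drained, so the writer fills the buffer and reaches the point where an ordinary blocking write would stall.

```python
import socket

CHUNK = b"x" * 4096              # stand-in for a batch of log messages (assumption)
SAFETY_CAP = 64 * 1024 * 1024    # stop the demo if the buffer somehow never fills


def bytes_until_writer_blocks() -> int:
    """Write to a socket whose reader never drains it.

    Returns the number of bytes the kernel accepted before a blocking
    writer would have stalled.
    """
    # The pair is a stand-in for /dev/log: "reader" plays the stuck
    # log-reading thread, "writer" plays a process trying to log.
    reader, writer = socket.socketpair()
    writer.setblocking(False)    # raise instead of hanging the demo
    total = 0
    try:
        while total < SAFETY_CAP:
            try:
                total += writer.send(CHUNK)
            except BlockingIOError:
                # Buffer is full: a blocking writer would hang here
                # until the reader drained the socket.
                break
    finally:
        reader.close()
        writer.close()
    return total


if __name__ == "__main__":
    print("bytes accepted before the writer would block:",
          bytes_until_writer_blocks())
```

The exact byte count depends on the operating system's socket buffer sizes. The point is only that it is finite: once the reader stops, every writer eventually stalls.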
This issue is fixed in ESXi 8.0 Update 3e (Patch 05).
If you encounter this issue before updating to ESXi 8.0 Update 3e, you can try the following steps when a host becomes unresponsive:
If SSH access to the host is still available:
Try restarting the vmsyslogd service with: services.sh restart vmsyslogd
If the host becomes responsive after this, exit maintenance mode normally
If the host is completely unresponsive to remote commands:
Trigger an NMI (Non-Maskable Interrupt) through the iDRAC
If the host doesn't recover, a cold boot will be necessary
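The choice between the two branches above can be sketched as a simple reachability check. This is a minimal sketch under stated assumptions, not a supported tool: the function names and the example hostname are hypothetical, and an open TCP port 22 only suggests that SSH may be usable, since a hung host can accept the connection and still fail to start a session.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the host's SSH port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def suggested_next_step(reachable: bool) -> str:
    """Map the reachability result to the workaround branches in this article."""
    if reachable:
        return "SSH port answers: try restarting vmsyslogd over SSH"
    return "No SSH response: trigger an NMI through the iDRAC, then cold boot if needed"


# Example usage (hostname is a placeholder):
# print(suggested_next_step(ssh_port_reachable("esxi-host.example.com")))
```

A short timeout keeps the check from hanging against the very host it is probing.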
For more information about ESXi 8.0 Update 3e, see the release notes: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/release-notes/esxi-update-and-patch-release-notes/vsphere-esxi-80u3e-release-notes.html