ESXi Host becomes unresponsive or reboots unexpectedly without core dumps

Article ID: 408216

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • The ESXi host experiences a sudden loss of availability or an unexpected restart. The following conditions indicate this issue:
  • The host is unresponsive in vCenter Server, ESXi host client, and the Direct Console User Interface (DCUI).
  • The host cannot be managed via third-party remote management (e.g., iDRAC, iLO, IPMI).
  • No Purple Screen of Death (PSOD) or core dumps are found, despite a valid core dump partition or file being configured.
  • /var/run/log/vmkernel.log and /var/run/log/vobd.log do not show any kernel, storage, or network driver errors leading up to the event.

  • The host unexpectedly reboots or has to be restarted, as seen in the IPMI logs and the cim-diagnostics.sh output:
    Assert + System ACPI Power State S4/S5: soft-off

  • /var/run/log/vobd.log may record a power-off event that was not initiated by the ESXi kernel:
    [vob.user.host.poweroff.reason.unavailable] The host is being powered off. The poweroff was not the result of a kernel error, deliberate reboot, or shut down. This could indicate a hardware issue. Hardware may reboot abruptly due to power outages, faulty components, and heating issues. To investigate further, engage the hardware vendor.
  • /var/run/log/hostd.log contains the following:
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100106]: [Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->    key = 130,
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->    chainId = -945244864,
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->    createdTime = "YYYY-MM-DDT00:00:00Z",
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->    userName = "",
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->    host = (vim.event.HostEventArgument) {
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->       name = "hostfqdn",
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->       host = 'vim.HostSystem:ha-host'
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->    },
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->    eventTypeId = "esx.audit.host.poweroff.reason.unavailable",
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->    objectId = "ha-host",
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: -->    objectType = "vim.HostSystem",
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100066]: --> }
    [YYYY-MM-DDTHH:MM:SS] In(166) Hostd[2100106]: [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 259 : Host had been powered off. The poweroff was not the result of a kernel error, deliberate reboot, or shut down. This could indicate a hardware issue. Hardware may reboot abruptly due to power outages, faulty components, and heating issues. To investigate further, engage the hardware vendor.
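
The log signatures above can be searched for programmatically. The following is a minimal Python sketch, not an official VMware tool: it looks for the two event identifiers quoted in the excerpts (vob.user.host.poweroff.reason.unavailable and esx.audit.host.poweroff.reason.unavailable) in copies of vobd.log and hostd.log. The function names are hypothetical, and the default paths assume the logs are still present at the locations named in this article; for a support bundle, pass the extracted file paths instead.

```python
# Event identifiers quoted in the vobd.log and hostd.log excerpts above.
POWEROFF_EVENT_IDS = (
    "vob.user.host.poweroff.reason.unavailable",
    "esx.audit.host.poweroff.reason.unavailable",
)


def find_unexplained_poweroffs(lines):
    """Return (line_number, line) pairs that mention either event ID."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(event_id in line for event_id in POWEROFF_EVENT_IDS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_file(path):
    """Scan one log file; a missing or unreadable file yields no hits."""
    try:
        with open(path, "r", errors="replace") as handle:
            return find_unexplained_poweroffs(handle)
    except OSError:
        return []


if __name__ == "__main__":
    # Paths as named in this article; adjust for an extracted log bundle.
    for log_path in ("/var/run/log/vobd.log", "/var/run/log/hostd.log"):
        for number, line in scan_log_file(log_path):
            print(f"{log_path}:{number}: {line}")
```

A match only confirms that the power-off was recorded as not initiated by the kernel; it does not identify the failing component.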

Cause

This behavior typically indicates a hardware-level interruption that occurs outside the control of the ESXi operating system.

Common causes include:

Physical Power Loss: Sudden loss of power to the server or a faulty Power Supply Unit (PSU).

Hardware Faults: Motherboard failure, CPU thermal trip (overheating), or memory errors that trigger a hard reset.

Automatic System Recovery (ASR): If enabled in the BIOS/Firmware, the hardware may reboot the server automatically upon a hang. This often occurs so quickly that ESXi cannot write a PSOD or core dump to disk.

Resolution

System logs exclude the kernel as the source of the shutdown. Further diagnostic efforts should focus on physical hardware and power components.

  1. Engage hardware vendor: Contact your server OEM (Dell, HPE, Lenovo, etc.) to run deep hardware diagnostics. 

  2. Verify power integrity: Check with the datacenter facility team for power fluctuations, tripped breakers, or PDU failures at the time of the outage.

  3. Disable ASR for troubleshooting: Temporarily disable Automatic System Recovery (ASR) or "Wake-on-LAN" features in the BIOS.

    Note: This allows the host to remain in a halted state (PSOD) if a kernel error occurs, providing the opportunity to capture diagnostic data.

  4. Check thermal status: Review ambient temperature and internal fan speeds in the hardware management logs to rule out thermal shutdowns.
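
When reviewing power and thermal records (steps 2 and 4), it can help to narrow the hardware management log to entries close to the outage time. The sketch below is illustrative only: the function name, the keyword list, and the 15-minute default window are assumptions, and it expects entries that have already been exported and parsed into (timestamp, message) pairs, since export formats differ between vendors.

```python
from datetime import datetime, timedelta

# Illustrative keywords only; vendor log wording differs.
SUSPECT_KEYWORDS = ("power", "psu", "temperature", "thermal", "fan")


def entries_near_outage(entries, outage_time, window_minutes=15):
    """Return keyword-matching entries within the window around the outage.

    entries: iterable of (datetime, message) tuples. All timestamps must be
    consistently naive or consistently timezone-aware.
    """
    window = timedelta(minutes=window_minutes)
    matches = []
    for timestamp, message in entries:
        # Skip entries too far from the outage in either direction.
        if abs(timestamp - outage_time) > window:
            continue
        lowered = message.lower()
        if any(keyword in lowered for keyword in SUSPECT_KEYWORDS):
            matches.append((timestamp, message))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical data: only the first message is taken from this article.
    outage = datetime(2024, 1, 1, 12, 0)
    sample = [
        (datetime(2024, 1, 1, 11, 55), "Assert + System ACPI Power State S4/S5: soft-off"),
        (datetime(2024, 1, 1, 9, 0), "Fan 3 speed below threshold"),
    ]
    for timestamp, message in entries_near_outage(sample, outage):
        print(timestamp.isoformat(), message)
```

The matching entries are a starting point for the conversation with the hardware vendor or facility team, not a diagnosis.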