Investigating Unexpected ESXi Reboot or Shutdown


Article ID: 317245


Products

VMware vSphere ESXi
VMware vSphere ESXi 7.0
VMware vSphere ESXi 8.0

Issue/Introduction

  • The ESXi host rebooted abruptly.
  • The ESXi host powered off abruptly.
  • The host rebooted unexpectedly.
  • The host encountered a crash and restarted automatically.
  • A vSphere HA event on the host resulted in the VMs on that host being restarted.

Environment

VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x
VMware vSphere ESX 9.x

Cause

ESXi host reboot or shutdown events can generally be classified into three main categories:
  1. User-initiated: Triggered via CLI, UI, DCUI, IPMI, or similar management interfaces.
  2. Kernel crash: Caused by PSOD events (e.g., MCEs, NMIs, software faults, etc.).
  3. Unknown to ESXi: Typically attributed to hardware-related issues (e.g., power outages, faulty components, etc.).

Resolution

To determine the category of a reboot or shutdown, begin by reviewing the /var/run/log/vmksummary.log on the ESXi host.
The vmksummary log records an entry every hour, at the top of the hour. It also records boot and shutdown entries that help determine whether a reboot or shutdown was expected.
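Because an entry is written every hour, the last hourly entry before a boot message bounds the time at which the host went down. The following is a minimal Python sketch of that idea, not a supported tool. It assumes that timestamps follow the YYYY-MM-DDTHH:MM:SS.###Z layout, that hourly entries contain the word heartbeat, and that boot entries contain "Host has booted", as in the samples in this article. The function names and the sample values are illustrative only.

```python
from datetime import datetime

# Timestamp layout used by the sample entries: YYYY-MM-DDTHH:MM:SS.###Z
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it does not parse."""
    fields = line.split()
    if not fields:
        return None
    try:
        return datetime.strptime(fields[0], TS_FORMAT)
    except ValueError:
        return None


def outage_windows(lines):
    """Pair each boot entry with the last heartbeat logged before it."""
    windows = []
    last_heartbeat = None
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None:
            continue
        if "heartbeat" in line:
            last_heartbeat = stamp
        elif "Host has booted" in line and last_heartbeat is not None:
            windows.append((last_heartbeat, stamp))
            last_heartbeat = None
    return windows


# Illustrative entries only; the values are made up for this sketch.
sample = [
    "2024-01-01T10:00:01.120Z heartbeat[2100]: up 12d3h10m05s, 4 VMs; [] []",
    "2024-01-01T11:00:01.118Z heartbeat[2100]: up 12d4h10m05s, 4 VMs; [] []",
    "2024-01-01T11:47:30.500Z bootstop[2098]: Host has booted",
]
for last_seen, booted in outage_windows(sample):
    print(f"host went down between {last_seen} and {booted}")
```

To use it against a real log, pass the lines of a copy of vmksummary.log to outage_windows. The resulting window is a useful starting point when correlating with hardware or power events.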

  1. User-initiated: Triggered via CLI, UI, DCUI, IPMI, etc.

    • If the ESXi reboot or shutdown was initiated by a user, /var/run/log/vmksummary.log will contain entries similar to the following:

      YYYY-MM-DDTHH:MM:SS.###Z bootstop[######]: Host is halting
      YYYY-MM-DDTHH:MM:SS.###Z Host has booted


      The above logs indicate a user-initiated shutdown.

      YYYY-MM-DDTHH:MM:SS.###Z bootstop[######]: Host is rebooting
      YYYY-MM-DDTHH:MM:SS.###Z Host has booted

      The above logs indicate a user-initiated reboot.

 

    • An ESXi host can also be shut down or rebooted through out-of-band management interfaces such as iLO or iDRAC. 

      Depending on the hardware setup and type of shutdown or reboot, ACPI events may be sent to ESXi. In these cases, vmksummary.log will not show a halting or reboot message. Instead, the below log message will be printed:

      YYYY-MM-DDTHH:MM:SS.###Z cpu#:#######)VMKAcpi: ###: Power button pressed; requesting graceful shutdown and poweroff

    • To identify the user who initiated the reboot, refer to the Additional Information section of this KB article.

  2. Kernel crash: Caused by PSOD events (e.g., MCEs, NMIs, software faults, etc.).
    • If ESXi has a kernel crash and successfully dumps core, the following messages will appear in the vmksummary.log after the system reboots:

      Log file: /var/run/log/vmksummary.log
      
      YYYY-MM-DDTHH:MM:SS.###Z heartbeat[########]: up ##d#h##m##s, ## VMs; [[####### vmx #kB] [####### vmx #kB] [####### vmx #kB]] []
      YYYY-MM-DDTHH:MM:SS.###Z heartbeat[########]: up ##d#h##m##s, ## VMs; [[####### vmx #kB] [####### vmx #kB] [####### vmx #kB]] []
      YYYY-MM-DDTHH:MM:SS.###Z bootstop[######]: file core dump found
      YYYY-MM-DDTHH:MM:SS.###Z Host has booted
    • If the VMware ESXi host has encountered a kernel error, refer to Interpreting an ESXi host purple diagnostic screen.

  3. Unknown to ESXi: Typically attributed to hardware-related issues (e.g., power outages, faulty components, etc.).
    • If ESXi is unable to determine the cause of a shutdown or reboot, the entries in /var/run/log/vmksummary.log will appear similar to the following:

      YYYY-MM-DDTHH:MM:SS.###Z heartbeat[########]: up ##d#h##m##s, ## VMs; [[####### vmx #kB] [####### vmx #kB] [####### vmx #kB]] []
      YYYY-MM-DDTHH:MM:SS.###Z heartbeat[########]: up ##d#h##m##s, ## VMs; [[####### vmx #kB] [####### vmx #kB] [####### vmx #kB]] []
      YYYY-MM-DDTHH:MM:SS.###Z bootstop[#######]: Host has booted
      YYYY-MM-DDTHH:MM:SS.###Z heartbeat[#######]: up #d#h##m##s, # VM; [[####### hostd #####kB] [####### vsanmgmtd #kB] [####### vmx #kB]] []

 

 

    • As the example above shows, when the reboot or shutdown was neither user-initiated nor caused by a PSOD, vmksummary.log contains only the "Host has booted" message, with no preceding halting, rebooting, or core dump entry.

    • A corresponding event is then recorded in /var/run/log/vobd.log, as shown below:

      YYYY-MM-DDTHH:MM:SS.###Z In(##) vobd[#######]:  Successfully sent event ([esx.audit.host.poweroff.reason.unavailable] The host is being powered off. The poweroff was not the result of a kernel error, deliberate reboot, or shut down. This could indicate a hardware issue. Hardware may reboot abruptly due to power outages, faulty components, and heating issues. To investigate further, engage the hardware vendor)

    • If an ESXi host outage is not caused by a user-initiated reboot, shutdown, or kernel error, the physical hardware likely restarted abruptly. This can happen due to power loss, faulty components, or overheating. Contact the hardware vendor to investigate further.

    • Additionally, the following messages can be found in the /var/run/log/hostd.log file:

      YYYY-MM-DDTHH:MM:SS.###Z In(##) Hostd[#####]: [Originator@#### sub=Vimsvc.ha-eventmgr] Event ##### : The host is being powered off through hostd. Reason for powering off: The host is being powered off through Advanced Configuration and Power Interface (ACPI)., User: UNKNOWN_USER.

Note: ACPI (Advanced Configuration and Power Interface) is the industry standard for power management. In this context, the message indicates that the server's physical power button was pressed, or that a hardware management command was sent through IPMI or an out-of-band management controller (such as iDRAC, iLO, or IMM). The UNKNOWN_USER designation means the action was not performed by a recognized VMware system account.
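The three log signatures described in this section can be checked programmatically. The following is a minimal Python sketch under stated assumptions, not a supported tool. It looks at the entry logged immediately before each "Host has booted" message and maps it to one of the three categories. The marker strings are copied from the sample entries above. Treating the ACPI "Power button pressed" message as appearing in the same file is an assumption, and the function names and sample values are hypothetical.

```python
# Marker substrings copied from the sample log entries in this article.
BOOT_MARKER = "Host has booted"
USER_MARKERS = ("Host is halting", "Host is rebooting", "Power button pressed")
CRASH_MARKER = "core dump found"


def category_for(previous_line):
    """Map the entry logged just before a boot message to a reboot category."""
    if CRASH_MARKER in previous_line:
        return "kernel crash"
    if any(marker in previous_line for marker in USER_MARKERS):
        return "user-initiated"
    # Only a heartbeat (or nothing at all) precedes the boot message.
    return "unknown to ESXi"


def classify_boots(lines):
    """Return a (timestamp, category) tuple for every boot entry in the log."""
    results = []
    previous = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if BOOT_MARKER in line:
            timestamp = line.split()[0]
            results.append((timestamp, category_for(previous)))
        previous = line
    return results


# Illustrative entries only; the values are made up for this sketch.
sample = [
    "2024-01-01T10:00:00.000Z heartbeat[1001]: up 1d0h0m0s, 2 VMs; [] []",
    "2024-01-01T10:20:00.000Z bootstop[1002]: Host is rebooting",
    "2024-01-01T10:25:00.000Z Host has booted",
    "2024-01-01T11:00:00.000Z heartbeat[1003]: up 0d0h35m0s, 2 VMs; [] []",
    "2024-01-01T11:30:00.000Z bootstop[1004]: file core dump found",
    "2024-01-01T11:30:01.000Z Host has booted",
    "2024-01-01T12:00:00.000Z heartbeat[1005]: up 0d0h30m0s, 2 VMs; [] []",
    "2024-01-01T12:40:00.000Z bootstop[1006]: Host has booted",
]
for timestamp, category in classify_boots(sample):
    print(timestamp, category)
```

A boot entry that is the first line of a rotated log file has no preceding entry and is reported as "unknown to ESXi", so check the previous log file before drawing a conclusion. An "unknown" result should still be confirmed against vobd.log and the hardware logs as described above.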

Additional Information

  • User-initiated: Triggered via CLI, UI, DCUI, IPMI, etc.

    • If a reboot was initiated by a user from the vSphere Client, a task is logged for the host. The task is visible in the vCenter UI and shows the user who performed the reboot or restart:
      1. Log in to the vSphere Client.
      2. Select the affected host.
      3. Navigate to the Monitor tab.
      4. Select Tasks.

    • This information can also be found in the ESXi hostd logs.

      Log file: /var/run/log/hostd.log
      
      [YYYY-MM-DDTHH:MM:SS] In(###) Hostd[######]: [Originator@#### sub=Vimsvc.TaskManager opID=####-####:####-#### sid=######## user=vpxuser:####-####] Task Created : haTask-ha-host-vim.HostSystem.reboot-##########
      [YYYY-MM-DDTHH:MM:SS] In(###) Hostd[######]: -->    eventTypeId = "esx.audit.hostd.host.reboot.reason",
      [YYYY-MM-DDTHH:MM:SS] In(###) Hostd[######]: -->          value = "The host is being rebooted through hostd."
      [YYYY-MM-DDTHH:MM:SS] In(###) Hostd[######]: [Originator@#### sub=Vimsvc.ha-eventmgr] Event #### : The host is being rebooted through hostd. Reason for reboot: The host is being rebooted through hostd., User: vpxuser:<User_name>.
  • Kernel crash: Caused by PSOD events (e.g., MCEs, NMIs, software faults, etc.).

    • ASR (Automatic Server Recovery) is a feature available on hardware from some vendors that can automatically reboot a host that appears to be hung or crashed. While ASR can be useful for restoring a host to a functional state without manual intervention, its configuration requires a careful balance between uptime and diagnostic visibility.
    • Consider the following when deciding whether and how to configure ASR:

      • Maximize Timeout: If you are enabling ASR, set its timeout to the maximum value available. This gives ESXi more time to capture vital diagnostic data during a PSOD, before ASR resets the system. Example: For Dell iDRAC, this is typically 720 seconds.
      • Debugging Trade-Off: For scenarios where root-cause analysis is more critical than immediate uptime, consider disabling ASR. This prevents the hardware from forcing a reboot while the OS is in a hung or crashed state, allowing support teams more time to manually inspect the system, capture the PSOD contents, and ensure a complete core dump is written to disk.
      • Detection of reboot reason: Current ESXi releases do not automatically detect during boot that the system was reset by ASR, so the reboot falls into the "unknown reason" category. Use the standalone script dell_ASR_rebootreason_analyzer_GSS_v1.py to detect whether ASR was the cause of the system reset.

Steps to disable ASR:

    • Dell (check with Dell for the latest steps for this procedure):

      • Log on to the iDRAC with the Admin account.
      • Click the "iDRAC Settings" tab.
      • Click the "Settings" tab.
      • Go to the "iDRAC Service Module" tab.
      • Change the setting "Service on Host OS" to "Disabled".





    • HPE: 

      • HPE hosts have an ASR feature; however, it is active only during power-on initialization of firmware, not while ESXi is running.

  • Unknown to ESXi: Typically attributed to hardware-related issues (e.g., power outages, faulty components, etc.).

    • If the cause of the reboot is unknown, it may be helpful to review the information provided to ESXi by the IPMI controller. To obtain this data, run the following command on the ESXi shell:
      localcli hardware ipmi sel list
    • Note that this information is also recorded in the host's out-of-band management tool (such as iLO or iDRAC). Engage the hardware vendor for further clarification on IPMI messages.

      Example event:
      
      Record:####:
      Record Id: ####
      When: ####-##-##T##:##:##
         Event Type: ### (Unknown)
         SEL Type: # (System Event)
         Message: Deassert + Power Supply Presence detected
      Sensor Number: ##
      Raw:
      Formatted-Raw: ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
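
When the event log is long, it can help to narrow it down to the records logged around the time of the outage. The following is a minimal Python sketch, assuming the output of the command above has been saved to a text file and follows the record layout shown in the example event. The function names, the sample values, and the time window are illustrative assumptions, not part of any VMware tool.

```python
from datetime import datetime


def parse_sel(text):
    """Split SEL listing text into one dict of fields per record."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key == "Record":
            # A "Record:<n>:" line starts a new record.
            current = {"Record": value.strip(" :")}
            records.append(current)
        elif current is not None:
            current[key.strip()] = value.strip()
    return records


def events_between(records, start, end):
    """Return the records whose 'When' field falls inside [start, end]."""
    selected = []
    for record in records:
        try:
            when = datetime.fromisoformat(record.get("When", ""))
        except ValueError:
            continue  # Missing or unparseable timestamp.
        when = when.replace(tzinfo=None)  # Compare as naive timestamps.
        if start <= when <= end:
            selected.append(record)
    return selected


# Illustrative records only; the values are made up for this sketch.
SAMPLE = """\
Record:1:
Record Id: 1
When: 2024-01-01T09:15:00
   Event Type: 111 (Unknown)
   SEL Type: 2 (System Event)
   Message: Assert + Power Supply Presence detected
Sensor Number: 98
Raw:
Formatted-Raw: 01 00 02
Record:2:
Record Id: 2
When: 2024-01-01T11:20:05
   Event Type: 111 (Unknown)
   SEL Type: 2 (System Event)
   Message: Deassert + Power Supply Presence detected
Sensor Number: 98
Raw:
Formatted-Raw: 02 00 02
"""

window = events_between(
    parse_sel(SAMPLE),
    datetime(2024, 1, 1, 11, 0, 0),
    datetime(2024, 1, 1, 11, 47, 30),
)
for record in window:
    print(record["When"], record.get("Message", ""))
```

The start and end of the window would typically be the last heartbeat entry and the "Host has booted" entry from vmksummary.log. Interpretation of the selected events still belongs with the hardware vendor.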