ESXi Host Not Responding: Unexpected Reboot or Hang Due to Hardware Failure



Article ID: 376723


Products

VMware vSphere ESXi

Issue/Introduction

  • The vSphere Client shows the ESXi host as "Not Responding" for a while, after which the host returns to the connected state without intervention.
  • In some cases, the ESXi host or blade is found in a hung state and requires a forced reboot or power cycle.
  • The host uptime on the vSphere Client Summary page indicates that the server has gone through an OS reboot.
  • Following a host reboot, the host hangs persistently during the hardware initialization stage and fails to boot into the ESXi OS.

Configuring and testing memory ..
Configuring platform hardware ...

  • The host's BMC shows the error "IERR: Sensor Failure Asserted", indicating an internal error caused by a hardware component failure.

Severity: Critical
Affected object: sys/rack-unit-##/health-led
Reason: IERR: Sensor Failure Asserted;

  • If the host can be booted, the System Event Log (SEL) confirms the error state with entries similar to the following:

# esxcli hardware ipmi sel list

Record:1:
   Record Id: ##
   When: YYYY-MM-DDTHH:MM
   Event Type: 111 (Unknown)
   SEL Type: 2 (System Event)
   Message: Assert + Processor IERR
   Sensor Number: ###
   Raw:
   Formatted-Raw: 01 00 02 ## ## ## ## ## ## ## ## ## ## ## ## f1

OR

Record:13:
   Record Id: ##
   When: YYYY-MM-DDTHH:MM
   Event Type: 4 (Minor)
   SEL Type: 2 (System Event)
   Message: Assert + Processor Predictive Failure Asserted
   Sensor Number: ###
   Raw:
   Formatted-Raw: 0d 00 02 ## ## ## ## ## ## ## ## ## ## ## ## ff

Record:15:
   Record Id: 15
   When: YYYY-MM-DDTHH:MM
   Event Type: 4 (Minor)
   SEL Type: 2 (System Event)
   Message: Assert + Processor Predictive Failure Asserted
   Sensor Number: ###
   Raw:
   Formatted-Raw: 0f 00 02 ## ## ## ## ## ## ## ## ## ## ## ## ff

  • Mapping the sensor numbers shows that the sensors relate to one of CATERR_N, IERR, or MCERR.
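
When the SEL contains many entries, it can help to filter it for processor-related records before contacting the vendor. The following is a minimal sketch, assuming the output of esxcli hardware ipmi sel list has been saved as text and follows the layout shown in the samples above. The function names, the marker list, and the third sample record are illustrative assumptions, not part of any VMware tool, and the marker list is not exhaustive.

```python
import re

# Message substrings treated as processor-fault indicators. These are taken
# from the sample entries in this article and are NOT an exhaustive list.
PROCESSOR_MARKERS = ("Processor IERR", "Processor Predictive Failure")

# Sample text in the layout shown above. The third record is a hypothetical
# placeholder added only to demonstrate filtering.
SAMPLE = """\
Record:1:
   Record Id: ##
   When: YYYY-MM-DDTHH:MM
   Message: Assert + Processor IERR
   Sensor Number: ###

Record:13:
   Record Id: ##
   When: YYYY-MM-DDTHH:MM
   Message: Assert + Processor Predictive Failure Asserted
   Sensor Number: ###

Record:20:
   Record Id: ##
   When: YYYY-MM-DDTHH:MM
   Message: (placeholder for an unrelated event)
   Sensor Number: ###
"""


def parse_sel(text):
    """Split saved SEL output into one dict per 'Record:N:' block."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        header = re.fullmatch(r"Record:(\d+):", line)
        if header:
            # Start a new record; remember its list position number.
            current = {"Record": header.group(1)}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


def processor_events(records):
    """Return only the records whose Message contains a processor marker."""
    return [
        r for r in records
        if any(marker in r.get("Message", "") for marker in PROCESSOR_MARKERS)
    ]


if __name__ == "__main__":
    for rec in processor_events(parse_sel(SAMPLE)):
        print(rec["Record"], rec.get("Sensor Number", ""), rec["Message"])
```

The sensor numbers printed for the matching records are the ones to look up in the Sensor Data Records.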

Environment

  • VMware vSphere ESXi 8.x
  • VMware vSphere ESXi 9.x

Cause

CATERR_N, IERR, and MCERR are critical hardware alerts generated by a server's processor when it detects unrecoverable internal errors. These events typically indicate a physical hardware fault.

Resolution

Contact the hardware vendor for further diagnostics and resolution.

Additional Information

Steps for reviewing System Event Logs (SEL):

  1. To view the System Event Log (SEL), run the command esxcli hardware ipmi sel list from an SSH session on the ESXi host.
  2. Alternatively, review the events reported in the vSphere Client by following these steps:
    1. Select a host in the vSphere Client navigator.
    2. Click the Monitor tab, and then click Hardware Health.
    3. Click SYSTEM EVENT LOG.

Mapping the sensor numbers reported in the System Event Log:

    • Run the command esxcli hardware ipmi sdr list from an SSH session on the ESXi host to list the Sensor Data Records (SDR), which provide the sensor mapping.
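
Because the resolution is to involve the hardware vendor, it may be convenient to capture both command outputs in one pass. The sketch below assumes a Python 3 interpreter is available in the session where the two esxcli commands named above can be run. The function name collect_ipmi_logs and its run parameter are hypothetical and exist only for this illustration.

```python
import subprocess

# The two commands named in this article, as argument lists.
COMMANDS = {
    "sel": ["esxcli", "hardware", "ipmi", "sel", "list"],
    "sdr": ["esxcli", "hardware", "ipmi", "sdr", "list"],
}


def collect_ipmi_logs(run=subprocess.run):
    """Run each command and return its text output keyed by 'sel' / 'sdr'.

    The `run` parameter defaults to subprocess.run; it is exposed only so the
    function can be exercised without an ESXi host (an assumption of this sketch).
    """
    outputs = {}
    for name, argv in COMMANDS.items():
        result = run(argv, capture_output=True, text=True, check=False)
        if result.returncode != 0:
            # Surface the failure instead of silently saving empty output.
            raise RuntimeError(f"{' '.join(argv)} failed: {result.stderr}")
        outputs[name] = result.stdout
    return outputs


if __name__ == "__main__":
    # Print each captured output under a simple header for copy/paste.
    for name, text in collect_ipmi_logs().items():
        print(f"===== {name} =====")
        print(text)
```

The captured text can then be attached to the vendor support case alongside the BMC error details.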