Host fails with PSOD referencing error "I/O error reported by PCI"

Article ID: 394465

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • The PSOD screen may reference "Uncorrectable/unrecoverable machine check errors" and "Machine Check Exception".
  • A VMkernel crash dump may not be available if the error is caused by a storage I/O device.
  • Entries similar to the following can be seen on the PSOD screen or in the crash dump/logs, if available:
    YYYY-MM-DDTHH:MM:SS.114Z cpu58:40848027)IDT: 1895: Uncorrectable/unrecoverable machine check error
    YYYY-MM-DDTHH:MM:SS.114Z cpu58:40848027)MCA: 208: UC Excp G4 86 Sbb00002000000e0b AB M180008 P8/8 I/O error reported by PCI 0000:00:03.0.
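
The MCA entry above names the PCI device that reported the error. As a minimal sketch, assuming the log line follows the same format as the sample, the address can be extracted with a short Python script. The function name extract_pci_address and the regular expression are illustrative assumptions and are not part of any VMware tooling.

```python
import re

# Matches a PCI address in segment:bus:device.function form (e.g. 0000:00:03.0)
# that follows the text "reported by PCI", as in the sample MCA entry.
PCI_RE = re.compile(
    r"reported by PCI\s+"
    r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)


def extract_pci_address(line):
    """Return (segment, bus, device, function) as strings, or None if absent."""
    match = PCI_RE.search(line)
    return match.groups() if match else None


# Sample line copied from the entries above (timestamp left as a placeholder).
sample = (
    "YYYY-MM-DDTHH:MM:SS.114Z cpu58:40848027)MCA: 208: UC Excp G4 86 "
    "Sbb00002000000e0b AB M180008 P8/8 I/O error reported by PCI 0000:00:03.0."
)

print(extract_pci_address(sample))
```

The extracted address can then be compared against the host's PCI device inventory (for example, the output of esxcli hardware pci list) to identify the device or the bridge it sits behind.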

Environment

7.x
8.x

Cause

When hardware encounters a critical/fatal error, the CPU raises a machine check exception (MCE). Because machine check exceptions are considered fatal and unrecoverable, the ESXi host is expected to crash with a PSOD.

System event log (IPMI SEL) entries, which can be retrieved using the command esxcli hardware ipmi sel list, can help to confirm the cause.

Record:18318:
 When: 2025-03-14T10:54:53
 Event Type: 4 (Minor)
 SEL Type: 2 (System Event)
 Message: Assert + Processor Predictive Failure Asserted
 Sensor Number: 80
Record:18321:
 When: 2025-03-14T10:54:53
 Event Type: 4 (Minor)
 SEL Type: 2 (System Event)
 Message: Assert + Processor Predictive Failure Asserted
 Sensor Number: 121

On this sample hardware, Sensor Number 80 is marked as Processor 1 P_CATERR and Sensor Number 121 is marked as Processor 1 IERR. The IERR suggests the error was caused by an I/O device connected to the system board.

Note: Sensor numbers can differ based on hardware vendor/model and BIOS. Check the Sensor Data Records using the command esxcli hardware ipmi sdr list to map the sensor numbers mentioned in the error records.
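
When the event log is long, it can help to group the records by message and collect the sensor numbers to look up. The following is a minimal Python sketch, assuming the output of esxcli hardware ipmi sel list has been saved as text in the same "Record:N:" / "Key: Value" layout as the sample above. The function names parse_sel_records and sensors_by_message are hypothetical helpers, and the actual output layout may differ by ESXi version and hardware.

```python
def parse_sel_records(text):
    """Parse 'Record:N:' blocks of 'Key: Value' lines into a list of dicts."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Record:"):
            # 'Record:18318:' -> record id '18318'
            current = {"Record": line.split(":")[1]}
            records.append(current)
        elif current is not None and ": " in line:
            # Split on the first ': ' only, so timestamps stay intact.
            key, value = line.split(": ", 1)
            current[key] = value
    return records


def sensors_by_message(records):
    """Map each message text to the sorted sensor numbers that reported it."""
    grouped = {}
    for rec in records:
        if "Sensor Number" not in rec:
            continue  # skip records without a sensor number
        message = rec.get("Message", "")
        grouped.setdefault(message, set()).add(int(rec["Sensor Number"]))
    return {msg: sorted(nums) for msg, nums in grouped.items()}


# Sample text copied from the records shown above.
SAMPLE = """\
Record:18318:
 When: 2025-03-14T10:54:53
 Event Type: 4 (Minor)
 SEL Type: 2 (System Event)
 Message: Assert + Processor Predictive Failure Asserted
 Sensor Number: 80
Record:18321:
 When: 2025-03-14T10:54:53
 Event Type: 4 (Minor)
 SEL Type: 2 (System Event)
 Message: Assert + Processor Predictive Failure Asserted
 Sensor Number: 121
"""

print(sensors_by_message(parse_sel_records(SAMPLE)))
```

The resulting sensor numbers (80 and 121 in this sample) are the ones to look up in the Sensor Data Records.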

Resolution

Note: Take a screenshot of the server console before restarting the server.

  • A reboot of the server can help to restore the server status in case of transient errors.

  • Engage the hardware vendor for a detailed investigation into the source of the unrecoverable errors related to the I/O device.

Additional Information