Hardware faults on ESXi hosts

Article ID: 336323

Products

VMware vSphere ESXi

Issue/Introduction

When a hardware error occurs, the host generates an alert and reports the hardware problem on the hardware monitor tab. However, the alert is displayed only while the error is actively being detected, and it sometimes clears. A cleared alert does not mean that the hardware fault has stopped occurring, only that the indications of the fault have stopped. When a fault does occur, even if it appears only transiently, the host records it in the hardware log and the CIM diagnostics log. These logs are useful when troubleshooting unexplained problems with ESXi hosts and the virtual machines that reside on them.

Symptoms

You might experience the following behavior when a generic ESXi host fault occurs:
  • Erratic host behavior
  • Purple screen errors
  • Corrupt disk drives
  • Erratic virtual machine behavior

The host's hardware log or CIM log displays alerts

The following categories list the fault states that are severe enough to require action. An example log entry follows the list.

Processor Errors:
  • Processor IERR
  • Processor Thermal Trip
  • Processor Configuration Error
  • Processor Machine Check Exception
  • Processor Correctable Machine Check

Memory Errors:
  • Memory Configuration Error
  • Memory Uncorrectable ECC
  • Memory Transition to Critical
  • Memory Critical Overtemperature

Disk Errors:
  • Drive Slot In Critical Array
  • Drive Slot In Failed Array
  • Drive Bay in Critical Array
  • Drive Bay in Failed Array

Bus Errors:
  • PCI PERR
  • PCI SERR
  • Bus Correctable Error
  • Bus Uncorrectable Error
  • Bus Fatal Error
  • Add-in Card Install Error
  • Cable/Interconnect Transition to Critical from less severe
  • Slot/Connector Transition to Critical
  • Slot/Connector Transition to Non-critical

Fan Errors:
  • Fan Transition to Critical from less severe
  • Fan Transition to Off Line

Temperature Errors:
  • Temperature Lower Critical going low
  • Temperature Transition to Critical from less severe
  • Temperature Transition to Non-recoverable from less severe
  • Temperature Upper Critical going high

Voltage Errors:
  • Voltage Limit Exceeded
  • Voltage Transition to Critical from less severe
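
When reviewing many log entries, it can help to match each entry's description against the states listed above. The following is a minimal sketch in Python, not a VMware tool: the `FAULT_STATES` table and the `classify` function are hypothetical names introduced for illustration, and the sketch assumes the description text contains one of the listed state names verbatim.

```python
# Fault states copied from the category lists above.
# FAULT_STATES and classify() are illustrative names, not part of any VMware API.
FAULT_STATES = {
    "Processor": [
        "Processor IERR",
        "Processor Thermal Trip",
        "Processor Configuration Error",
        "Processor Machine Check Exception",
        "Processor Correctable Machine Check",
    ],
    "Memory": [
        "Memory Configuration Error",
        "Memory Uncorrectable ECC",
        "Memory Transition to Critical",
        "Memory Critical Overtemperature",
    ],
    "Disk": [
        "Drive Slot In Critical Array",
        "Drive Slot In Failed Array",
        "Drive Bay in Critical Array",
        "Drive Bay in Failed Array",
    ],
    "Bus": [
        "PCI PERR",
        "PCI SERR",
        "Bus Correctable Error",
        "Bus Uncorrectable Error",
        "Bus Fatal Error",
        "Add-in Card Install Error",
        "Cable/Interconnect Transition to Critical from less severe",
        "Slot/Connector Transition to Critical",
        "Slot/Connector Transition to Non-critical",
    ],
    "Fan": [
        "Fan Transition to Critical from less severe",
        "Fan Transition to Off Line",
    ],
    "Temperature": [
        "Temperature Lower Critical going low",
        "Temperature Transition to Critical from less severe",
        "Temperature Transition to Non-recoverable from less severe",
        "Temperature Upper Critical going high",
    ],
    "Voltage": [
        "Voltage Limit Exceeded",
        "Voltage Transition to Critical from less severe",
    ],
}


def classify(description):
    """Return (category, state) if the description names a listed state, else None."""
    text = description.lower()
    for category, states in FAULT_STATES.items():
        for state in states:
            # Case-insensitive substring match, so prefixes such as "Assert + " are ignored.
            if state.lower() in text:
                return category, state
    return None


if __name__ == "__main__":
    print(classify("Assert + Voltage Transition to Critical from less severe"))
```

A description that matches none of the listed states returns `None`, which does not by itself mean the entry is harmless.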

Example

The following is an example of what the CIM diagnostic log might display:

OMC_IpmiLogRecord.CreationClassName="OMC_IpmiLogRecord",LogCreationClassName="OMC_IpmiRecordLog",LogName="IPMI SEL",MessageTimestamp="20121205114249.000000+000",RecordID="1"
RecordID = 1
MessageTimestamp = (NULL)
LogName = IPMI SEL
LogCreationClassName = OMC_IpmiRecordLog
CreationClassName = OMC_IpmiLogRecord
RecordFormat = *string CIM_Sensor.DeviceID*uint8[2] IPMI_RecordID*uint8 IPMI_RecordType*uint8[4] IPMI_Timestamp*uint8[2] IPMI_GeneratorID*uint8 IPMI_EvMRev*uint8 IPMI_SensorType*uint8 IPMI_SensorNumber*boolean IPMI_AssertionEvent*uint8 IPMI_EventType*uint8 IPMI_EventData1*uint8 IPMI_EventData2*uint8 IPMI_EventData3*uint32 IANA*
RecordData = *114.0.32*1 0*2*57 51 191 80*32 0*4*16*114*false*111*2*255*255*1*
ElementName = IPMI SEL
Description = Assert + Voltage Transition to Critical from less severe
Caption = Assert + Voltage Transition to Critical from less severe
PerceivedSeverity = (NULL)
Locale = (NULL)
InstanceID = (NULL)
DataFormat = (NULL)
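
The RecordFormat and RecordData values in the example above use the same asterisk-delimited layout, so the data can be read field by field. The following Python sketch pairs each field name with its value. It assumes that every RecordFormat entry is a type token followed by a field name and that array values are space-separated numbers. The function names `parse_record` and `sel_timestamp` are hypothetical and introduced only for illustration.

```python
from datetime import datetime, timezone

# Values copied from the example log entry above.
RECORD_FORMAT = (
    "*string CIM_Sensor.DeviceID*uint8[2] IPMI_RecordID*uint8 IPMI_RecordType"
    "*uint8[4] IPMI_Timestamp*uint8[2] IPMI_GeneratorID*uint8 IPMI_EvMRev"
    "*uint8 IPMI_SensorType*uint8 IPMI_SensorNumber*boolean IPMI_AssertionEvent"
    "*uint8 IPMI_EventType*uint8 IPMI_EventData1*uint8 IPMI_EventData2"
    "*uint8 IPMI_EventData3*uint32 IANA*"
)
RECORD_DATA = "*114.0.32*1 0*2*57 51 191 80*32 0*4*16*114*false*111*2*255*255*1*"


def parse_record(record_format, record_data):
    """Pair each RecordFormat field name with its RecordData value."""
    specs = record_format.strip("*").split("*")
    values = record_data.strip("*").split("*")
    if len(specs) != len(values):
        raise ValueError("RecordFormat and RecordData have different field counts")

    record = {}
    for spec, raw in zip(specs, values):
        type_name, field_name = spec.split(" ", 1)
        if type_name == "string":
            record[field_name] = raw
        elif type_name == "boolean":
            record[field_name] = raw.lower() == "true"
        elif "[" in type_name:
            # Array types such as uint8[4] hold space-separated numbers.
            record[field_name] = [int(part) for part in raw.split()]
        else:
            record[field_name] = int(raw)
    return record


def sel_timestamp(octets):
    """Decode a 4-byte IPMI SEL timestamp (least significant byte first,
    seconds since 1970-01-01) into a UTC datetime."""
    seconds = int.from_bytes(bytes(octets), "little")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    record = parse_record(RECORD_FORMAT, RECORD_DATA)
    for name, value in record.items():
        print(f"{name} = {value}")
    print("Decoded timestamp:", sel_timestamp(record["IPMI_Timestamp"]).isoformat())
```

For this entry, the decoded IPMI_Timestamp is 2012-12-05 11:42:49 UTC, which agrees with the MessageTimestamp value in the first line of the example.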


Environment

VMware vSphere ESXi 5.1
VMware vSphere ESXi 5.5
VMware ESXi 3.5.x Installable
VMware ESXi 4.1.x Installable
VMware vSphere ESXi 5.0
VMware ESXi 4.0.x Installable
VMware ESX 4.1.x
VMware vSphere ESXi 6.0
VMware ESX Server 3.5.x
VMware vSphere ESXi 6.5
VMware ESX 4.0.x
VMware ESX 7.x

Resolution

Contact your hardware vendor for further troubleshooting and assistance.

The Intelligent Platform Management Interface (IPMI) defines standards for monitoring and controlling system subsystems. These standards cover elements such as temperatures, voltages, fans, bus errors, and memory. IPMI provides a variety of alarm mechanisms that trigger when a component exceeds its tolerance levels. However, an active alarm, for example one for a processor error, is displayed only while the error is active. The purpose of the logging mechanism is to show whether an error occurred in the past, which can indicate that the host is still experiencing fault conditions even though it is not currently reporting them. This generally warrants more detailed investigation with the hardware vendor.