Uncorrectable Memory Failure
Operational impact
The server restarts with the affected DIMM disabled and can immediately return to production with the remaining memory. If the remaining memory is insufficient for production, replace the DIMM immediately; otherwise, replace it at the next maintenance opportunity.
Indications at the time of the failure
The system-error LED, MEM LED (in a server with a light path diagnostics panel), and the affected DIMM connector error LED are lit. An Uncorrectable ECC Error platform event is logged in the system-event log.
Possible root causes
Uncorrectable memory ECC error (data line), DIMM address parity error, damaged DIMM connector, damaged processor or socket.
Suggested corrective action
Replace the DIMM at the next maintenance opportunity. If the problem persists, follow the memory problem determination procedures to isolate a potentially failing part.
Correctable Memory Failure (Predictive Failure Analysis alert)
Operational impact
The server continues to operate, possibly with degraded performance. This can happen, for example, when a DIMM has a defective or open data line.
Indications at the time of the failure
The system-error LED, MEM LED (in a server with a light path diagnostics panel), and the affected DIMM connector error LED are lit. A Correctable ECC Error Rate Exceeded platform event is logged in the system-event log.
Possible root causes
The most likely root cause is a failing DIMM. A less likely root cause is spurious noise caused by power rail regulation or another physical anomaly.
Suggested corrective action
Check your hardware vendor website for firmware updates and RETAIN tips that pertain to memory Predictive Failure Analysis alerts. Replace the DIMM at the next maintenance opportunity, because the DIMM might be failing and could cause unscheduled downtime. Follow the memory problem determination procedures to isolate a potentially failing part.
For more information and a procedure to run a full hardware diagnostic to locate the issue, contact your hardware vendor.
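The two failure types above can be told apart by the platform event that is logged in the system-event log. The following Python sketch shows one way a monitoring script might triage a log line and look up the matching impact and corrective action. It is an illustration only: the function name `triage_event`, the `Triage` record, and the sample log lines are hypothetical, and the actual format of system-event log entries depends on the hardware vendor. Only the two event names come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# Event names as they appear in this document. Real log entries may carry
# extra fields (timestamps, DIMM slot numbers) in a vendor-specific format.
UNCORRECTABLE = "uncorrectable ecc error"
CORRECTABLE_RATE = "correctable ecc error rate exceeded"


@dataclass(frozen=True)
class Triage:
    failure_type: str
    server_state: str
    action: str


def triage_event(log_line: str) -> Optional[Triage]:
    """Map a system-event log line to the guidance in this section.

    Returns None when the line matches neither memory event.
    """
    text = log_line.lower()
    # Test for "uncorrectable" first, because the word "correctable"
    # is a substring of "uncorrectable".
    if UNCORRECTABLE in text:
        return Triage(
            failure_type="uncorrectable",
            server_state="restarted with the affected DIMM disabled",
            action="replace the DIMM; run memory problem determination "
                   "if the problem persists",
        )
    if CORRECTABLE_RATE in text:
        return Triage(
            failure_type="correctable (PFA alert)",
            server_state="still running, possibly with degraded performance",
            action="check for firmware updates, then replace the DIMM "
                   "at the next maintenance opportunity",
        )
    return None


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    samples = [
        "DIMM 3: Uncorrectable ECC Error",
        "DIMM 7: Correctable ECC Error Rate Exceeded",
        "Fan 2: speed below threshold",
    ]
    for line in samples:
        print(line, "->", triage_event(line))
```

A substring match is used here only to keep the sketch short. A production script would more likely parse the structured fields that the vendor's management tooling exposes.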