An ESXi host may become unresponsive for an extended period without experiencing a purple screen of death (PSOD) or rebooting. The host eventually resumes normal operation without intervention.
Environment:
- VMware ESXi 7.0 or newer
- Sample log entries indicating the issue:
1. Memory Controller Errors found in /var/log/vmkernel.log, such as:
2024-07-17T23:04:17.991Z cpu22:3361787)MCA: 209: CE Poll G0 B8 S9c00004001010091 A6619f377c0 M200401c089801086 P6619f377c0/40 Memory Controller Read Error on Channel 1.
2. hostd Service Hang found in /var/log/vmkernel.log, such as:
2024-07-17T22:24:35.571Z cpu6:2100050)ALERT: hostd detected to be non-responsive
3. Storage I/O Issues found in /var/log/vmkernel.log, such as:
2024-07-17T23:01:34.001Z cpu3:2097735)WARNING: ScsiDeviceIO: 1513: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx performance has deteriorated. I/O latency increased from average value of 1004 microseconds to 107588 microseconds.
2024-07-17T22:37:41.167Z cpu2:2097201)ScsiDeviceIO: 4176: Cmd(0x45b96d402508) 0x89, CmdSN 0x1578666 from world 3361402 to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0
(the issue is more strongly indicated when I/O latency exceeds 1,000,000 microseconds and sense data 0xe 0x1d 0x0 appears multiple times within a single second)
4. Storage I/O Issues found in /var/log/hostd.log, such as:
2024-07-17T22:23:21.493Z warning hostd[2098944] [Originator@6876 sub=IoTracker] In thread 2099376, fopen("/vmfs/volumes/xxxxxxxx-xxxxxxx-xxxx-xxxxxxxxxxxx/vm-name/vm-name.vmx") took over 36 sec.
5. VMFS Volume Access Issue:
Volume access related events - Last 10 days
UUID                                Volume                    Access Lost              Access Recovered
----------------------------------- ------------------------- ------------------------ ------------------------
63a1eecc-42195f22-b269-0025b513a03b datastore-name            2024-07-17T22:21:37.520Z 2024-07-17T22:23:06.553Z
These log entries illustrate the sequence of events: memory controller errors, followed by storage I/O issues and VMFS volume access problems, ultimately leading to the hostd service becoming unresponsive.
Memory controller errors can trigger a cascade of storage I/O issues, ultimately causing the hostd service to hang. This occurs because storage I/O operations depend on data moving through memory. As the hostd service attempts to manage an increasing number of pending operations due to I/O errors, it quickly exhausts its available heap memory, leading to unresponsiveness.
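As a triage aid, the following is a minimal sketch (assuming Python 3 and a locally readable copy of vmkernel.log; the regular expressions and the 1,000,000-microsecond threshold simply mirror the samples above and are illustrative, not official values) that counts how often each of these signatures appears in the log:

#!/usr/bin/env python3
"""Minimal sketch: scan a copy of vmkernel.log for the signatures described above.
The patterns and thresholds are assumptions based on the sample log lines, not
official VMware values; adjust them to your environment."""
import re
import sys
from collections import Counter

MCA_RE     = re.compile(r"MCA: \d+: CE Poll .* Memory Controller .* Error")
HOSTD_RE   = re.compile(r"ALERT: hostd detected to be non-responsive")
LATENCY_RE = re.compile(r"performance has deteriorated.*to (\d+) microseconds")
SENSE_RE   = re.compile(r"Valid sense data: 0xe 0x1d 0x0")
TS_RE      = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")  # second granularity

LATENCY_THRESHOLD_US = 1_000_000  # "very high" latency per the note above

def scan(path):
    mca = hostd = high_latency = 0
    sense_per_second = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            if MCA_RE.search(line):
                mca += 1
            if HOSTD_RE.search(line):
                hostd += 1
            m = LATENCY_RE.search(line)
            if m and int(m.group(1)) >= LATENCY_THRESHOLD_US:
                high_latency += 1
            if SENSE_RE.search(line):
                ts = TS_RE.match(line)
                if ts:
                    sense_per_second[ts.group(1)] += 1

    sense_bursts = sum(1 for n in sense_per_second.values() if n > 1)
    print(f"MCA correctable memory errors                  : {mca}")
    print(f"hostd non-responsive alerts                    : {hostd}")
    print(f"I/O latency warnings >= {LATENCY_THRESHOLD_US} microseconds : {high_latency}")
    print(f"Seconds with repeated 0xe 0x1d 0x0 sense data  : {sense_bursts}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log")

If all four counters are non-zero over the same time window, the cascade described in this article is a likely match and the resolution steps below apply.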
Resolution:
1. Run memory diagnostics:
a. Boot the affected host into a memory testing utility (e.g., the server manufacturer's diagnostic tool or Memtest86+).
b. Run the memory test for at least 24 hours.
c. If errors are detected, contact your hardware vendor for further assistance or potential hardware replacement.
2. Monitor the host:
a. After any remediation (for example, hardware replacement or firmware updates) and after confirming stable operation, closely monitor the host for any signs of similar issues.
b. Pay particular attention to storage-related alerts or performance degradation; a log-following sketch that can assist with this is shown after this list.
3. If issues persist:
a. Collect a new set of ESXi host logs and review for recurring memory or storage errors.
b. Consider engaging your hardware vendor for a more thorough hardware diagnostic if memory tests pass but issues continue.
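To support the monitoring in step 2, the sketch below (again assuming Python 3 is available wherever vmkernel.log can be read; the patterns and the five-second poll interval are assumptions, not official values) follows the log and prints a line each time one of the known signatures reappears:

#!/usr/bin/env python3
"""Minimal monitoring sketch for step 2: follow vmkernel.log and flag new
memory-controller, storage-latency, or hostd-responsiveness events. Patterns
and poll interval are assumptions based on the sample log lines above."""
import re
import time

LOG_PATH = "/var/log/vmkernel.log"  # adjust if logging is redirected
PATTERNS = {
    "memory controller error": re.compile(r"MCA: \d+: CE Poll"),
    "I/O latency deteriorated": re.compile(r"performance has deteriorated"),
    "hostd non-responsive":     re.compile(r"hostd detected to be non-responsive"),
}

def follow(path, interval=5.0):
    """Yield lines appended to the file, polling every `interval` seconds."""
    with open(path, errors="replace") as log:
        log.seek(0, 2)  # start at the current end of the file
        while True:
            line = log.readline()
            if not line:
                time.sleep(interval)
                continue
            yield line

if __name__ == "__main__":
    for line in follow(LOG_PATH):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                print(f"[monitor] {label}: {line.strip()}")

Stop the script once the monitoring window has passed; recurring matches after remediation are a signal to proceed to step 3.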