ESXi Host becomes unresponsive or crashes due to Memory Controller Errors
Article ID: 373203

Products

VMware vSphere ESXi

Issue/Introduction

In a scenario where the hardware encounters a large number of corrected memory errors, an ESXi host may exhibit the following symptoms:

  • The host may become unresponsive for an extended period, or may reboot after the error message
  • Increased storage latency, leading to intermittent storage access issues
  • Intermittent Guest OS freezes
  • The server may experience a PSOD (Purple Screen of Death) in cases of:
    • Uncorrectable memory errors
    • Prolonged CPU lockups
Frequent Correctable Errors: A high volume of correctable memory errors floods the CPU with interrupts (CMCI/MCE) and System Management Interrupts (SMIs). This overwhelms the processor and ties up the BIOS in System Management Mode (SMM), leading to a CPU lockup and subsequent PSOD.
Infrequent Correctable Errors: The system actively handles intermittent correctable errors without experiencing CPU lockup or impacting system stability.
Uncorrectable Errors: A single fatal, uncorrectable memory error will immediately trigger a PSOD.
  • In the absence of a PSOD, the host may eventually resume normal operation without intervention. Examples:
    • The memory controller successfully completes the error correction phase (correctable errors / CE) without encountering an uncorrectable fault.
    • The CPU cycle consumption during the correction phase subsides before a definitive kernel panic state, such as a temporary CPU lockup, is triggered. The host can recover on its own without manual intervention.


Sample log entries indicating the issue:

  1. Memory Controller Errors found in /var/log/vmkernel.log, such as:
      YYYY-MM-DDTHH:MM:SS cpu##:##)MCA: ##: CE Poll G0 B8 S9c0000####0091 A6619##### M200401c0898##### P6619f37#####/40 Memory Controller Read Error on Channel 1.

  2. hostd Service Hang found in /var/log/vmkernel.log, such as:
      YYYY-MM-DDTHH:MM:SS cpu#:####)ALERT: hostd detected to be non-responsive

  3. Storage I/O issues found in /var/log/vmkernel.log, such as:
      YYYY-MM-DDTHH:MM:SS cpu#:####)WARNING: ScsiDeviceIO: 1513: Device naa.############### performance has deteriorated. I/O latency increased from average value of 1234 microseconds to 123456 microseconds.
      YYYY-MM-DDTHH:MM:SS cpu#:####)ScsiDeviceIO: 4176: Cmd(0x45b96d402508) 0x89, CmdSN 0x1578666 from world 3361402 to dev "naa.#################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0

  4. Storage I/O Issues found in /var/log/hostd.log, such as:
      YYYY-MM-DDTHH:MM:SS warning hostd[####] [Originator@6876 sub=IoTracker] In thread 2099376, fopen("/vmfs/volumes/#####-#####-####-###########/vm-name/vm-name.vmx") took over 1236 sec.

  5. EFI + VMB messages soon after the controller read error:

   YYYY-MM-DDTHH:MM:SS ************** vmkernel: cpu53:***** )MCA: 202: CE Intr G0 B19 S************ Ad********** M************ P**********/40 Memory Controller Read Error on Channel 0.
   YYYY-MM-DDTHH:MM:SS ************** vmkernel: VMB: 65: Reserved * MPNs starting @ ****
   YYYY-MM-DDTHH:MM:SS ************** vmkernel: EFI: 250: 64-bit EFI v2.80 revision ******
   YYYY-MM-DDTHH:MM:SS ************** vmkernel: VMB_SERIAL: 170: Serial port configuration obtained from firmware.
   YYYY-MM-DDTHH:MM:SS ************** vmkernel: VMB: 79: TDX: Unsupported on CPU (MSR_MTRRCAP = ******)

These log entries demonstrate the sequence of events: memory controller errors, followed by storage I/O issues, leading to the hostd service becoming unresponsive, and VMFS volume access problems.
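The signatures above can be checked for quickly with standard tools. The sketch below is illustrative only: it builds a small sample file at a hypothetical path (`/tmp/vmkernel-sample.log`) with fabricated entries and counts the key patterns; on a live host, point the same greps at `/var/log/vmkernel.log`.

```shell
# Illustrative triage sketch: count the key log signatures from this article.
# The sample file and its contents are fabricated for demonstration;
# substitute /var/log/vmkernel.log on an affected host.
LOG=/tmp/vmkernel-sample.log
cat > "$LOG" <<'EOF'
2024-01-01T00:00:01 cpu1:100)MCA: 12: CE Poll G0 B8 Memory Controller Read Error on Channel 1.
2024-01-01T00:00:05 cpu1:100)WARNING: ScsiDeviceIO: 1513: Device naa.000 performance has deteriorated.
2024-01-01T00:00:09 cpu1:100)ALERT: hostd detected to be non-responsive
EOF

echo "Memory controller errors : $(grep -c 'Memory Controller Read Error' "$LOG")"
echo "Storage latency warnings : $(grep -c 'performance has deteriorated' "$LOG")"
echo "hostd hang alerts        : $(grep -c 'hostd detected to be non-responsive' "$LOG")"
```

A host showing all three signatures in sequence matches the failure pattern described in this article.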

Environment

  • VMware vSphere ESXi 7.X
  • VMware vSphere ESXi 8.X

Cause

Memory controller errors can keep the CPU busy during the error correction phase. This in turn triggers a series of CPU lockups and storage I/O issues, ultimately causing the hostd service to hang.

Resolution

Engage the hardware vendor for further diagnostics and resolution.
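When engaging the vendor, a per-channel breakdown of the CE events can help pinpoint the suspect channel or DIMM. A minimal sketch, again using a fabricated sample file (`/tmp/vmkernel-sample2.log`) in place of `/var/log/vmkernel.log`:

```shell
# Illustrative sketch: tally memory controller read errors per channel.
# Sample data is fabricated; run the final grep against /var/log/vmkernel.log.
LOG=/tmp/vmkernel-sample2.log
cat > "$LOG" <<'EOF'
2024-01-01T00:00:01 cpu1:100)MCA: CE Poll Memory Controller Read Error on Channel 1.
2024-01-01T00:00:02 cpu1:100)MCA: CE Poll Memory Controller Read Error on Channel 1.
2024-01-01T00:00:03 cpu1:100)MCA: CE Intr Memory Controller Read Error on Channel 0.
EOF

# Counts per channel; the channel with the highest count is the likely suspect.
grep -o 'Memory Controller Read Error on Channel [0-9]*' "$LOG" | sort | uniq -c
```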