ESXi Host becomes unresponsive or crashes due to Memory Controller Errors

Article ID: 373203

Products

VMware vSphere ESXi

Issue/Introduction

When the hardware encounters a large number of corrected memory errors, the ESXi host may exhibit the following symptoms:

  • The host may become unresponsive for an extended period, or reboot, after the error message is logged.
  • Increased storage latency, leading to intermittent storage access issues
  • Intermittent guest OS freezes
  • The host may experience a PSOD in cases of:
    • Uncorrected memory errors
    • Prolonged CPU lockups
  • In the absence of a PSOD, the host may eventually resume normal operation without intervention.

Sample log entries indicating the issue:

  1. Memory Controller Errors found in /var/log/vmkernel.log, such as:
      YYYY-MM-DDTHH:MM:SS cpu##:##)MCA: ##: CE Poll G0 B8 S9c0000####0091 A6619##### M200401c0898##### P6619f37#####/40 Memory Controller Read Error on Channel 1.

  2. hostd service hang reported in /var/log/vmkernel.log, such as:
      YYYY-MM-DDTHH:MM:SS cpu#:####)ALERT: hostd detected to be non-responsive

  3. Storage I/O issues found in /var/log/vmkernel.log, such as:
    YYYY-MM-DDTHH:MM:SS cpu#:####)WARNING: ScsiDeviceIO: 1513: Device naa.############### performance has deteriorated. I/O latency increased from average value of 1234 microseconds to 123456 microseconds.
    YYYY-MM-DDTHH:MM:SS cpu#:####)ScsiDeviceIO: 4176: Cmd(0x45b96d402508) 0x89, CmdSN 0x1578666 from world 3361402 to dev "naa.#################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0

  4. Storage I/O issues found in /var/log/hostd.log, such as:
      YYYY-MM-DDTHH:MM:SS warning hostd[####] [Originator@6876 sub=IoTracker] In thread 2099376, fopen("/vmfs/volumes/#####-#####-####-###########/vm-name/vm-name.vmx") took over 1236 sec.

  5. EFI and VMB boot-time messages logged soon after the memory controller read error, consistent with the host having rebooted:

   YYYY-MM-DDTHH:MM:SS ************** vmkernel: cpu53:***** )MCA: 202: CE Intr G0 B19 S************ Ad********** M************ P**********/40 Memory Controller Read Error on Channel 0.
   YYYY-MM-DDTHH:MM:SS ************** vmkernel: VMB: 65: Reserved * MPNs starting @ ****
   YYYY-MM-DDTHH:MM:SS ************** vmkernel: EFI: 250: 64-bit EFI v2.80 revision ******
   YYYY-MM-DDTHH:MM:SS ************** vmkernel: VMB_SERIAL: 170: Serial port configuration obtained from firmware.
   YYYY-MM-DDTHH:MM:SS ************** vmkernel: VMB: 79: TDX: Unsupported on CPU (MSR_MTRRCAP = ******)

These log entries show the sequence of events: memory controller errors are followed by storage I/O issues, which lead to VMFS volume access problems and to the hostd service becoming unresponsive.
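
To check a collected log for the same sequence, a small script can classify each line and report when each symptom first appears. The following is a minimal sketch, not a VMware tool: the category names and the functions classify and summarize are hypothetical, and the patterns are derived only from the sample entries above, so they may need adjusting for other message variants.

```python
import re
import sys

# Hypothetical category names; patterns are based only on the sample
# log entries shown in this article.
PATTERNS = {
    "memory_controller_error": re.compile(
        r"MCA: .*Memory Controller Read Error on Channel \d+"),
    "hostd_unresponsive": re.compile(
        r"hostd detected to be non-responsive"),
    "storage_latency": re.compile(
        r"ScsiDeviceIO: .*performance has deteriorated"),
    "scsi_command_failure": re.compile(
        r"ScsiDeviceIO: .*failed H:0x[0-9a-f]+ D:0x[0-9a-f]+ P:0x[0-9a-f]+"),
    "slow_file_open": re.compile(
        r"sub=IoTracker\].*took over \d+ sec"),
}


def classify(line):
    """Return the first matching category name, or None if nothing matches."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Return {category: (count, index_of_first_matching_line)}."""
    summary = {}
    for index, line in enumerate(lines):
        category = classify(line)
        if category is None:
            continue
        count, first = summary.get(category, (0, index))
        summary[category] = (count + 1, first)
    return summary


if __name__ == "__main__":
    # Each argument is a path to a copied log file, for example a copy of
    # vmkernel.log or hostd.log taken from the host.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            result = summarize(handle)
        # Print categories in the order they first appear in the file.
        for category, (count, first) in sorted(
                result.items(), key=lambda item: item[1][1]):
            print(f"{path}: {category}: {count} entries, "
                  f"first at line {first + 1}")
```

If memory_controller_error is the earliest category in vmkernel.log and the storage and hostd categories follow it, the log is consistent with the sequence described above. The script reports line order only and does not parse timestamps.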

Environment

  • VMware vSphere ESXi 7.X
  • VMware vSphere ESXi 8.X

Cause

Memory controller errors can keep the CPU busy during the error correction phase. This in turn triggers a series of CPU lockups and storage I/O issues, ultimately causing the hostd service to hang.

Resolution

Engage the hardware vendor for further diagnostics and resolution.
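
When opening the case with the hardware vendor, it can help to summarize how often each memory controller error occurs and on which channel. The sketch below is an illustration only: the function tally_mca is hypothetical, and the field meanings are assumptions based on the sample MCA lines in this article (the token after the line number, such as CE, is taken to be the error type, and the B field is taken to be the reporting bank). Confirm the interpretation with the vendor.

```python
import re
from collections import Counter

# Assumed layout, taken from the sample lines in this article:
#   MCA: <n>: <type> <source> G<n> B<n> ... Memory Controller Read Error on Channel <n>
MCA_PATTERN = re.compile(
    r"MCA: \d+: (?P<kind>\w+) (?P<source>\w+) G\d+ (?P<bank>B\d+) .*"
    r"Memory Controller Read Error on Channel (?P<channel>\d+)"
)


def tally_mca(lines):
    """Count memory controller read errors per (type, B field, channel)."""
    counts = Counter()
    for line in lines:
        match = MCA_PATTERN.search(line)
        if match:
            key = (match["kind"], match["bank"], match["channel"])
            counts[key] += 1
    return counts


def format_report(counts):
    """Return one summary line per key, most frequent first."""
    return [
        f"type={kind} field={bank} channel={channel}: {count} entries"
        for (kind, bank, channel), count in counts.most_common()
    ]
```

For example, passing the lines of a copied vmkernel.log to tally_mca and printing the result of format_report gives a short list that shows whether the errors are concentrated on a single channel.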