ESXi Host Becomes Unresponsive or Crashes Due to Memory Controller Errors

Article ID: 373203

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

When the hardware encounters a large number of corrected memory errors, an ESXi host may exhibit the following symptoms:

  • The host may become unresponsive for an extended period
  • Increased storage latency, leading to intermittent storage access issues
  • Intermittent guest OS freezes
  • The host may experience a PSOD (purple screen of death) in cases of:
    • Uncorrected memory errors
    • Prolonged CPU lockups

If no PSOD occurs, the host may eventually resume normal operation without intervention.

Environment

- VMware ESXi 7.0 or newer
- Sample log entries indicating the issue:

  1. Memory Controller Errors found in /var/log/vmkernel.log, such as:

     YYYY-MM-DDTHH:MM:SS.991Z cpu22:3361787)MCA: 209: CE Poll G0 B8 S9c00004001010091 A6619XXXXX M200401c0898XXXXXX P6619f37XXX/40 Memory Controller Read Error on Channel 1.

  2. hostd Service Hang found in /var/log/vmkernel.log, such as:
     YYYY-MM-DDTHH:MM:SS.571Z cpu6:2100050)ALERT: hostd detected to be non-responsive

  3. Storage I/O issues found in /var/log/vmkernel.log, such as:
     YYYY-MM-DDTHH:MM:SS.001Z cpu3:2097735)WARNING: ScsiDeviceIO: 1513: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx performance has deteriorated. I/O latency increased from average value of 1004 microseconds to 107588 microseconds.
     YYYY-MM-DDTHH:MM:SS.167Z cpu2:2097201)ScsiDeviceIO: 4176: Cmd(0x45b96d402508) 0x89, CmdSN 0x1578666 from world 3361402 to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0

  4. Storage I/O Issues found in /var/log/hostd.log, such as:
     YYYY-MM-DDTHH:MM:SS.493Z warning hostd[2098944] [Originator@6876 sub=IoTracker] In thread 2099376, fopen("/vmfs/volumes/xxxxxxxx-xxxxxxx-xxxx-xxxxxxxxxxxx/vm-name/vm-name.vmx") took over 1236 sec.

Together, these log entries show the sequence of events: memory controller errors, followed by storage I/O issues, which lead to an unresponsive hostd service and VMFS volume access problems.
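To check a host for these signatures quickly, the sample entries above can be turned into a small log scanner. The following is a minimal Python sketch, not a VMware tool. The regular expressions are modeled only on the sample lines in this article, so message formats on other ESXi builds may differ, and the names `PATTERNS`, `classify_line` and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Assumption: these patterns mirror only the sample log lines in this article.
# Adjust them if the messages on your ESXi build are worded differently.
PATTERNS = {
    "memory_controller_error": re.compile(
        r"MCA: \d+: CE Poll .*Memory Controller .*Error on Channel \d+"
    ),
    "hostd_unresponsive": re.compile(
        r"ALERT: hostd detected to be non-responsive"
    ),
    "storage_latency": re.compile(
        r"ScsiDeviceIO: \d+: Device \S+ performance has deteriorated"
    ),
    "scsi_command_failed": re.compile(
        r"ScsiDeviceIO: \d+: Cmd\(0x[0-9a-fA-F]+\) .* failed H:0x[0-9a-fA-F]+"
    ),
    "slow_file_open": re.compile(
        r"sub=IoTracker\].*fopen\(.*\) took over \d+ sec"
    ),
}


def classify_line(line):
    """Return the name of the first matching signature, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines match each signature."""
    counts = Counter()
    for line in lines:
        name = classify_line(line)
        if name is not None:
            counts[name] += 1
    return counts
```

For example, `summarize(open("vmkernel.log", errors="replace"))` on a copy of /var/log/vmkernel.log returns a count per signature. A high `memory_controller_error` count alongside the storage and hostd signatures is consistent with the pattern described here, but it is not proof of the root cause on its own.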

Cause

Memory controller errors can keep the CPU busy during the error correction phase. This in turn triggers a series of CPU lockups and storage I/O issues, ultimately causing the hostd service to hang.
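Whether a logged error was corrected or uncorrected can be read from the status value in the MCA line. The Python sketch below assumes that the `S` token in the sample vmkernel.log entry is the raw x86 IA32_MCi_STATUS register value; this article does not confirm that mapping. The bit positions follow the machine-check architecture described in the Intel Software Developer's Manual, and the function names are hypothetical.

```python
import re

# Assumption: the 16-hex-digit token prefixed with "S" in the MCA log line
# is the raw IA32_MCi_STATUS value. This mapping is not confirmed by the article.
STATUS_RE = re.compile(r"\bS([0-9a-fA-F]{16})\b")

# MMM field of a memory controller error code (form 000F 0000 1MMM CCCC).
MEMORY_TRANSACTION = {
    0b000: "generic",
    0b001: "read",
    0b010: "write",
    0b011: "address/command",
    0b100: "scrub",
}


def decode_mci_status(status):
    """Decode selected architectural fields of an IA32_MCi_STATUS value."""
    code = status & 0xFFFF
    result = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        # Bits 52:38; meaningful only on CPUs that report corrected error counts.
        "corrected_error_count": (status >> 38) & 0x7FFF,
        "mca_error_code": code,
    }
    # Memory controller errors: bits 15:13 and 11:8 are 0, bit 7 is 1.
    # Bit 12 (the filter bit) is ignored here.
    if code & 0xEF80 == 0x0080:
        result["transaction"] = MEMORY_TRANSACTION.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel
    return result


def decode_log_line(line):
    """Extract and decode the status token from one MCA log line, if present."""
    match = STATUS_RE.search(line)
    return decode_mci_status(int(match.group(1), 16)) if match else None
```

Under these assumptions, the sample value `9c00004001010091` decodes to a valid, corrected (UC bit clear) memory read error on channel 1, which agrees with the text of the log message. A set UC bit would indicate an uncorrected error, the case associated with a PSOD.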

Resolution

Engage the hardware vendor for further diagnostics and resolution.