ESXi Host Becomes Unresponsive Due to Memory Controller Errors Leading to Storage I/O Issues

Article ID: 373203

Products

VMware vSphere ESXi

Issue/Introduction

An ESXi host may become unresponsive for an extended period without experiencing a purple screen of death (PSOD) or rebooting. The host eventually resumes normal operation without intervention.

Environment

- VMware ESXi 7.0 or newer
- Sample log entries indicating the issue:

  1. Memory Controller Errors found in /var/log/vmkernel.log, such as:
     2024-07-17T23:04:17.991Z cpu22:3361787)MCA: 209: CE Poll G0 B8 S9c00004001010091 A6619f377c0 M200401c089801086 P6619f377c0/40 Memory Controller Read Error on Channel 1.

  2. hostd Service Hang found in /var/log/vmkernel.log, such as:
     2024-07-17T22:24:35.571Z cpu6:2100050)ALERT: hostd detected to be non-responsive

  3. Storage I/O issues found in /var/log/vmkernel.log, such as:
     2024-07-17T23:01:34.001Z cpu3:2097735)WARNING: ScsiDeviceIO: 1513: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx performance has deteriorated. I/O latency increased from average value of 1004 microseconds to 107588 microseconds.
     2024-07-17T22:37:41.167Z cpu2:2097201)ScsiDeviceIO: 4176: Cmd(0x45b96d402508) 0x89, CmdSN 0x1578666 from world 3361402 to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0
     (This pattern is especially significant when I/O latency exceeds 1,000,000 microseconds and the 0xe 0x1d 0x0 sense data appears multiple times within a single second.)

  4. Storage I/O issues found in /var/log/hostd.log, such as:
     2024-07-17T22:23:21.493Z warning hostd[2098944] [Originator@6876 sub=IoTracker] In thread 2099376, fopen("/vmfs/volumes/xxxxxxxx-xxxxxxx-xxxx-xxxxxxxxxxxx/vm-name/vm-name.vmx") took over 36 sec.

  5. VMFS volume access issues, reported as access lost/recovered events:
     Volume access related events - Last 10 days
     UUID                                Volume                    Access Lost              Access Recovered
     ----------------------------------- ------------------------- ------------------------ ------------------------
     63a1eecc-42195f22-b269-0025b513a03b datastore-name            2024-07-17T22:21:37.520Z 2024-07-17T22:23:06.553Z

Together, these log entries show the sequence of events: memory controller errors, followed by storage I/O issues, the hostd service becoming unresponsive, and finally VMFS volume access loss. A sketch for extracting these signatures from a saved copy of vmkernel.log follows.
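If the host's logs have been copied off for analysis, a short script along the following lines can pull the vmkernel.log signatures into one chronological timeline. This is a minimal sketch, not a supported tool: the regular expressions are derived from the sample entries above, and the default log path is an assumption to adjust for your environment.

#!/usr/bin/env python3
# Sketch: scan a saved copy of vmkernel.log for the signatures described in
# this article and print them as one chronological timeline. Patterns are
# derived from the sample log entries above; the default path is assumed.

import re
import sys

# Signature name -> pattern, each taken from a sample entry in this article.
SIGNATURES = {
    "memory_controller_error": re.compile(r"MCA: .*Memory Controller .*Error"),
    "hostd_unresponsive": re.compile(r"hostd detected to be non-responsive"),
    "io_latency_deteriorated": re.compile(r"performance has deteriorated"),
    "sense_data_0xe_0x1d_0x0": re.compile(r"Valid sense data: 0xe 0x1d 0x0"),
}

# vmkernel.log lines begin with an ISO-8601 timestamp, e.g. 2024-07-17T23:04:17.991Z
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z)")

def scan(path):
    events = []
    with open(path, errors="replace") as log:
        for line in log:
            for name, pattern in SIGNATURES.items():
                if pattern.search(line):
                    match = TIMESTAMP.match(line)
                    events.append((match.group(1) if match else "", name, line.rstrip()))
                    break
    # ISO-8601 timestamps sort chronologically as plain strings.
    for timestamp, name, line in sorted(events):
        print(f"{timestamp}  [{name}]  {line}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log")

Run against a log containing the entries excerpted above, this prints the MCA, ScsiDeviceIO, and hostd alerts interleaved in time order, which makes the cascade easy to see at a glance.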

Cause

Memory controller errors can trigger a cascade of storage I/O issues, ultimately causing the hostd service to hang. This occurs because storage I/O operations depend on data moving through memory. As the hostd service attempts to manage an increasing number of pending operations due to I/O errors, it quickly exhausts its available heap memory, leading to unresponsiveness.
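To illustrate the scale of that pile-up, the back-of-envelope sketch below applies Little's law (in-flight operations = arrival rate × latency) to the latency figures from the sample ScsiDeviceIO entry above. The request rate and per-operation heap overhead are assumed, illustrative values, not measured hostd internals.

# Back-of-envelope illustration of the cascade described above.
# Little's law: in-flight operations L = arrival rate (lambda) x latency (W).
# The arrival rate and per-operation overhead are assumed values; the two
# latencies come from the sample ScsiDeviceIO log entry in this article.

arrival_rate = 500                # assumed I/O-related operations per second
baseline_latency = 1004e-6        # seconds (1004 microseconds, from the log)
degraded_latency = 107588e-6      # seconds (107588 microseconds, from the log)
bytes_per_pending_op = 8 * 1024   # assumed heap held per tracked operation

for label, latency in (("baseline", baseline_latency),
                       ("degraded", degraded_latency)):
    in_flight = arrival_rate * latency
    heap = in_flight * bytes_per_pending_op
    print(f"{label}: ~{in_flight:.0f} in-flight ops, "
          f"~{heap / 1024:.0f} KiB of tracked state")

# The degraded case holds ~107x more pending state than the baseline; with
# latencies above 1,000,000 microseconds (also seen in the field) the factor
# approaches 1000x, which is how a fixed-size heap can be exhausted.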

Resolution

1. Run memory diagnostics:
   a. Boot the affected host into a memory testing utility (e.g., the server manufacturer's diagnostic tool or Memtest86+).
   b. Run the memory test for at least 24 hours.
   c. If errors are detected, contact your hardware vendor for further assistance or potential hardware replacement.

2. Monitor the host:
   a. After the memory issue has been remediated (for example, after DIMM replacement) and stable operation is confirmed, closely monitor the host for any signs of similar issues.
   b. Pay particular attention to any storage-related alerts or performance degradation.

3. If issues persist:
   a. Collect a new set of ESXi host logs and review them for recurring memory or storage errors (see the tally sketch after this list).
   b. Consider engaging your hardware vendor for a more thorough hardware diagnostic if memory tests pass but issues continue.
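
For step 3a, a small script along the following lines can tally correctable memory errors per bank and channel from a saved vmkernel.log, making it easier to show the hardware vendor whether errors keep landing on the same DIMM. This is a sketch under the assumption that the MCA lines follow the format of the sample entry in this article; the default path is likewise an assumption.

#!/usr/bin/env python3
# Sketch: tally correctable memory-controller errors per group/bank/channel
# from a saved vmkernel.log. The regular expression assumes the MCA line
# format shown in the sample entry above.

import re
import sys
from collections import Counter

# Matches e.g. "MCA: 209: CE Poll G0 B8 S... Memory Controller Read Error on Channel 1."
MCA_LINE = re.compile(
    r"MCA: \d+: CE Poll G(?P<group>\d+) B(?P<bank>\d+) .*"
    r"Memory Controller (?P<kind>\w+) Error on Channel (?P<channel>\d+)"
)

def tally(path):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            match = MCA_LINE.search(line)
            if match:
                counts[(match["group"], match["bank"],
                        match["channel"], match["kind"])] += 1
    for (group, bank, channel, kind), n in counts.most_common():
        print(f"G{group} B{bank} Channel {channel} ({kind} errors): {n}")

if __name__ == "__main__":
    tally(sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log")

Repeated hits on a single bank/channel pair generally point at one DIMM, while errors scattered across channels may implicate the memory controller or board; either way, the tally gives the vendor concrete data to work from.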