An ESXi host may become unresponsive for an extended period without experiencing a purple screen of death (PSOD) or rebooting. The host eventually resumes normal operation without intervention.
Environment:
- VMware ESXi 7.0 or newer
- Sample log entries indicating the issue:
1. Memory Controller Errors found in /var/log/vmkernel.log, such as:
2024-07-17T23:04:17.991Z cpu22:3361787)MCA: 209: CE Poll G0 B8 S9c00004001010091 A6619f377c0 M200401c089801086 P6619f377c0/40 Memory Controller Read Error on Channel 1.
2. hostd Service Hang found in /var/log/vmkernel.log, such as:
2024-07-17T22:24:35.571Z cpu6:2100050)ALERT: hostd detected to be non-responsive
3. Storage I/O Issues found in /var/log/vmkernel.log, such as:
2024-07-17T23:01:34.001Z cpu3:2097735)WARNING: ScsiDeviceIO: 1513: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx performance has deteriorated. I/O latency increased from average value of 1004 microseconds to 107588 microseconds.
2024-07-17T22:37:41.167Z cpu2:2097201)ScsiDeviceIO: 4176: Cmd(0x45b96d402508) 0x89, CmdSN 0x1578666 from world 3361402 to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0
(the issue is more strongly indicated when I/O latency exceeds 1,000,000 microseconds and sense data 0xe 0x1d 0x0 appears multiple times within a single second)
4. Storage I/O Issues found in /var/log/hostd.log, such as:
2024-07-17T22:23:21.493Z warning hostd[2098944] [Originator@6876 sub=IoTracker] In thread 2099376, fopen("/vmfs/volumes/xxxxxxxx-xxxxxxx-xxxx-xxxxxxxxxxxx/vm-name/vm-name.vmx") took over 36 sec.
5. VMFS Volume Access Issue:
Volume access related events - Last 10 days
UUID                                Volume                    Access Lost              Access Recovered
----------------------------------- ------------------------- ------------------------ ------------------------
63a1eecc-42195f22-b269-0025b513a03b datastore-name            2024-07-17T22:21:37.520Z 2024-07-17T22:23:06.553Z
These log entries illustrate the sequence of events: memory controller errors, followed by storage I/O issues and VMFS volume access problems, ultimately leading to the hostd service becoming unresponsive.
Memory controller errors can trigger a cascade of storage I/O issues, ultimately causing the hostd service to hang. This occurs because storage I/O operations depend on data moving through memory. As the hostd service attempts to manage an increasing number of pending operations due to I/O errors, it quickly exhausts its available heap memory, leading to unresponsiveness.
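As a triage aid, the following is a minimal sketch (assuming Python 3 and a locally readable copy of vmkernel.log; the regular expressions and the 1,000,000-microsecond threshold simply mirror the samples above and are illustrative, not official values) that counts how often each of these signatures appears in the log:

#!/usr/bin/env python3
"""Minimal sketch: scan a copy of vmkernel.log for the signatures described above.
The patterns and thresholds are assumptions based on the sample log lines, not
official VMware values; adjust them to your environment."""
import re
import sys
from collections import Counter

MCA_RE     = re.compile(r"MCA: \d+: CE Poll .* Memory Controller .* Error")
HOSTD_RE   = re.compile(r"ALERT: hostd detected to be non-responsive")
LATENCY_RE = re.compile(r"performance has deteriorated.*to (\d+) microseconds")
SENSE_RE   = re.compile(r"Valid sense data: 0xe 0x1d 0x0")
TS_RE      = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")  # second granularity

LATENCY_THRESHOLD_US = 1_000_000  # "very high" latency per the note above

def scan(path):
    mca = hostd = high_latency = 0
    sense_per_second = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            if MCA_RE.search(line):
                mca += 1
            if HOSTD_RE.search(line):
                hostd += 1
            m = LATENCY_RE.search(line)
            if m and int(m.group(1)) >= LATENCY_THRESHOLD_US:
                high_latency += 1
            if SENSE_RE.search(line):
                ts = TS_RE.match(line)
                if ts:
                    sense_per_second[ts.group(1)] += 1

    sense_bursts = sum(1 for n in sense_per_second.values() if n > 1)
    print(f"MCA correctable memory errors                  : {mca}")
    print(f"hostd non-responsive alerts                    : {hostd}")
    print(f"I/O latency warnings >= {LATENCY_THRESHOLD_US} microseconds : {high_latency}")
    print(f"Seconds with repeated 0xe 0x1d 0x0 sense data  : {sense_bursts}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log")

If all four counters are non-zero over the same time window, the cascade described in this article is a likely match and the resolution steps below apply.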
Resolution:
1. Run memory diagnostics:
a. Boot the affected host into a memory testing utility (e.g., the server manufacturer's diagnostic tool or Memtest86+).
b. Run the memory test for at least 24 hours.
c. If errors are detected, contact your hardware vendor for further assistance or potential hardware replacement.
2. Monitor the host:
a. After any remediation (for example, hardware replacement or firmware updates) and after confirming stable operation, closely monitor the host for any signs of similar issues.
b. Pay particular attention to storage-related alerts or performance degradation; a log-following sketch that can assist with this is shown after this list.
3. If issues persist:
a. Collect a new set of ESXi host logs and review for recurring memory or storage errors.
b. Consider engaging your hardware vendor for a more thorough hardware diagnostic if memory tests pass but issues continue.
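To support the monitoring in step 2, the sketch below (again assuming Python 3 is available wherever vmkernel.log can be read; the patterns and the five-second poll interval are assumptions, not official values) follows the log and prints a line each time one of the known signatures reappears:

#!/usr/bin/env python3
"""Minimal monitoring sketch for step 2: follow vmkernel.log and flag new
memory-controller, storage-latency, or hostd-responsiveness events. Patterns
and poll interval are assumptions based on the sample log lines above."""
import re
import time

LOG_PATH = "/var/log/vmkernel.log"  # adjust if logging is redirected
PATTERNS = {
    "memory controller error": re.compile(r"MCA: \d+: CE Poll"),
    "I/O latency deteriorated": re.compile(r"performance has deteriorated"),
    "hostd non-responsive":     re.compile(r"hostd detected to be non-responsive"),
}

def follow(path, interval=5.0):
    """Yield lines appended to the file, polling every `interval` seconds."""
    with open(path, errors="replace") as log:
        log.seek(0, 2)  # start at the current end of the file
        while True:
            line = log.readline()
            if not line:
                time.sleep(interval)
                continue
            yield line

if __name__ == "__main__":
    for line in follow(LOG_PATH):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                print(f"[monitor] {label}: {line.strip()}")

Stop the script once the monitoring window has passed; recurring matches after remediation are a signal to proceed to step 3.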