In a scenario where the hardware encounters a large number of corrected memory errors, an ESXi server may exhibit the following symptoms:
1. Memory controller errors found in /var/log/vmkernel.log, such as:
YYYY-MM-DDTHH:MM:SS cpu##:##)MCA: ##: CE Poll G0 B8 S9c0000####0091 A6619##### M200401c0898##### P6619f37#####/40 Memory Controller Read Error on Channel 1.
2. hostd service hang found in /var/log/vmkernel.log, such as:
YYYY-MM-DDTHH:MM:SS cpu#:####)ALERT: hostd detected to be non-responsive
3. Storage I/O issues found in /var/log/vmkernel.log, such as:
YYYY-MM-DDTHH:MM:SS cpu#:####)WARNING: ScsiDeviceIO: 1513: Device naa.############### performance has deteriorated. I/O latency increased from average value of 1234 microseconds to 123456 microseconds.
YYYY-MM-DDTHH:MM:SS cpu#:####)ScsiDeviceIO: 4176: Cmd(0x45b96d402508) 0x89, CmdSN 0x1578666 from world 3361402 to dev "naa.#################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0
4. Storage I/O issues found in /var/log/hostd.log, such as:
YYYY-MM-DDTHH:MM:SS warning hostd[####] [Originator@6876 sub=IoTracker] In thread 2099376, fopen("/vmfs/volumes/#####-#####-####-###########/vm-name/vm-name.vmx") took over 1236 sec.
5. EFI + VMB messages soon after the controller read error:
YYYY-MM-DDTHH:MM:SS ************** vmkernel: cpu53:***** )MCA: 202: CE Intr G0 B19 S************ Ad********** M************ P**********/40 Memory Controller Read Error on Channel 0.
YYYY-MM-DDTHH:MM:SS ************** vmkernel: VMB: 65: Reserved * MPNs starting @ ****
YYYY-MM-DDTHH:MM:SS ************** vmkernel: EFI: 250: 64-bit EFI v2.80 revision ******
YYYY-MM-DDTHH:MM:SS ************** vmkernel: VMB_SERIAL: 170: Serial port configuration obtained from firmware.
YYYY-MM-DDTHH:MM:SS ************** vmkernel: VMB: 79: TDX: Unsupported on CPU (MSR_MTRRCAP = ******)
These log entries illustrate the sequence of events: memory controller errors, followed by storage I/O issues and VMFS volume access problems, ultimately leading to the hostd service becoming unresponsive.
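As a quick triage step, these signatures can be checked directly from the ESXi Shell. The commands below are a minimal sketch, assuming SSH or local Shell access to the host and the default log locations shown above; the search strings are taken from the sample entries and may need to be adapted to the exact wording in a given build.

# Corrected memory errors reported by the Machine Check Architecture (MCA)
grep "Memory Controller Read Error" /var/log/vmkernel.log
# hostd heartbeat alerts logged by the VMkernel
grep "hostd detected to be non-responsive" /var/log/vmkernel.log
# Degraded storage I/O latency and failed SCSI commands
grep -E "performance has deteriorated|ScsiDeviceIO" /var/log/vmkernel.log
# Slow VMFS access observed by hostd's IoTracker
grep "sub=IoTracker" /var/log/hostd.log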
Memory controller errors can keep the CPU busy during the error-correction phase. This in turn triggers a series of CPU lockups and storage I/O issues, ultimately causing the hostd service to hang.
Engage the hardware vendor for further diagnostics and resolution.
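Before opening a case with the vendor, it helps to capture the host's diagnostic data while the errors are still present. The commands below are a minimal sketch, assuming shell access to the host: vm-support is the standard ESXi log-bundle collector, and the IPMI system event log query is an assumption that applies only where the esxcli hardware ipmi namespace is available (ESXi 6.5 and later).

# IPMI/BMC system event log, which typically records the corrected memory errors
# (assumes the esxcli hardware ipmi namespace is present on this ESXi release)
esxcli hardware ipmi sel list
# Collect a full diagnostic bundle (vmkernel.log, hostd.log, hardware details) for the vendor
vm-support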