ESXi host becomes unresponsive due to memory correctable training errors

Article ID: 432516

Products

VMware vSAN

Issue/Introduction

An ESXi host in a vSAN cluster becomes unresponsive and shows as "Not Responding" in vCenter Server. Local console access reveals that the hostd process is non-responsive, and the system may become completely locked, preventing log collection or CLI interaction.

Symptoms:

  • Host status "Not Responding" in vSphere Client.
  • System logs report that hostd was detected to be non-responsive.

Environment

VMware vSAN 7.x

Cause

The issue is caused by a hardware-level memory fault. Specifically, the server's Unified Extensible Firmware Interface (UEFI) firmware, as reported through the hardware management console, detects correctable training errors on a specific memory module (e.g., Slot B1). Although these errors are classed as "correctable," they can escalate or introduce timing issues that cause the ESXi kernel or critical management agents (such as hostd) to hang.

Resolution

To resolve this issue, follow these steps:

  1. Review the hardware System Event Log (SEL) through the hardware management interface (iDRAC, iLO, or IPMI). Look for the following signature: "UEFI####: One or more memory correctable training errors have occurred on memory slot: <SLOT_ID>."

  2. If the host is currently unresponsive, perform a cold boot of the physical server to clear the hung state and allow the host to reconnect to vCenter.

  3. Since this is a physical layer failure, contact your hardware vendor to:

    • Perform a diagnostic stress test on the identified memory slot.

    • Reseat the memory module in the specified slot (e.g., Slot B1).

    • Replace the faulty Dual In-line Memory Module (DIMM) if the error persists.

    • Ensure the server BIOS/Firmware is updated to the latest vendor-recommended version, as firmware updates often include improved memory training algorithms.

  4. Once the host is back online and the hardware repairs are complete, navigate to vSAN Cluster > Monitor > vSAN > Skyline Health and confirm that all data objects are healthy and fully redundant.
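
The log review in step 1 can be partly automated when the SEL has been exported to plain text. The following is a minimal sketch, not a vendor tool: it assumes each event appears on one line in the wording quoted in step 1, and the function names (find_training_errors, count_by_slot) are hypothetical. Because the article elides the numeric event ID as "####", the pattern accepts either digits or that placeholder.

```python
import re
from collections import Counter

# Pattern built from the signature quoted in step 1. The numeric part of the
# event ID is elided in the article ("UEFI####"), so digits or '#' are accepted.
SIGNATURE = re.compile(
    r"(UEFI[0-9#]+):\s*One or more memory correctable training errors "
    r"have occurred on memory slot:\s*([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)


def find_training_errors(sel_text):
    """Return (event_id, slot) tuples for each matching line of exported SEL text."""
    return [
        # Strip a trailing period so "B1." and "B1" are treated as the same slot.
        (m.group(1).upper(), m.group(2).rstrip("."))
        for m in SIGNATURE.finditer(sel_text)
    ]


def count_by_slot(sel_text):
    """Count matching events per memory slot to highlight a repeatedly failing DIMM."""
    return Counter(slot for _, slot in find_training_errors(sel_text))


if __name__ == "__main__":
    # Hypothetical sample text; real SEL exports differ by vendor and tool.
    sample = (
        "UEFI####: One or more memory correctable training errors "
        "have occurred on memory slot: B1.\n"
        "Unrelated event line\n"
        "UEFI####: One or more memory correctable training errors "
        "have occurred on memory slot: B1.\n"
    )
    for slot, count in count_by_slot(sample).items():
        print(f"slot {slot}: {count} correctable training error event(s)")
```

A slot that appears repeatedly in the output is a reasonable candidate to name when opening the hardware vendor case described in step 3.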

Additional Information

Correctable errors are often early indicators of imminent DIMM failure. Ignoring these alerts can lead to uncorrectable memory errors, resulting in a system crash and potential data unavailability within a vSAN environment.