ESXi host reboots unexpectedly with "warning: PCPU ## didn't have a heartbeat" messages

Article ID: 394262

Updated On:

Products

VMware vCenter Server

Issue/Introduction

  • An ESXi host may reboot unexpectedly without generating a proper crash dump.
  • Before the reboot, vCenter Server reports multiple errors including "Host is not responding" and "Cannot synchronize host."
  • The vSphere HA system detects the host failure and restarts virtual machines on other hosts in the cluster.
  • After the reboot, the host returns to normal operation, but the unexpected reboot may recur if the underlying cause is not addressed.
  • The vmkernel logs show specific entries indicating this condition:
    WARNING: Heartbeat: 961: PCPU 40 didn't have a heartbeat for 5 seconds, timeout is 10, 1 IPIs sent; *may* be locked up.
    WARNING: Heartbeat: 961: PCPU 41 didn't have a heartbeat for 15 seconds, timeout is 10, 2 IPIs sent; *may* be locked up.
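
To make these entries easier to triage, the fields in each warning can be extracted programmatically. The following is a minimal Python sketch, not a VMware-provided tool. The regular expression and the function name parse_heartbeat_warning are illustrative assumptions modeled only on the two sample lines above; other ESXi builds may word the message differently.

```python
import re

# Assumption: the pattern mirrors the two sample lines quoted in this article.
HEARTBEAT_RE = re.compile(
    r"PCPU (?P<pcpu>\d+) didn't have a heartbeat for (?P<stalled>\d+) seconds, "
    r"timeout is (?P<timeout>\d+), (?P<ipis>\d+) IPIs sent"
)


def parse_heartbeat_warning(line):
    """Return the numeric fields of a heartbeat warning, or None if no match."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    # Convert every captured field (pcpu, stalled, timeout, ipis) to an integer.
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "WARNING: Heartbeat: 961: PCPU 41 didn't have a heartbeat for 15 seconds, "
    "timeout is 10, 2 IPIs sent; *may* be locked up."
)
fields = parse_heartbeat_warning(sample)
print(fields)
# The stall in this sample (15 seconds) is longer than the reported timeout (10).
print(fields["stalled"] > fields["timeout"])
```

In the samples, the first value is how long the PCPU has gone without a heartbeat and the second is the timeout reported in the same message, so comparing the two shows which warnings were logged after the timeout had already passed.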
    

Environment

  • ESXi 7.0 and newer

Cause

The unexpected reboot is caused by hardware-level failures in specific physical CPU cores (PCPUs). When one or more physical CPU cores become unresponsive, the ESXi heartbeat monitoring system detects that these cores are not responding to Inter-Processor Interrupts (IPIs). After multiple failed attempts to communicate with the locked-up cores, the server experiences a fault condition that triggers a reboot.

These heartbeat failures are symptomatic of physical CPU hardware issues that cannot be resolved through software configuration changes.

Resolution

Because this is a hardware issue, take the following steps:

  1. Place the affected host in maintenance mode to prevent workloads from being impacted if the issue recurs.

  2. Review the ESXi host vmkernel.log to confirm PCPU heartbeat failure messages.

  3. Contact your server hardware vendor to perform comprehensive hardware diagnostics.

  4. Update server firmware:

    1. Check for and apply the latest BIOS updates for your server model.

    2. Update any related firmware components (chipset, management controllers).

    3. Apply any microcode updates available for your processor model.

  5. If the issue persists after firmware updates, work with your hardware vendor for resolution.
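
When confirming the messages in step 2 or preparing information for the hardware vendor, it can help to know which PCPU numbers appear in the warnings and how long each was reported as stalled. The following is a minimal Python sketch under stated assumptions: the function name summarize_heartbeat_warnings is hypothetical, and the pattern is based only on the sample messages in this article.

```python
import re
from collections import defaultdict

# Assumption: warnings follow the wording of the sample lines in this article.
WARNING_RE = re.compile(r"PCPU (\d+) didn't have a heartbeat for (\d+) seconds")


def summarize_heartbeat_warnings(lines):
    """Map each PCPU number to its warning count and longest reported stall."""
    summary = defaultdict(lambda: {"warnings": 0, "max_stalled_seconds": 0})
    for line in lines:
        match = WARNING_RE.search(line)
        if match is None:
            continue  # Not a heartbeat warning; skip it.
        pcpu, stalled = int(match.group(1)), int(match.group(2))
        entry = summary[pcpu]
        entry["warnings"] += 1
        entry["max_stalled_seconds"] = max(entry["max_stalled_seconds"], stalled)
    return dict(summary)


# Sample input: the two lines quoted in this article plus a made-up unrelated line.
sample_log = [
    "WARNING: Heartbeat: 961: PCPU 40 didn't have a heartbeat for 5 seconds, "
    "timeout is 10, 1 IPIs sent; *may* be locked up.",
    "WARNING: Heartbeat: 961: PCPU 41 didn't have a heartbeat for 15 seconds, "
    "timeout is 10, 2 IPIs sent; *may* be locked up.",
    "an unrelated log line",
]

for pcpu, stats in sorted(summarize_heartbeat_warnings(sample_log).items()):
    print(pcpu, stats["warnings"], stats["max_stalled_seconds"])
```

To run this against a real log, pass an open file object instead of the sample list. The log location is an assumption to verify on your host; /var/run/log/vmkernel.log is a common path. If the same few PCPU numbers recur across reboots, include that list when you contact the hardware vendor.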

Additional Information

  • This issue specifically affects physical CPU cores and is distinct from virtual CPU (vCPU) scheduling issues.
  • The problem may appear intermittently, making it difficult to diagnose without examining logs shortly after the event.
  • The absence of a vmkernel crash dump is typical in this scenario because the CPU hardware issue prevents proper crash dump generation.
  • If you cannot immediately address the hardware problem, consider these temporary mitigations:
    • Keep the host in maintenance mode to prevent production workload disruption.
    • If the host must remain in production, use DRS settings (for example, VM-Host affinity rules) to steer new VM placements away from the host.