ESXi system may crash or hang after 1044 days uptime

Article ID: 313165

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
A system using an EPYC 7002/7Fx2/7Hx2 Series CPU (codenamed Rome) or EPYC 7001 Series CPU (codenamed Naples) may crash or hang after approximately 1044 days of continuous uptime.

Cause

The issue is caused by AMD erratum 1474; see https://www.amd.com/system/files/TechDocs/56323-PUB_1.01.pdf.  If the CC6 (core C6) power saving state is enabled on an affected CPU, a core may fail to exit CC6 about 1044 days after the last system hardware reset.  Note that a reboot using VMware QuickBoot is not a system hardware reset.

Resolution

Currently there is no resolution to the issue. This will be fixed in a future release.

Workaround:

To work around the issue, do one of the following:

  1. Perform a system hardware reset at least once every 1044 days.
  2. Disable CC6.  This may result in increased power usage.

Disable CC6 without a system hardware reset by running the attached Python script in the ESXi shell. Disabling CC6 is not persistent, so the user will need to run the script again after each full reset.

Alternatively, the machine may provide a way to persistently disable CC6 as a BIOS setup option.  Details on how to do this depend on the hardware vendor and cannot be provided in this article.
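
For the first workaround, the reset deadline comes down to simple date arithmetic. The following Python snippet is only an illustrative sketch and is not the attached script. The function names and the 30-day safety margin are assumptions chosen for this example. The date of the last system hardware reset must be supplied from your own records, because a QuickBoot reboot does not count as a system hardware reset and so the host's reported uptime may not reflect it.

```python
from datetime import date, timedelta

# Approximate threshold from this article: days since the last system hardware reset.
ERRATUM_UPTIME_DAYS = 1044


def reset_deadline(last_hw_reset, margin_days=30):
    """Return the date by which the next system hardware reset should happen.

    last_hw_reset: date of the last system hardware reset (not a QuickBoot reboot).
    margin_days:   hypothetical safety margin subtracted from the 1044-day limit.
    """
    if not 0 <= margin_days < ERRATUM_UPTIME_DAYS:
        raise ValueError("margin_days must be between 0 and 1043")
    return last_hw_reset + timedelta(days=ERRATUM_UPTIME_DAYS - margin_days)


def days_remaining(last_hw_reset, today, margin_days=30):
    """Return days left before the deadline; zero or negative means a reset is due."""
    return (reset_deadline(last_hw_reset, margin_days) - today).days


if __name__ == "__main__":
    last_reset = date(2021, 1, 1)  # example value; replace with the recorded reset date
    print("Reset by:", reset_deadline(last_reset).isoformat())
    print("Days remaining:", days_remaining(last_reset, date.today()))
```

Because the 1044-day figure is approximate, a margin well ahead of the limit is the safer choice when scheduling maintenance windows.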


Attachments

disable_cc6_v2