NSX Bare Metal Edge stops working after an extended uptime


Article ID: 379021


Updated On:

Products

VMware NSX

Issue/Introduction

 

An NSX-T Bare Metal Edge stops working, and a reboot is required to bring the edge back into an operational state.

The NSX Bare Metal Edge has been running without issue for over 1000 days.
The NSX Bare Metal Edge uses AMD processors.
The NSX Bare Metal Edge stops functioning, and console access to the server shows log lines such as:

watchdog: BUG: soft lockup - CPU### stuck for ##s! [process_name:##]
audit backlog limit exceeded
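
To check a saved console capture for these messages, a pattern match along the following lines may help. This is a minimal sketch, not a supported tool: the function name `find_lockup_lines` is hypothetical, and the regular expression assumes the usual Linux kernel message layout (`CPU#<n> stuck for <n>s! [<process>:<pid>]`), which may differ slightly between kernel versions.

```python
import re

# Assumed layout of the kernel soft-lockup message; adjust if your console output differs.
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<proc>[^\]]+):(?P<pid>\d+)\]"
)
AUDIT_BACKLOG = "audit backlog limit exceeded"


def find_lockup_lines(text):
    """Return (soft_lockups, audit_hits) found in a console capture.

    soft_lockups is a list of (cpu, seconds, process) tuples;
    audit_hits is the number of lines containing the audit backlog message.
    """
    soft_lockups = []
    audit_hits = 0
    for line in text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            soft_lockups.append(
                (int(match["cpu"]), int(match["secs"]), match["proc"])
            )
        if AUDIT_BACKLOG in line:
            audit_hits += 1
    return soft_lockups, audit_hits
```

Soft lockups reported on a host that matches the conditions in the Cause section are a hint, not proof, that this issue is involved; other faults can produce the same messages.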

Environment

VMware NSX-T 3.x

VMware NSX 4.x

Cause

Per the Dell advisory, a Bare Metal Edge (BME) may experience this issue under the following conditions:

  • The processor is an AMD EPYC 7532 32-Core Processor.
  • The Bare Metal Edge has been powered on for more than 1044 days.

Resolution

Dell describes two workarounds in the referenced advisory:

  1. Disable 'cstate' in the BIOS to prevent CPU cores from entering the CC6 state.
  2. Reboot the system before its uptime reaches 1044 days. This can be a warm or a cold reboot.
    1. According to AMD erratum 1474, an AMD CPU core may stop responding after about 1044 days of uptime.
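
For the reboot workaround, it can help to track how close a host is to the 1044-day mark. The following is a minimal sketch under stated assumptions: it reads uptime from the standard Linux `/proc/uptime` file, which is assumed to be readable from the edge's root shell, and the function names and the 30-day safety margin are hypothetical choices, not values from the advisory.

```python
SECONDS_PER_DAY = 86_400
THRESHOLD_DAYS = 1044  # uptime figure quoted in the advisory


def read_uptime_seconds(path="/proc/uptime"):
    """Read system uptime in seconds from the Linux /proc/uptime file."""
    with open(path) as fh:
        # The first field is the uptime in seconds; the second is idle time.
        return float(fh.read().split()[0])


def days_remaining(uptime_seconds, threshold_days=THRESHOLD_DAYS):
    """Days left before uptime reaches the threshold (negative if already past)."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


def reboot_due(uptime_seconds, margin_days=30, threshold_days=THRESHOLD_DAYS):
    """True when uptime is within margin_days of the threshold.

    margin_days is an assumed safety margin; pick one that fits your
    maintenance windows.
    """
    return days_remaining(uptime_seconds, threshold_days) <= margin_days
```

On a Linux host, `reboot_due(read_uptime_seconds())` returns `True` once the system is within the chosen margin of 1044 days, which could be used to raise a maintenance reminder well before the limit is reached.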