NSX Manager node(s) become unresponsive and the VM console shows errors like "BUG: soft lockup - CPU#<##> stuck for <##>s!..."

Article ID: 399356

Products

VMware NSX

Issue/Introduction

  • Rebooting the affected NSX Manager VM does not resolve the issue.
  • In the NSX Manager UI of a working node, the cluster is shown in a degraded state under System/Appliances.
  • The console of the affected VM shows repeated "CPU stuck for x seconds" messages.
  • The following entries appear in the /var/log/kern.log file of the impacted NSX Manager VM:

    <Time-stamp> <NSX-Manager-hostname> kernel - - - [588132.585048] watchdog: BUG: soft lockup - CPU#20 stuck for 10s! [Log4j2-TF-1-Asy:8284]
    <Time-stamp> <NSX-Manager-hostname> kernel - - - [588132.585051] watchdog: BUG: soft lockup - CPU#6 stuck for 9s!
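To gauge how often and on which CPUs the lockups occur, the kern.log entries can be summarized with a short script. The following is a minimal sketch, not an official tool: it assumes the message format matches the sample lines above, and the function names (find_soft_lockups, summarize, report) are hypothetical helpers introduced here for illustration.

```python
import re
from collections import Counter

# Matches the kernel watchdog message shown in the sample above, e.g.
# "BUG: soft lockup - CPU#20 stuck for 10s! [Log4j2-TF-1-Asy:8284]".
# The trailing "[task:pid]" part is optional because some lines omit it.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
    r"(?: \[(?P<task>[^\]]*)\])?"
)


def find_soft_lockups(lines):
    """Return a list of (cpu, seconds, task) tuples, one per soft lockup line."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append(
                (int(match["cpu"]), int(match["secs"]), match["task"])
            )
    return events


def summarize(events):
    """Count soft lockup events per CPU number."""
    return Counter(cpu for cpu, _, _ in events)


def report(path="/var/log/kern.log"):
    """Print one summary line per CPU for the given log file (path is an assumption)."""
    with open(path, errors="replace") as handle:
        events = find_soft_lockups(handle)
    for cpu, count in sorted(summarize(events).items()):
        longest = max(secs for c, secs, _ in events if c == cpu)
        print(f"CPU#{cpu}: {count} lockup(s), longest {longest}s")
```

Calling report() against a copy of the log taken from the affected node prints one line per CPU. Lockups spread across many CPUs are consistent with host-level CPU contention, but the script only counts messages and does not diagnose the cause.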

Environment

  • VMware NSX 4.x
  • VMware NSX-T Data Center 3.x

Cause

High CPU usage on the ESXi host where the NSX Manager VM is running.

Resolution

The NSX Manager functionality can be restored by moving the VM to a different, healthy host and rebooting it. Verify the original host is healthy and not overprovisioned or over-utilized before moving any NSX VM back to it.

Additional Information

Snapshots of NSX Manager appliance VMs should not be taken, as explained in the document Disable Snapshots on an NSX Appliance. Because snapshot quiesce operations can also trigger the CPU soft lockup state, make sure that no snapshots exist for the NSX Manager VMs.

For the same condition in VMs generally, not specific to NSX environments, refer also to the vSphere ESXi article Error: "kernel: BUG: soft lockup - CPU#Y stuck for Xs" within VM.