NSX Manager services are down and vCenter Server reports High CPU usage for NSX Manager

Products

VMware NSX

Issue/Introduction

NSX Manager services go down when the NSX Manager node runs on a specific ESXi host.
As a result, the NSX Manager cluster enters a Degraded state.
The "Virtual Machine CPU usage" alarm is triggered on the vCenter Server for the affected NSX Manager node.

On the ESXi host where the affected NSX Manager node is running, a negative available CPU capacity is reported.

A timeout is observed in /var/log/syslog when the Manager nodes try to fetch the layout:

YYYY-MM-DDTHH:MM:SS.SSSZ <nsx_manager_fqdn> NSX 3487 - - Tried to get layout from <nsx_manager_1_ip>:9000 but failed by timeout
YYYY-MM-DDTHH:MM:SS.SSSZ <nsx_manager_fqdn> NSX 6236 - - Tried to get layout from <nsx_manager_2_ip>:9000 but failed by timeout
YYYY-MM-DDTHH:MM:SS.SSSZ <nsx_manager_fqdn> NSX 3487 - - Tried to get layout from <nsx_manager_3_ip>:9000 but failed by timeout
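
To see which peers the layout fetch is timing out against, the lines above can be tallied per peer. The following is a minimal sketch, not a supported tool: the function name count_layout_timeouts is hypothetical, and it assumes the message text matches the samples shown above exactly.

```shell
# Sketch: count "Tried to get layout ... failed by timeout" lines per peer.
# Reads syslog-style lines on stdin; count_layout_timeouts is a hypothetical name.
count_layout_timeouts() {
    # Extract the "<peer>:<port>" token from matching lines only,
    # then count occurrences and print "<peer>:<port> <count>".
    sed -n 's/.*Tried to get layout from \([^ ]*\) but failed by timeout.*/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

It could be run on the NSX Manager node as count_layout_timeouts < /var/log/syslog. Timeouts against every peer, including the local node, would be consistent with the node itself being starved rather than with a network problem toward a single peer.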

layoutHelper reports the following lines in /var/log/syslog:

YYYY-MM-DDTHH:MM:SS.SSSZ <nsx_manager_fqdn> NSX 3487 - - layoutHelper: System seems unavailable
YYYY-MM-DDTHH:MM:SS.SSSZ <nsx_manager_fqdn> NSX 6236 - - layoutHelper: System seems unavailable
YYYY-MM-DDTHH:MM:SS.SSSZ <nsx_manager_fqdn> NSX 3826 - - layoutHelper: System seems unavailable
YYYY-MM-DDTHH:MM:SS.SSSZ <nsx_manager_fqdn> NSX 3826 - - message repeated 3 times: [layoutHelper: System seems unavailable]

/var/log/corfu/corfu.9000.log reports a TimeoutException:

YYYY-MM-DDTHH:MM:SS.SSSZ | ERROR | failAfter-0 | o.c.i.LocalMonitoringService | Error requesting sequencer metrics:
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$OrApply.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
at org.corfudb.util.CFUtils.lambda$failAfter$0(CFUtils.java:118)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.TimeoutException: null
at org.corfudb.util.CFUtils.<clinit>(CFUtils.java:36)
at org.corfudb.runtime.clients.NettyClientRouter.sendRequestAndGetCompletable(NettyClientRouter.java:498)
at org.corfudb.runtime.clients.AbstractClient.sendRequestWithFuture(AbstractClient.java:43)
at org.corfudb.runtime.clients.LayoutClient.getLayout(LayoutClient.java:38)
at org.corfudb.runtime.CorfuRuntime.lambda$fetchLayout$6(CorfuRuntime.java:1295)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
... 3 common frames omitted

/var/log/cbm/cbm.log reports that all services are down:

YYYY-MM-DDTHH:MM:SS.SSSZ INFO HeartbeatServiceServiceMonitorStatusUpdaterThread ServiceMonitor 3826 - [nsx@6876 comp="nsx-manager" level="INFO" s2comp="service-monitor" subcomp="cbm"] New entity status: [Epoch: 20]SEARCH:DOWN,PROTON:DOWN,HTTP:DOWN,CM_INV:DOWN,IDPS_REPORTING:DOWN,SM:DOWN,MESSAGING_MANAGER:DOWN,CONTROLLER:DOWN,AR:DOWN,CLUSTER_MANAGER:DOWN,MONITORING:DOWN
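
The status line above is long, so it can help to list only the services marked DOWN. This is a rough sketch under stated assumptions: the function name list_down_services is hypothetical, and it assumes the "New entity status: [Epoch: N]NAME:STATE,..." layout shown in the sample.

```shell
# Sketch: list services reported as DOWN in the latest cbm status line.
# Reads cbm.log-style lines on stdin; list_down_services is a hypothetical name.
list_down_services() {
    # Take the most recent status line, drop everything up to "[Epoch: N]",
    # split the comma-separated NAME:STATE pairs, and keep the DOWN ones.
    grep 'New entity status:' | tail -n 1 \
        | sed 's/.*\[Epoch: [0-9]*]//' \
        | tr ',' '\n' \
        | awk -F: '$2 == "DOWN" {print $1}'
}
```

For example, list_down_services < /var/log/cbm/cbm.log prints one service name per line. For the sample line above it would print all eleven names, from SEARCH through MONITORING.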

In the top command output, the 1-minute load average climbs above 100 within a few minutes:

top - HH:MM:SS up 7 days, 6:01, 0 users, load average: 0.95, 1.17, 1.15
top - HH:MM:SS up 7 days, 6:02, 0 users, load average: 5.08, 2.00, 1.42
top - HH:MM:SS up 7 days, 6:03, 0 users, load average: 55.98, 16.68, 6.48
top - HH:MM:SS up 7 days, 6:04, 0 users, load average: 141.78, 49.64, 18.61
top - HH:MM:SS up 7 days, 6:05, 0 users, load average: 200.18, 80.05, 30.60
top - HH:MM:SS up 7 days, 6:06, 0 users, load average: 265.76, 121.19, 47.86
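
When the top headers have been captured to a file, the samples that exceed the threshold can be picked out with a short filter. This is only an illustrative sketch: the function name flag_high_load is hypothetical, the default of 100 is taken from the threshold discussed in this article, and the header format is assumed to match the sample above.

```shell
# Sketch: print top header lines whose 1-minute load average exceeds a limit.
# Reads lines on stdin; flag_high_load is a hypothetical name.
flag_high_load() {
    # $1: threshold for the 1-minute load average (defaults to 100).
    awk -v limit="${1:-100}" '
        /load average:/ {
            line = $0
            rest = $0
            # Keep only "1min, 5min, 15min" and split it on the commas.
            sub(/.*load average: */, "", rest)
            split(rest, avg, /, */)
            # Compare numerically and print the original header line.
            if (avg[1] + 0 > limit + 0) print line
        }'
}
```

Applied to the six sample lines above, the filter would print the last three, where the 1-minute value is 141.78, 200.18, and 265.76. A different threshold can be passed as the first argument, for example flag_high_load 50.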

Environment

VMware NSX

VMware vSphere ESXi

Cause

A load average of more than 100 is critically high for a Guest Operating System. Applications cannot function reliably under such extreme contention. Negative available CPU capacity on the ESXi host indicates that the NSX Manager was completely starved of CPU cycles, leading to the services going down.

The Virtual Machine CPU usage alarm triggered on vSphere Client for NSX Manager Virtual Machine is a trailing symptom of the host's resource exhaustion.

NSX Manager services go down because the ESXi host is unable to fulfill the resource demands of the NSX Manager Virtual Machine.

Resolution

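At the time of the issue, collect the following information, open a Broadcom Support case, and select the product VMware vSphere ESXi:

ESXi Support Bundle from the host where the affected virtual machine is running.
esxtop data, collected by running the following command on that ESXi host. The command runs esxtop in batch mode with all counters and takes 150 samples at 2-second intervals, which is about 5 minutes of data: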

esxtop -b -a -d 2 -n 150 | gzip -9c > /vmfs/volumes/datastore_name/esxtop-Hostname.csv.gz

NSX Manager Support Bundle.

To return the NSX Manager virtual machine to a functional state, vMotion it to another ESXi host that has sufficient free resources, then reboot the VM.