NSX Manager is inaccessible via UI on NSX 4.1.2.x versions

Products

VMware NSX

Issue/Introduction

The environment is running VMware NSX 4.1.2.x.
The NSX UI is inaccessible using either the VIP or the individual NSX Manager IP addresses.
NSX Managers respond to ping and are accessible via SSH.
NSXCLI commands may return "internal error".

All services on all NSX manager nodes are 'down'.

/var/log/cbm/cbm.log

ERROR HeartbeatServiceServiceMonitorStatusUpdaterThread ServiceMonitor 84198 - [nsx@6876 comp="nsx-manager" errorCode="HBS153" level="ERROR" s2comp="service-monitor" subcomp="cbm"] One or more services are down: [Epoch: 26]SEARCH:DOWN,AR:DOWN,PROTON:DOWN,CLUSTER_MANAGER:DOWN,MONITORING:UP,CM_INV:DOWN,CONTROLLER:DOWN,IDPS_REPORTING:DOWN,MESSAGING_MANAGER:DOWN,SM:DOWN,HTTP:DOWN
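To see at a glance which services a status line like the one above marks as down, the comma-separated NAME:STATE pairs after the "[Epoch: N]" marker can be split programmatically. The following is a minimal sketch, assuming the line format matches the excerpt shown here; the function names are hypothetical helpers and not part of NSX.

```python
import re


def parse_service_states(log_line):
    """Return {service: state} parsed from an HBS153 status line.

    Assumes the NAME:STATE pairs follow an "[Epoch: N]" marker,
    as in the excerpt above. Returns {} if no marker is found.
    """
    match = re.search(r"\[Epoch:\s*\d+\](.*)$", log_line)
    if not match:
        return {}
    states = {}
    for pair in match.group(1).split(","):
        name, sep, state = pair.strip().partition(":")
        if sep:  # skip fragments that are not NAME:STATE
            states[name] = state
    return states


def down_services(log_line):
    """Return the sorted names of services reported as DOWN."""
    states = parse_service_states(log_line)
    return sorted(name for name, state in states.items() if state == "DOWN")
```

For the line shown above, `down_services` would list every service except MONITORING, which matches the symptom that nearly all manager services are down.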

NSX Manager services (e.g. proton, cluster-boot-manager) are disconnected from the Corfu database, and the connection retry counter is incrementing.
NSX Manager services may also restart before attempting to reconnect to the Corfu Database.

Proton: /var/log/proton/nsxapi.log

INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 44 times, SystemDownHandlerTriggerLimit = 60
INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 45 times, SystemDownHandlerTriggerLimit = 60
INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 46 times, SystemDownHandlerTriggerLimit = 60
...
WARN org.corfudb.runtime.collections.streaming.StreamPollingScheduler-worker-3 DataStoreDisconnectHandler 85708 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Disconnected from the database, restarting the service

Cluster-Boot-Manager (CBM): /var/log/cbm/cbm.log

INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 3 times, SystemDownHandlerTriggerLimit = 90
INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 4 times, SystemDownHandlerTriggerLimit = 90
…
INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 61 times, SystemDownHandlerTriggerLimit = 90
INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 62 times, SystemDownHandlerTriggerLimit = 90
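The retry counter in the layoutHelper lines above can be tracked with a short script to see how close a service is to its SystemDownHandlerTriggerLimit. This is a sketch only, assuming the message text matches the excerpts in this article; `latest_retry` is a hypothetical helper name.

```python
import re

# Matches the layoutHelper retry message as it appears in the excerpts above.
RETRY_RE = re.compile(
    r"layoutHelper: Retried (\d+) times, SystemDownHandlerTriggerLimit = (\d+)"
)


def latest_retry(lines):
    """Return (retries, limit) from the last matching line, or None."""
    latest = None
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            latest = (int(match.group(1)), int(match.group(2)))
    return latest
```

A possible use is to pass the open log file directly, for example `latest_retry(open("/var/log/cbm/cbm.log", errors="replace"))`, and compare the two numbers across successive runs to confirm that the counter is still climbing.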

The corfu-server log reports "SERVER_ERROR", which explains why NSX Manager services are unable to connect to the Corfu database:

/var/log/corfu/corfu.9000.log
```
WARN | client-1 | o.c.r.c.ClientResponseHandler | Server threw exception for SERVER_ERROR with request_id: 1714247
```

Runtime exceptions caused by timeouts are logged during Corfu-related tasks.

/var/log/corfu/corfu-compactor-leader.log:

ERROR | Cmpt-9000-chkpter | compactor-leader | Exception in runOrchestrator():
java.lang.RuntimeException: java.util.concurrent.TimeoutException
at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:71)
at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:105)

NSX Manager services may report 'Out of Memory' errors:

Cluster-Boot-Manager (CBM): /var/log/cbm/tanuki.log

INFO   | jvm 1037 | 2024/08/27 00:51:02 | java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper  | 2024/08/27 00:51:02 | The JVM has run out of memory.  Requesting thread dump.
STATUS | wrapper  | 2024/08/27 00:51:02 | Dumping JVM state.
STATUS | wrapper  | 2024/08/27 00:51:02 | The JVM has run out of memory.  Restarting JVM.
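To check whether a tanuki.log contains the OutOfMemoryError signature shown above, and when it occurred, the pipe-separated lines can be scanned as follows. This is a minimal sketch: it assumes each line has four "|"-separated columns in the order seen in the excerpt (level, source, timestamp, message), and `oom_events` is a hypothetical helper name.

```python
def oom_events(lines):
    """Return (timestamp, message) pairs for lines reporting an OutOfMemoryError.

    Assumes the four-column "level | source | timestamp | message" layout
    seen in the excerpt above; lines with fewer columns are ignored.
    """
    events = []
    for line in lines:
        # Split on the first three pipes only, so the message may contain "|".
        fields = [field.strip() for field in line.split("|", 3)]
        if len(fields) == 4 and "java.lang.OutOfMemoryError" in fields[3]:
            events.append((fields[2], fields[3]))
    return events
```

Running it over the tanuki.log of each affected service gives a quick timeline of out-of-memory events to correlate with the Corfu disconnect messages.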

Environment

VMware NSX

Cause

If internal corfu-runtime threads are unable to communicate with corfu-server and those threads are not cleaned up properly, this can bring the Corfu database down and eventually lead to Out Of Memory conditions. Because all NSX Manager services (e.g. Proton, CBM) depend on Corfu database connectivity, they eventually go down as well.

Resolution

This issue is resolved in VMware NSX 4.2, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

When the issue occurs, reboot all three NSX Managers, or restart the corfu-server service on each of them (from the root shell: /etc/init.d/corfu-server restart).

Additional Information

VMware NSX 4.2.0 Release Notes