NSX Manager is inaccessible via UI on NSX 4.1.2.x versions

Article ID: 377530

Products

VMware NSX, VMware NSX Networking

Issue/Introduction

  • The environment is running VMware NSX 4.1.2.x.
  • The NSX UI is inaccessible through both the cluster VIP and the individual NSX Manager IP addresses.
  • The NSX Managers are still reachable on the network via ping and SSH.
  • NSXCLI commands may return "internal error".
  • At the time of the issue, all services on all NSX Manager nodes report as down.

/var/log/cbm/cbm.log:

2024-08-27T00:31:25.224Z ERROR HeartbeatServiceServiceMonitorStatusUpdaterThread ServiceMonitor 84198 - [nsx@6876 comp="nsx-manager" errorCode="HBS153" level="ERROR" s2comp="service-monitor" subcomp="cbm"] One or more services are down: [Epoch: 26]SEARCH:DOWN,AR:DOWN,PROTON:DOWN,CLUSTER_MANAGER:DOWN,MONITORING:UP,CM_INV:DOWN,CONTROLLER:DOWN,IDPS_REPORTING:DOWN,MESSAGING_MANAGER:DOWN,SM:DOWN,HTTP:DOWN
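
When several nodes need checking, the service list in the HBS153 entry can be split into per-service states. The following is a minimal Python sketch and not an NSX tool: the function name parse_service_status and the regular expression are assumptions derived only from the sample line above, so other message layouts may not match.

```python
import re

# Message portion of the HBS153 entry shown above.
SAMPLE = (
    "One or more services are down: [Epoch: 26]SEARCH:DOWN,AR:DOWN,PROTON:DOWN,"
    "CLUSTER_MANAGER:DOWN,MONITORING:UP,CM_INV:DOWN,CONTROLLER:DOWN,"
    "IDPS_REPORTING:DOWN,MESSAGING_MANAGER:DOWN,SM:DOWN,HTTP:DOWN"
)

# Assumed layout: "[Epoch: N]" followed by comma-separated NAME:STATE pairs.
STATUS_PATTERN = re.compile(r"One or more services are down: \[Epoch: (\d+)\](.*)$")


def parse_service_status(line):
    """Return (epoch, {service: state}) for an HBS153 line, or None if no match."""
    match = STATUS_PATTERN.search(line)
    if match is None:
        return None
    services = {}
    for item in match.group(2).split(","):
        name, _, state = item.strip().partition(":")
        if name:
            services[name] = state
    return int(match.group(1)), services


if __name__ == "__main__":
    epoch, services = parse_service_status(SAMPLE)
    down = sorted(name for name, state in services.items() if state == "DOWN")
    print(f"epoch {epoch}: {len(down)} of {len(services)} services down")
    print(", ".join(down))
```

In the sample entry only MONITORING reports UP, which matches the symptom that nearly every service is down at the same time.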
  • Internal services on NSX Manager that depend on Corfu (the database), such as proton and cluster-boot-manager, report disconnections from Corfu. The "retry" counter increases as each service tries to reconnect, and a service may restart itself to retry the connection.

Example for proton:

/var/log/proton/nsxapi.log

2024-08-27T00:58:50.779Z INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 44 times, SystemDownHandlerTriggerLimit = 60
2024-08-27T00:59:06.820Z INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 45 times, SystemDownHandlerTriggerLimit = 60
2024-08-27T00:59:22.862Z INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 46 times, SystemDownHandlerTriggerLimit = 60
...
2024-08-27T00:45:59.862Z WARN org.corfudb.runtime.collections.streaming.StreamPollingScheduler-worker-3 DataStoreDisconnectHandler 85708 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Disconnected from the database, restarting the service

Similar entries are reported for Cluster-Boot-Manager (CBM).

/var/log/cbm/cbm.log

2024-08-27T00:31:12.672Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 3 times, SystemDownHandlerTriggerLimit = 90
2024-08-27T00:31:28.709Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 4 times, SystemDownHandlerTriggerLimit = 90

2024-08-27T00:46:43.122Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 61 times, SystemDownHandlerTriggerLimit = 90
2024-08-27T00:46:59.156Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 62 times, SystemDownHandlerTriggerLimit = 90
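
To see how far the retry counter has climbed in a log file, the "Retried N times" entries can be extracted. This is a minimal Python sketch; the function name retry_progress and the pattern are assumptions based on the excerpts above. The sketch only reports the two numbers as logged. It does not assume what happens when the counter reaches SystemDownHandlerTriggerLimit, because the article does not say.

```python
import re

# Lines copied from the cbm.log excerpt above.
SAMPLE_LINES = [
    "2024-08-27T00:31:12.672Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 3 times, SystemDownHandlerTriggerLimit = 90",
    "2024-08-27T00:31:28.709Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 4 times, SystemDownHandlerTriggerLimit = 90",
    "2024-08-27T00:46:43.122Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 61 times, SystemDownHandlerTriggerLimit = 90",
    "2024-08-27T00:46:59.156Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 62 times, SystemDownHandlerTriggerLimit = 90",
]

RETRY_PATTERN = re.compile(
    r"layoutHelper: Retried (\d+) times, SystemDownHandlerTriggerLimit = (\d+)"
)


def retry_progress(lines):
    """Return a list of (retries, limit) tuples, one per matching line, in order."""
    progress = []
    for line in lines:
        match = RETRY_PATTERN.search(line)
        if match:
            progress.append((int(match.group(1)), int(match.group(2))))
    return progress


if __name__ == "__main__":
    for retries, limit in retry_progress(SAMPLE_LINES):
        print(f"retried {retries} times (limit {limit})")
```

A counter that keeps rising across entries, as in both excerpts, indicates that the service has not managed to reconnect to Corfu.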

 

  • corfu-server reports "SERVER_ERROR", which explains why the internal services (for example proton and cbm) cannot connect to it.

/var/log/corfu/corfu.9000.log

2024-08-27T00:31:23.517Z | WARN | client-1 | o.c.r.c.ClientResponseHandler | Server threw exception for SERVER_ERROR with request_id: 1714247

 

  • Runtime errors caused by timeouts are observed during Corfu-related tasks.

/var/log/corfu/corfu-compactor-leader.log:

2024-08-27T00:36:41.581Z | ERROR | Cmpt-9000-chkpter | compactor-leader | Exception in runOrchestrator():
    java.lang.RuntimeException: java.util.concurrent.TimeoutException
        at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:71)
        at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:105)

 

  • In some cases, the internal services (for example proton and cbm) eventually report Out of Memory errors.

Example for cbm:

/var/log/cbm/tanuki.log

INFO   | jvm 1037 | 2024/08/27 00:51:02 | java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper  | 2024/08/27 00:51:02 | The JVM has run out of memory.  Requesting thread dump.
STATUS | wrapper  | 2024/08/27 00:51:02 | Dumping JVM state.
STATUS | wrapper  | 2024/08/27 00:51:02 | The JVM has run out of memory.  Restarting JVM.
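
The log signatures listed in this section can be counted in one pass over collected log lines. The sketch below is a hypothetical Python helper, not an NSX utility. The search strings are copied verbatim from the excerpts above. The key names and the function name match_signatures are assumptions introduced for illustration.

```python
# Search strings taken verbatim from the log excerpts in this article.
SIGNATURES = {
    "services_down": 'errorCode="HBS153"',
    "database_disconnect": "Disconnected from the database, restarting the service",
    "corfu_server_error": "Server threw exception for SERVER_ERROR",
    "corfu_timeout": "java.util.concurrent.TimeoutException",
    "jvm_out_of_memory": "java.lang.OutOfMemoryError: Java heap space",
}


def match_signatures(lines):
    """Count how many lines contain each known signature substring."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "INFO   | jvm 1037 | 2024/08/27 00:51:02 | java.lang.OutOfMemoryError: Java heap space",
        "2024-08-27T00:31:23.517Z | WARN | client-1 | o.c.r.c.ClientResponseHandler | "
        "Server threw exception for SERVER_ERROR with request_id: 1714247",
    ]
    for name, count in match_signatures(sample).items():
        print(f"{name}: {count}")
```

A match on a single signature is not conclusive on its own. The article describes the issue as a combination of these symptoms on NSX 4.1.2.x.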

 

Environment

VMware NSX 

Cause

In some scenarios, internal corfu-runtime instances and threads are unable to communicate with the corfu-server instance, and those threads are not cleaned up properly. This leads to Corfu going down and eventually to Out of Memory conditions. Because the other services (for example Proton and CBM) depend on Corfu, they go down as well.

Resolution

This is a known issue in NSX 4.1.2.x. The fix is included in NSX 4.2.0 GA and later versions.
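
The version statement above can be expressed as a small check, for example when auditing an inventory of NSX deployments. This is a minimal Python sketch under stated assumptions: the function name classify_nsx_version and the result labels are hypothetical, and the version is assumed to be a dotted numeric string such as "4.1.2.3". Releases that the article does not mention are reported as "not covered" rather than guessed.

```python
def classify_nsx_version(version):
    """Classify a dotted NSX version string against the ranges named in this article.

    Returns "affected" for 4.1.2.x, "fixed" for 4.2.0 and later,
    "not covered" for other releases, and "unknown" if the string cannot be parsed.
    """
    parts = version.strip().split(".")
    try:
        # Only the first three components matter; fewer than three also raises ValueError.
        major, minor, patch = (int(part) for part in parts[:3])
    except ValueError:
        return "unknown"
    if (major, minor, patch) == (4, 1, 2):
        return "affected"
    if (major, minor, patch) >= (4, 2, 0):
        return "fixed"
    return "not covered"


if __name__ == "__main__":
    for candidate in ["4.1.2.3", "4.2.0", "4.1.1", "not-a-version"]:
        print(candidate, "->", classify_nsx_version(candidate))
```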

Workaround:

  • When the issue occurs, reboot all three NSX Managers, or restart the corfu-server service on all three of them (from the root shell: /etc/init.d/corfu-server restart).

Additional Information

See the NSX 4.2.0 Release Notes.