NSX Manager is inaccessible via UI on NSX 4.1.2.x versions

Article ID: 377530

Products

VMware NSX

Issue/Introduction

  • The environment is running VMware NSX 4.1.2.x.
  • The NSX UI is inaccessible via the VIP or the individual NSX Manager IP addresses.
  • NSX Managers respond to ping and are accessible via SSH.
  • NSXCLI commands may return "internal error".
  • All services on all NSX manager nodes are 'down'.

    /var/log/cbm/cbm.log
    ERROR HeartbeatServiceServiceMonitorStatusUpdaterThread ServiceMonitor 84198 - [nsx@6876 comp="nsx-manager" errorCode="HBS153" level="ERROR" s2comp="service-monitor" subcomp="cbm"] One or more services are down: [Epoch: 26]SEARCH:DOWN,AR:DOWN,PROTON:DOWN,CLUSTER_MANAGER:DOWN,MONITORING:UP,CM_INV:DOWN,CONTROLLER:DOWN,IDPS_REPORTING:DOWN,MESSAGING_MANAGER:DOWN,SM:DOWN,HTTP:DOWN
  • NSX Manager services (e.g. proton, cluster-boot-manager) are disconnected from the Corfu Database and the connection retry counter keeps incrementing.
    NSX Manager services may also restart before attempting to reconnect to the Corfu Database.

    Proton:  /var/log/proton/nsxapi.log
    INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 44 times, SystemDownHandlerTriggerLimit = 60
    INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 45 times, SystemDownHandlerTriggerLimit = 60
    INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 46 times, SystemDownHandlerTriggerLimit = 60
    ...
    WARN org.corfudb.runtime.collections.streaming.StreamPollingScheduler-worker-3 DataStoreDisconnectHandler 85708 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Disconnected from the database, restarting the service

    Cluster-Boot-Manager (CBM):  /var/log/cbm/cbm.log
    INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 3 times, SystemDownHandlerTriggerLimit = 90
    INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 4 times, SystemDownHandlerTriggerLimit = 90

    INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 61 times, SystemDownHandlerTriggerLimit = 90
    INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 62 times, SystemDownHandlerTriggerLimit = 90
     
  • 'corfu-server' reports "SERVER_ERROR", which explains why NSX Manager services are unable to connect to the Corfu Database:

    /var/log/corfu/corfu.9000.log
    WARN | client-1 | o.c.r.c.ClientResponseHandler | Server threw exception for SERVER_ERROR with request_id: 1714247
  • Runtime errors are observed due to timeouts during Corfu-related tasks:

    /var/log/corfu/corfu-compactor-leader.log:
    ERROR | Cmpt-9000-chkpter | compactor-leader | Exception in runOrchestrator():
    java.lang.RuntimeException: java.util.concurrent.TimeoutException
    at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:71)
    at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:105)
  • NSX Manager services may report 'Out of Memory' errors:

    Cluster-Boot-Manager (CBM):  /var/log/cbm/tanuki.log
    INFO   | jvm 1037 | 2024/08/27 00:51:02 | java.lang.OutOfMemoryError: Java heap space
    STATUS | wrapper  | 2024/08/27 00:51:02 | The JVM has run out of memory.  Requesting thread dump.
    STATUS | wrapper  | 2024/08/27 00:51:02 | Dumping JVM state.
    STATUS | wrapper  | 2024/08/27 00:51:02 | The JVM has run out of memory.  Restarting JVM.
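
The log signatures listed above can be checked in one pass from the root shell of an NSX Manager. The following is a minimal sketch, not a supported tool: the function names and the LOG_ROOT variable are illustrative assumptions, while the file paths and search strings are taken from the log excerpts in this article.

```shell
#!/usr/bin/env bash
# Sketch only: summarize the symptom signatures quoted in this article.
# LOG_ROOT is an illustrative variable (not an NSX setting); it defaults to /var/log.
LOG_ROOT="${LOG_ROOT:-/var/log}"

# Print how many lines of FILE contain the fixed string PATTERN (0 if FILE is unreadable).
count_matches() {
    local file="$1" pattern="$2"
    if [ -r "$file" ]; then
        # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
        grep -c -F -- "$pattern" "$file" || true
    else
        echo 0
    fi
}

# Read cbm.log lines on stdin; print the services marked DOWN in the latest HBS153 message.
list_down_services() {
    grep -F 'One or more services are down' | tail -n 1 \
        | LC_ALL=C grep -o '[A-Z_][A-Z_]*:DOWN' | sed 's/:DOWN$//'
}

# Read log lines on stdin; print the latest retry counter as "retries/limit".
last_retry_status() {
    sed -n 's|.*layoutHelper: Retried \([0-9][0-9]*\) times, SystemDownHandlerTriggerLimit = \([0-9][0-9]*\).*|\1/\2|p' \
        | tail -n 1
}

# Print one summary line per signature, plus details from cbm.log when it is readable.
report() {
    local cbm="$LOG_ROOT/cbm/cbm.log" proton="$LOG_ROOT/proton/nsxapi.log"
    echo "HBS153 service-down messages: $(count_matches "$cbm" 'errorCode="HBS153"')"
    echo "Proton database disconnects:  $(count_matches "$proton" 'Disconnected from the database')"
    echo "corfu-server SERVER_ERROR:    $(count_matches "$LOG_ROOT/corfu/corfu.9000.log" 'SERVER_ERROR')"
    echo "Compactor timeouts:           $(count_matches "$LOG_ROOT/corfu/corfu-compactor-leader.log" 'java.util.concurrent.TimeoutException')"
    echo "CBM OutOfMemoryError:         $(count_matches "$LOG_ROOT/cbm/tanuki.log" 'java.lang.OutOfMemoryError')"
    if [ -r "$cbm" ]; then
        echo "Services DOWN: $(list_down_services < "$cbm" | tr '\n' ' ')"
        echo "CBM retries:   $(last_retry_status < "$cbm")"
    fi
}

report
```

Non-zero counts on several of these lines, together with a retry counter approaching its SystemDownHandlerTriggerLimit, are consistent with the symptoms described here. A single match on its own is not conclusive.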

 

Environment

VMware NSX 

Cause

If internal corfu-runtime threads are unable to communicate with the corfu-server and are not cleaned up properly, this can bring the Corfu database down and eventually lead to Out Of Memory conditions. Because all NSX Manager services (e.g. Proton, CBM) depend on Corfu database connectivity, they eventually go down as well.

Resolution

This issue is resolved in VMware NSX 4.2, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

 

Workaround:

  • When the issue is hit, reboot all 3 NSX Managers or restart the corfu-server service on all 3 of them (from the root shell: /etc/init.d/corfu-server restart).
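
If the restart is driven from a separate admin host, the workaround can be sketched as a loop over the three managers. The host names below are placeholders, and the sketch assumes root login over SSH is permitted on the managers, which may not be the case in every environment. Otherwise, run the restart command locally on each node. The MANAGERS and DRY_RUN variables are illustrative. Only the restart command itself comes from this article.

```shell
#!/usr/bin/env bash
# Sketch only. MANAGERS holds placeholder host names; replace them with the three NSX Manager nodes.
MANAGERS="${MANAGERS:-nsx-mgr-01.example.com nsx-mgr-02.example.com nsx-mgr-03.example.com}"
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

restart_corfu_on_all() {
    local host
    for host in $MANAGERS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh root@$host /etc/init.d/corfu-server restart"
        else
            # Assumes root SSH access to the manager; report a failure and continue with the next node.
            ssh "root@$host" /etc/init.d/corfu-server restart \
                || echo "restart failed on $host" >&2
        fi
    done
}

restart_corfu_on_all
```

Running the script as-is only prints the three commands. Set DRY_RUN=0 to execute them once the host list has been verified.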

Additional Information