When you encounter this issue, run 'get cluster status' from the admin CLI to check service status. In /var/log/cbm/tanuki.log, you should see the following log lines for the JVM in cbm that is in charge of compaction running out of memory:

tanuki.log.10:11457:STATUS | wrapper | 2025/04/19 17:30:23 | The JVM has run out of memory. Requesting thread dump.
tanuki.log.10:11459:STATUS | wrapper | 2025/04/19 17:30:23 | The JVM has run out of memory. Restart JVM (Ignoring, already restarting).
tanuki.log.10:13289:STATUS | wrapper | 2025/04/19 17:30:49 | The JVM has run out of memory. Requesting thread dump.
tanuki.log.10:13291:STATUS | wrapper | 2025/04/19 17:30:49 | The JVM has run out of memory. Restarting JVM.
tanuki.log.10:14986:STATUS | wrapper | 2025/04/19 17:30:53 | The JVM has run out of memory. Requesting thread dump.
tanuki.log.10:14988:STATUS | wrapper | 2025/04/19 17:30:53 | The JVM has run out of memory. Restart JVM (Ignoring, already restarting).
tanuki.log.10:16753:STATUS | wrapper | 2025/04/19 17:30:53 | The JVM has run out of memory. Requesting thread dump.
tanuki.log.10:16755:STATUS | wrapper | 2025/04/19 17:30:53 | The JVM has run out of memory. Restart JVM (Ignoring, already restarting).
tanuki.log.10:18543:STATUS | wrapper | 2025/04/19 17:31:25 | The JVM has run out of memory. Requesting thread dump.
tanuki.log.10:18545:STATUS | wrapper | 2025/04/19 17:31:25 | The JVM has run out of memory. Restarting JVM.
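One quick way to confirm the symptom is to count the wrapper's out-of-memory events across the rotated tanuki logs. This is a minimal sketch run against an inline sample mirroring the excerpt above so it is self-contained; on a live NSX Manager you would point the same grep at /var/log/cbm/tanuki.log* instead.

```shell
# Sketch: count "JVM has run out of memory" events reported by the Tanuki
# wrapper. $tmp stands in for /var/log/cbm/tanuki.log* on a real manager.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
STATUS | wrapper | 2025/04/19 17:30:23 | The JVM has run out of memory. Requesting thread dump.
STATUS | wrapper | 2025/04/19 17:30:49 | The JVM has run out of memory. Restarting JVM.
STATUS | wrapper | 2025/04/19 17:31:00 | Launching a JVM...
EOF
oom_count=$(grep -c "The JVM has run out of memory" "$tmp")
echo "OOM events: $oom_count"
rm -f "$tmp"
```

A large or steadily growing count across the rotated logs indicates the cbm JVM is in a crash loop rather than a one-off restart.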
In /var/log/syslog, you can see this as well with the log line below:

<Time Stamp> NSX 19643 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP100" level="ERROR" subcomp="cbm"] Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space

Under /var/log/cbm/cbm.log, you will see reports of services being down, even if they report as up when you run 'get cluster status' as admin:
<Time Stamp> ERROR HeartbeatServiceServiceMonitorStatusUpdaterThread ServiceMonitor 92085 - [nsx@6876 comp="nsx-manager" errorCode="HBS153" level="ERROR" s2comp="service-monitor" subcomp="cbm"] One or more services are down: [Epoch:2]CLUSTER_MANAGER:UNKNOWN,SM:DOWN,MONITORING:DOWN,AR:DOWN,MESSAGING_MANAGER:DOWN,PROTON:DOWN,CONTROLLER:DOWN,IDPS_REPORTING:DOWN,SEARCH:DOWN,CM_INV:DOWN,HTTP:DOWN

Depending on how long the service has been crashing, you might also see CBM core dumps under /image/core.

Check /var/log/corfu/corfu-compactor-audit.log and /var/log/corfu/corfu-compactor-leader.log to see if compaction is still running. If compaction is not running gracefully, reboot all three NSX managers, and compaction will restart once the managers are back up.
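The HBS153 line packs all per-service states into one comma-separated list, which is awkward to read at a glance. A hedged sketch for extracting just the DOWN services, demonstrated on a sample line mirroring the excerpt above; on a live manager you would feed it the latest matching line, e.g. from grep HBS153 /var/log/cbm/cbm.log:

```shell
# Sketch: list the services reported DOWN in the HBS153 heartbeat line.
# $line stands in for the latest HBS153 entry from /var/log/cbm/cbm.log.
line='One or more services are down: [Epoch:2]CLUSTER_MANAGER:UNKNOWN,SM:DOWN,MONITORING:DOWN,AR:DOWN,MESSAGING_MANAGER:DOWN,PROTON:DOWN,CONTROLLER:DOWN,IDPS_REPORTING:DOWN,SEARCH:DOWN,CM_INV:DOWN,HTTP:DOWN'
# grep -o prints each SERVICE:DOWN token on its own line (UNKNOWN is excluded).
down_services=$(printf '%s\n' "$line" | grep -o '[A-Z_]*:DOWN')
down_count=$(printf '%s\n' "$down_services" | wc -l)
printf '%s\n' "$down_services"
echo "services down: $down_count"
```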
If the environment experiences prolonged spikes in network or storage latency, or spikes in CPU usage on the host during the compaction process, compaction can take longer to complete, causing the cbm service to run out of memory. Restarting the managers clears memory and kicks off a new compaction request.
Wait until the compaction process completes after the reboot.
You can observe this process by tailing either /var/log/corfu/corfu-compactor-leader.log or /var/log/corfu/corfu-compactor-audit.log.
Depending on the size of the environment, this can take a while.

/var/log/corfu/corfu-compactor-leader.log completion:
2025-04-18T17:29:29.771Z | INFO | Cmpt-9000-chkpter | compactor-leader | DynamicTriggerPolicy: Trigger as elapsedTime 902 > safeTrimPeriod 900
2025-04-18T17:29:29.898Z | INFO | Cmpt-9000-chkpter | compactor-leader | Trim completed, elapsed(0s), log address up to 2989733883.
2025-04-18T17:29:29.898Z | INFO | Cmpt-9000-chkpter | compactor-leader | =============Initiating Distributed Compaction============
2025-04-18T17:29:29.978Z | INFO | Cmpt-9000-chkpter | compactor-leader | Init compaction cycle is successful. Min token 2989778336
2025-04-19T17:49:04.782Z | INFO | CorfuServer-shutdown-4 | compactor-leader | Compactor Orchestrator service shutting down.
2025-04-19T17:52:11.981Z | INFO | initializationTaskThread | compactor-leader | Starting Compaction service...
2025-04-19T17:52:22.203Z | INFO | Cmpt-9000-chkpter | compactor-leader | getNewCorfuRuntime: Corfu Runtime connected successfully
2025-04-19T17:53:10.733Z | INFO | Cmpt-9000-chkpter | compactor-leader | invokeCheckpointing: hostName: (NSX Manager IPs), port: 9000
2025-04-19T17:53:10.757Z | INFO | Cmpt-9000-chkpter | compactor-leader | Triggered compactor jvm
2025-04-19T18:00:17.268Z | INFO | Cmpt-9000-chkpter | compactor-leader | Shutting down existing checkpointer jvm
2025-04-19T18:29:15.257Z | ERROR | Thread-6 | compactor-leader | Exception occurred while getting ErrorStream:
/var/log/corfu/corfu-compactor-audit.log completion:
2025-04-19T19:14:20.148Z | INFO | Cmpt-chkpter-9000 | org.corfudb.util.FileWatcher | Closed FileWatcher.
2025-04-19T19:14:20.148Z | INFO | FileWatcher-0 | org.corfudb.util.FileWatcher | FileWatcher failed to poll file /config/cluster-manager/corfu/private/keystore.jks, Exception: java.nio.file.ClosedWatchServiceException., isStopped: true
2025-04-19T19:14:20.148Z | INFO | FileWatcher-0 | org.corfudb.util.FileWatcher | Watch service is stopped. Skip reloading new watch service.
2025-04-19T19:14:20.150Z | WARN | netty-0 | o.c.r.c.NettyClientRouter | userEventTriggered: unhandled event SslCloseCompletionEvent(java.nio.channels.ClosedChannelException)
2025-04-19T19:14:20.151Z | WARN | netty-2 | o.c.r.c.NettyClientRouter | userEventTriggered: unhandled event SslCloseCompletionEvent(java.nio.channels.ClosedChannelException)
2025-04-19T19:14:20.151Z | WARN | netty-1 | o.c.r.c.NettyClientRouter | userEventTriggered: unhandled event SslCloseCompletionEvent(java.nio.channels.ClosedChannelException)
2025-04-19T19:14:20.160Z | INFO | Cmpt-chkpter-9000 | o.c.c.CompactorCheckpointer | Exiting CorfuStoreCompactor
2025-04-19T19:14:20.274Z INFO Runner - Finished running corfu compactor tool.
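Rather than watching the tail by eye, the completion marker shown in the audit-log excerpt above can be checked with a grep. This is a hedged sketch against a sample file so it is self-contained; on a live manager you would point the same grep at /var/log/corfu/corfu-compactor-audit.log.

```shell
# Sketch: check whether the compactor has finished a cycle. $log stands in
# for /var/log/corfu/corfu-compactor-audit.log on a real manager; the marker
# string is taken from the audit-log excerpt above.
log=$(mktemp)
cat > "$log" <<'EOF'
2025-04-19T19:14:20.160Z | INFO | Cmpt-chkpter-9000 | o.c.c.CompactorCheckpointer | Exiting CorfuStoreCompactor
2025-04-19T19:14:20.274Z INFO Runner - Finished running corfu compactor tool.
EOF
if grep -q "Finished running corfu compactor tool" "$log"; then
  compaction_status="complete"
else
  compaction_status="running"
fi
echo "compaction: $compaction_status"
rm -f "$log"
```

To wait for completion on a live system, the same grep can be polled in an until loop with a sleep between checks.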
If you believe you have encountered this issue, open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.