/var/log/cbm/cbm.log:
2024-08-27T00:31:25.224Z ERROR HeartbeatServiceServiceMonitorStatusUpdaterThread ServiceMonitor 84198 - [nsx@6876 comp="nsx-manager" errorCode="HBS153" level="ERROR" s2comp="service-monitor" subcomp="cbm"] One or more services are down: [Epoch: 26]SEARCH:DOWN,AR:DOWN,PROTON:DOWN,CLUSTER_MANAGER:DOWN,MONITORING:UP,CM_INV:DOWN,CONTROLLER:DOWN,IDPS_REPORTING:DOWN,MESSAGING_MANAGER:DOWN,SM:DOWN,HTTP:DOWN
Example: for Proton
/var/log/proton/nsxapi.log
2024-08-27T00:58:50.779Z INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 44 times, SystemDownHandlerTriggerLimit = 60
2024-08-27T00:59:06.820Z INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 45 times, SystemDownHandlerTriggerLimit = 60
2024-08-27T00:59:22.862Z INFO WrapperStartStopAppMain AbstractView 3904309 layoutHelper: Retried 46 times, SystemDownHandlerTriggerLimit = 60
...
2024-08-27T00:45:59.862Z WARN org.corfudb.runtime.collections.streaming.StreamPollingScheduler-worker-3 DataStoreDisconnectHandler 85708 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Disconnected from the database, restarting the service
Similar observations are reported for Cluster-Boot-Manager (CBM):
/var/log/cbm/cbm.log
2024-08-27T00:31:12.672Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 3 times, SystemDownHandlerTriggerLimit = 90
2024-08-27T00:31:28.709Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 4 times, SystemDownHandlerTriggerLimit = 90
…
2024-08-27T00:46:43.122Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 61 times, SystemDownHandlerTriggerLimit = 90
2024-08-27T00:46:59.156Z INFO DistributedLockMonitorThread AbstractView 84198 layoutHelper: Retried 62 times, SystemDownHandlerTriggerLimit = 90
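To see how close a service is to its trigger limit, the retry counters in the excerpts above can be pulled out with a small shell sketch. The function name `max_retry` is hypothetical and not part of NSX; the sketch assumes only the `layoutHelper: Retried N times, SystemDownHandlerTriggerLimit = M` line format shown above.

```shell
# Hypothetical helper (not an NSX tool): print the highest retry count
# and its trigger limit found in a log file, as "<retries> <limit>".
# Assumes lines of the form shown in the excerpts above:
#   layoutHelper: Retried <N> times, SystemDownHandlerTriggerLimit = <M>
max_retry() {
    sed -n 's/.*layoutHelper: Retried \([0-9][0-9]*\) times, SystemDownHandlerTriggerLimit = \([0-9][0-9]*\).*/\1 \2/p' "$1" \
        | sort -n \
        | tail -n 1
}

# Example invocation against the CBM log named above:
#   max_retry /var/log/cbm/cbm.log
```

A retry count that keeps climbing toward the limit across successive runs is consistent with the service being unable to reach Corfu.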
/var/log/corfu/corfu.9000.log
2024-08-27T00:31:23.517Z | WARN | client-1 | o.c.r.c.ClientResponseHandler | Server threw exception for SERVER_ERROR with request_id: 1714247
/var/log/corfu/corfu-compactor-leader.log:
2024-08-27T00:36:41.581Z | ERROR | Cmpt-9000-chkpter | compactor-leader | Exception in runOrchestrator():
java.lang.RuntimeException: java.util.concurrent.TimeoutException
at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:71)
at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:105)
Example: for CBM
/var/log/cbm/tanuki.log
INFO | jvm 1037 | 2024/08/27 00:51:02 | java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper | 2024/08/27 00:51:02 | The JVM has run out of memory. Requesting thread dump.
STATUS | wrapper | 2024/08/27 00:51:02 | Dumping JVM state.
STATUS | wrapper | 2024/08/27 00:51:02 | The JVM has run out of memory. Restarting JVM.
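The out-of-memory signature in the wrapper log above can be checked with a short shell sketch. The function name `count_oom` is hypothetical; it only counts lines containing the `java.lang.OutOfMemoryError` string shown in the excerpt.

```shell
# Hypothetical helper (not an NSX tool): count the lines in a wrapper log
# that record a JVM out-of-memory error. -F treats the pattern as a fixed
# string, so the dots are matched literally.
count_oom() {
    grep -cF 'java.lang.OutOfMemoryError' "$1"
}

# Example invocation against the CBM wrapper log named above:
#   count_oom /var/log/cbm/tanuki.log
```

A non-zero count alongside the retry and disconnect messages shown earlier matches the symptoms described in this article.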
VMware NSX
In some scenarios, internal corfu-runtime instances and threads are unable to communicate with the corfu-server instance, and those threads are not cleaned up properly. This causes Corfu to go down and eventually leads to out-of-memory conditions. Because the other services (for example, Proton and CBM) depend on Corfu, they eventually go down as well.
This is a known issue in the current NSX 4.1.2.x release. The fix is included in NSX 4.2.0 GA and later versions.
Workaround:
/etc/init.d/corfu-server restart
NSX 4.2.0 Release Notes