Some appliance components are not functioning properly.
Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP.
Error code: 101
cbm_oom.hprof dump files can be found in the /image/core directory on all three NSX Managers:
-rw------- 1 nsx-cbm nsx-cbm 250M Jan 1 hr:mn cbm_oom.hprof
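To confirm the presence and size of the heap dump on each NSX Manager, a listing such as the following can be run as the root user (a simple check based on the path above; the exact file name may vary):
ls -lh /image/core/*.hprof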
The following log entries, indicating the CBM service is running out of memory, can be seen in the NSX Manager log /var/log/cbm/cbm.log:
WARN DistributedLockMonitorThread DistributedLockMonitorImpl 74346 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="distributed-lock-monitor" subcomp="cbm"] Exception in distributed lock monitor java.lang.OutOfMemoryError: Java heap space
WARN GmleRpcService:worker-0 SingleThreadEventExecutor 2721271 Unexpected exception from an event executor:java.lang.OutOfMemoryError: Java heap space
ERROR ClusteringRpcServer-Heartbeat-Thread1 HeartbeatServiceImpl 2682901 - [nsx@6876 comp="nsx-manager" errorCode="HBS101" level="ERROR" s2comp="heartbeat-service" subcomp="cbm"] RPC failed on method UpdateHeartbeat.
java.lang.OutOfMemoryError: Java heap space
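As a quick check, these entries can be searched for as the root user with a command such as the following (a sketch; rotated copies of cbm.log may be compressed and would need zgrep instead):
grep -i "java.lang.OutOfMemoryError" /var/log/cbm/cbm.log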
The Corfu compactor service, which typically runs every 15 minutes, has not run for a long time. To confirm when it last completed:
Log in to the NSX Manager as the root user and review the last messages in /var/log/corfu/corfu-compactor-audit.log. Check the timestamp of when the Corfu compactor last finished:
<Timestamp> | INFO | Cmpt-chkpter-9000 | o.c.c.CompactorCheckpointer | Exiting CorfuStoreCompactor
<Timestamp> INFO Runner - Finished running corfu compactor tool.
The command below lists the timestamps of every occurrence of the above entries in the logs, which makes it easier to identify when the compactor service stopped running:
grep -ihE "Exiting CorfuStoreCompactor|Finished running corfu compactor tool" /var/log/corfu/corfu-compactor-audit* | cut -d':' -f 1,2 | sort | uniq
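To show only the most recent completion time, the same pipeline can be extended with tail (a minor variation on the command above):
grep -ihE "Exiting CorfuStoreCompactor|Finished running corfu compactor tool" /var/log/corfu/corfu-compactor-audit* | cut -d':' -f 1,2 | sort | uniq | tail -n 1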
Log lines similar to the following are seen on the NSX Manager in /var/log/corfu/tanuki.log; in the Corfu thread dump, they indicate that the compactor leader is hung at 'ForkJoinPool':
INFO | jvm 1 | <Timestamp> | "Cmpt-9000-chkpter" #51 prio=5 os_prio=0 cpu=4993463.30ms elapsed=28943674.48s tid=0x000065877c0c7800 nid=0x129de in Object.wait() [0x000065875dfa5000]
INFO | jvm 1 | <Timestamp> | java.lang.Thread.State: WAITING (on object monitor)
INFO | jvm 1 | <Timestamp> | at java.lang.Object.wait(java.base@<version>/Native Method)
INFO | jvm 1 | <Timestamp> | - waiting on <no object reference available>
INFO | jvm 1 | <Timestamp> | at java.util.concurrent.ForkJoinTask.externalAwaitDone(java.base@<version>/Unknown Source)
INFO | jvm 1 | <Timestamp> | - waiting to re-lock in wait() <0x000065886e4cfaf0> (a java.util.concurrent.ForkJoinTask$AdaptedCallable)
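To locate this section of the thread dump, the thread name from the excerpt above can be searched for directly (the name 'Cmpt-9000-chkpter' is taken from this example and may differ between versions):
grep -A 6 "Cmpt-9000-chkpter" /var/log/corfu/tanuki.log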
Following the steps outlined in Troubleshooting NSX Datastore (CorfuDB) Issues to check the /config/corfu/LAYOUT_CURRENT.ds Epoch number across all three NSX-T Managers shows identical output on all nodes.
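As a quick comparison, the Epoch value can be read from the layout file on each Manager with a command such as the following (this assumes the JSON-style 'epoch' field found in Corfu layout files; refer to the article above for the full procedure):
grep -oE '"epoch": *[0-9]+' /config/corfu/LAYOUT_CURRENT.ds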
On the NSX Manager, as the root user, running df -h shows the /config partition usage is above 1%, and in most cases well above it, often more than 10%.
Historical disk space statistics can be found in /var/log/stats/sys_disk.stats on the NSX Manager(s); these can be used to track the usage of /config and see when it grew above 1%.
The % usage can be listed in descending order by running the command below as the root user on the NSX Manager(s):
grep -iE "\/dev\/mapper\/nsx-config .*%.*\/config" /var/log/stats/sys_disk.stats | sort -r -k 5
Due to a known JDK issue (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow), if there is no Corfu compactor leader change for a long time, the compactor leader may become unresponsive and stop triggering compaction cycles unless the Corfu server is restarted. This leads to the CBM service running out of memory and crashing.
For resolution, see NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow.
To work around this issue, perform a rolling reboot of the NSX Managers, as per the KB NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow.
Then monitor the /config partition using the df -h command as the root user. The compaction process may take some time; allow it to run for 24 hours.
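As an optional aid, usage can be recorded at intervals with a simple loop such as the one below (a sketch; the 10-minute interval and the /tmp/config_usage.log output file are arbitrary choices):
while true; do date >> /tmp/config_usage.log; df -h /config >> /tmp/config_usage.log; sleep 600; done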
If /config usage drops to around 1% across all nodes and the UI is accessible again, run 'get cluster status' as the admin user and confirm all clustered services are in an UP state.
If after 24 hours /config usage has not decreased, or has increased, open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.