Some appliance components are not functioning properly.
Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP.
Error code: 101
The 'get cluster status' command fails when it is run on the manager CLI:
<Manager Name> get cluster status
% An error occurred while getting the cluster status
cbm_oom.hprof dump files can be found in the /image/core directory on all three NSX Managers:
-rw------- 1 nsx-cbm nsx-cbm 250M Jan 1 hr:mn cbm_oom.hprof
The following log entries about the CBM service running out of memory can be seen in the NSX Manager log /var/log/cbm/cbm.log:
WARN DistributedLockMonitorThread DistributedLockMonitorImpl 74346 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="distributed-lock-monitor" subcomp="cbm"] Exception in distributed lock monitor java.lang.OutOfMemoryError: Java heap space
WARN GmleRpcService:worker-0 SingleThreadEventExecutor 2721271 Unexpected exception from an event executor:
java.lang.OutOfMemoryError: Java heap space
ERROR ClusteringRpcServer-Heartbeat-Thread1 HeartbeatServiceImpl 2682901 - [nsx@6876 comp="nsx-manager" errorCode="HBS101" level="ERROR" s2comp="heartbeat-service" subcomp="cbm"] RPC failed on method UpdateHeartbeat.
java.lang.OutOfMemoryError: Java heap space
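To gauge how often the CBM service has hit this condition, the matching entries can be counted. The following is a minimal sketch, not a vendor-provided tool: count_cbm_oom is a hypothetical helper name, and the default path is the log file named above.

```shell
#!/bin/sh
# count_cbm_oom is a hypothetical helper (not an NSX command).
# It prints how many lines of the given log contain the heap-space error.
count_cbm_oom() {
    log_file="${1:-/var/log/cbm/cbm.log}"
    # -F matches the text literally, -c prints the number of matching lines.
    # grep exits non-zero when the count is 0, so that status is ignored.
    grep -cF 'java.lang.OutOfMemoryError: Java heap space' "$log_file" || true
}
```

Running count_cbm_oom with no argument as the root user on each NSX Manager reads /var/log/cbm/cbm.log; a non-zero count is consistent with the symptom described above.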
The Corfu compactor service, which typically runs every 15 minutes, has not run for a long time. To confirm when it last completed:
Log in to the NSX Manager as the root user and review /var/log/corfu/corfu-compactor-audit.log to see the last messages. Check the timestamps to determine when the Corfu compactor last finished:
<Timestamp> | INFO | Cmpt-chkpter-9000 | o.c.c.CompactorCheckpointer | Exiting CorfuStoreCompactor
<Timestamp> INFO Runner - Finished running corfu compactor tool.
Running the command below lists the timestamps of every log entry that matches the messages above. This makes it easier to identify when the compactor service stopped running:
grep -ihE "Exiting CorfuStoreCompactor|Finished running corfu compactor tool" /var/log/corfu/corfu-compactor-audit* | cut -d':' -f 1,2 | sort| uniq
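If only the most recent completion is of interest, the same pipeline can be reduced to its last line. This is a sketch under one assumption: each log line begins with a timestamp that sorts chronologically as text (for example an ISO 8601 timestamp). The function name last_compactor_finish is hypothetical.

```shell
#!/bin/sh
# last_compactor_finish is a hypothetical helper (not an NSX command).
# It prints the latest minute-resolution timestamp at which the compactor
# logged one of its completion messages in the files passed as arguments.
last_compactor_finish() {
    grep -ihE 'Exiting CorfuStoreCompactor|Finished running corfu compactor tool' "$@" \
        | cut -d':' -f 1,2 \
        | sort \
        | uniq \
        | tail -n 1
}
```

For example, last_compactor_finish /var/log/corfu/corfu-compactor-audit* prints a single timestamp. A value that is much older than 15 minutes suggests that compaction cycles have stopped.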
Log lines similar to the following appear in /var/log/corfu/tanuki.log on the NSX Manager. In the Corfu thread dump, they indicate that the compactor leader is hung at 'Fork Join Pool':
INFO | jvm 1 | <Timestamp> | "Cmpt-9000-chkpter" #51 prio=5 os_prio=0 cpu=4993463.30ms elapsed=28943674.48s tid=0x000065877c0c7800 nid=0x129de in Object.wait() [0x000065875dfa5000]
INFO | jvm 1 | <Timestamp> | java.lang.Thread.State: WAITING (on object monitor)
INFO | jvm 1 | <Timestamp> | at java.lang.Object.wait(java.base@<version>/Native Method)
INFO | jvm 1 | <Timestamp> | - waiting on <no object reference available>
INFO | jvm 1 | <Timestamp> | at java.util.concurrent.ForkJoinTask.externalAwaitDone(java.base@<version>/Unknown Source)
INFO | jvm 1 | <Timestamp> | - waiting to re-lock in wait() <0x000065886e4cfaf0> (a java.util.concurrent.ForkJoinTask$AdaptedCallable)
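Searching a large tanuki.log for this pattern by eye is tedious, so the check can be scripted. The sketch below assumes that the thread is named "Cmpt-9000-chkpter" as in the sample above, and that each thread in the dump starts with a header line containing the quoted thread name followed by #<number>. The function name compactor_thread_hung is hypothetical.

```shell
#!/bin/sh
# compactor_thread_hung is a hypothetical helper (not an NSX command).
# It returns 0 if the given file contains a thread-dump section for the
# "Cmpt-9000-chkpter" thread with a ForkJoinTask.externalAwaitDone frame,
# and 1 otherwise.
compactor_thread_hung() {
    awk '
        # A thread header line: quoted thread name followed by "#<id>".
        /"[^"]+" #[0-9]+/ { in_chkpter = ($0 ~ /"Cmpt-9000-chkpter"/) }
        # A matching stack frame inside the checkpointer thread section.
        in_chkpter && /ForkJoinTask\.externalAwaitDone/ { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

A possible invocation is compactor_thread_hung /var/log/corfu/tanuki.log && echo "compactor thread appears hung". A match only shows that the thread was waiting when the dump was taken, so it should be read together with the compactor audit log timestamps.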
Following the steps outlined in Troubleshooting NSX Datastore (CorfuDB) Issues to check the Epoch number in /config/corfu/LAYOUT_CURRENT.ds across all three NSX Managers shows that the outputs are identical.
On the NSX Manager, running df -h as the root user shows that /config partition usage is above 1% and, in most cases, well above it (more than 10%).
Historical disk space stats can be found by reading /var/log/stats/sys_disk.stats on the NSX Manager(s). You can track the usage of /config and see when it has grown above 1%.
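A small filter can pick out the samples in which /config exceeded a given usage. This sketch assumes df-style lines in which the mount point is the last field and the Use% value is the field before it; that holds for df -h output, and its applicability to sys_disk.stats is an assumption. The function name config_usage_above is hypothetical.

```shell
#!/bin/sh
# config_usage_above is a hypothetical helper (not an NSX command).
# It reads df-style lines on stdin and prints those for the /config mount
# whose Use% is strictly greater than the threshold (default 1).
config_usage_above() {
    threshold="${1:-1}"
    awk -v limit="$threshold" '
        $NF == "/config" {
            pct = $(NF - 1)        # for example "13%"
            sub(/%$/, "", pct)     # strip the trailing percent sign
            if (pct + 0 > limit + 0) print
        }
    '
}
```

For example, df -h | config_usage_above 1 checks the current state, and config_usage_above 10 < /var/log/stats/sys_disk.stats lists historical samples above 10%.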
The % usage can be listed in descending order by running the below command as the root user on the NSX manager(s):
grep -iE "\/dev\/mapper\/nsx-config .*%.*\/config" /var/log/stats/sys_disk.stats | sort -r -k 5
Due to a known issue in JDK, (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow), if there is no corfu compactor leader change for a long time, the corfu compactor leader may become unresponsive and not trigger compaction cycles, unless Corfu server is restarted. This leads to CBM service running out of memory and crashing.
For resolution details, see the KB article: NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow.
Perform a rolling reboot of the NSX Manager nodes as outlined in the KB article above.
Note: Depending on the level of Corfu impact, the get cluster status command may fail or return incomplete information due to the Cluster Boot Manager (CBM) service not being healthy. In this scenario, proceed with rebooting all three NSX Manager VMs regardless of the get cluster status output.
After the rolling reboot is complete, monitor the /config partition usage on each NSX Manager node using one of the following commands:
As root user:
df -h
As admin user:
get filesystem-stats
Allow the compaction process to run for up to 24 hours. During this time, monitor whether the /config partition utilization decreases.
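Rather than checking by hand, the usage can be sampled periodically and written to a file for later review. The following is a minimal sketch for the root shell, assuming the standard df, awk and date utilities are available on the appliance. The function name sample_usage and the output file in the usage note are hypothetical.

```shell
#!/bin/sh
# sample_usage is a hypothetical helper (not an NSX command).
# It prints one line: UTC timestamp, mount point, and current Use%.
sample_usage() {
    mount_point="${1:-/config}"
    # df -P prints a header plus one line per filesystem; field 5 is Use%.
    pct=$(df -P "$mount_point" | awk 'NR == 2 { print $5 }')
    printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$mount_point" "$pct"
}
```

One way to use it is a loop such as while true; do sample_usage /config >> /tmp/config_usage.log; sleep 900; done, which records a sample every 15 minutes. A steadily falling percentage in the file suggests that compaction has resumed.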
Once the /config partition usage decreases to approximately 1% across all NSX Manager nodes and UI access has been restored, verify cluster health by running the following command as the admin user:
get cluster status
Confirm that all clustered services report an UP status.
If, after 24 hours, the /config partition usage has not decreased or continues increasing, open a support case with Broadcom Support and reference this KB article.
For additional information, see Creating and managing Broadcom support cases.
Troubleshooting NSX Datastore (CorfuDB) Issues
After the managers have been restarted, an Application Crashed alarm may be present. The application crashed alarm(s) can be resolved by removing the core dump files from the respective nodes following the resolution steps in the following KB: Application on NSX node has crashed alarm
Note:
If the rolling reboot does not resolve the issue, it may be necessary to perform an additional one or two rolling reboot cycles. If UI access is still unavailable and services continue failing to start, please open a case with Broadcom Support for further assistance and troubleshooting.