Some appliance components are not functioning properly.
Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP.
Error code: 101
cbm_oom.hprof dump files can be found in the /image/core directory on all three NSX Managers:
-rw------- 1 nsx-cbm nsx-cbm 250M Jan 1 hr:mn cbm_oom.hprof
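To confirm the presence and size of the heap dump on each NSX Manager, a listing such as the following can be run as the root user (a simple check based on the path above; the exact file name may vary):
ls -lh /image/core/*.hprof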
The following log entries, indicating the CBM service is running out of memory, can be seen in the NSX Manager log /var/log/cbm/cbm.log:
WARN DistributedLockMonitorThread DistributedLockMonitorImpl 74346 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="distributed-lock-monitor" subcomp="cbm"] Exception in distributed lock monitor java.lang.OutOfMemoryError: Java heap space
WARN GmleRpcService:worker-0 SingleThreadEventExecutor 2721271 Unexpected exception from an event executor:java.lang.OutOfMemoryError: Java heap space
ERROR ClusteringRpcServer-Heartbeat-Thread1 HeartbeatServiceImpl 2682901 - [nsx@6876 comp="nsx-manager" errorCode="HBS101" level="ERROR" s2comp="heartbeat-service" subcomp="cbm"] RPC failed on method UpdateHeartbeat.
java.lang.OutOfMemoryError: Java heap space
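As a quick check, these entries can be searched for as the root user with a command such as the following (a sketch; rotated copies of cbm.log may be compressed and would need zgrep instead):
grep -i "java.lang.OutOfMemoryError" /var/log/cbm/cbm.log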
The Corfu compactor service, which typically runs every 15 minutes, has not run for a long time. To confirm when it last completed:
Log in to the NSX Manager as the root user and review the last messages in /var/log/corfu/corfu-compactor-audit.log. Check the timestamp of when the Corfu compactor last finished:
<Timestamp> | INFO | Cmpt-chkpter-9000 | o.c.c.CompactorCheckpointer | Exiting CorfuStoreCompactor
<Timestamp> INFO Runner - Finished running corfu compactor tool.
The command below lists the timestamps of every occurrence of the above entries in the logs, which makes it easier to identify when the compactor service stopped running:
grep -ihE "Exiting CorfuStoreCompactor|Finished running corfu compactor tool" /var/log/corfu/corfu-compactor-audit* | cut -d':' -f 1,2 | sort | uniq
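To show only the most recent completion time, the same pipeline can be extended with tail (a minor variation on the command above):
grep -ihE "Exiting CorfuStoreCompactor|Finished running corfu compactor tool" /var/log/corfu/corfu-compactor-audit* | cut -d':' -f 1,2 | sort | uniq | tail -n 1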
Log lines similar to the following are seen on the NSX Manager in /var/log/corfu/tanuki.log; in the Corfu thread dump, they indicate that the compactor leader is hung at 'ForkJoinPool':
INFO | jvm 1 | <Timestamp> | "Cmpt-9000-chkpter" #51 prio=5 os_prio=0 cpu=4993463.30ms elapsed=28943674.48s tid=0x000065877c0c7800 nid=0x129de in Object.wait() [0x000065875dfa5000]
INFO | jvm 1 | <Timestamp> | java.lang.Thread.State: WAITING (on object monitor)
INFO | jvm 1 | <Timestamp> | at java.lang.Object.wait(java.base@<version>/Native Method)
INFO | jvm 1 | <Timestamp> | - waiting on <no object reference available>
INFO | jvm 1 | <Timestamp> | at java.util.concurrent.ForkJoinTask.externalAwaitDone(java.base@<version>/Unknown Source)
INFO | jvm 1 | <Timestamp> | - waiting to re-lock in wait() <0x000065886e4cfaf0> (a java.util.concurrent.ForkJoinTask$AdaptedCallable)
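To locate this section of the thread dump, the thread name from the excerpt above can be searched for directly (the name 'Cmpt-9000-chkpter' is taken from this example and may differ between versions):
grep -A 6 "Cmpt-9000-chkpter" /var/log/corfu/tanuki.log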
Following the steps outlined in Troubleshooting NSX Datastore (CorfuDB) Issues to check the /config/corfu/LAYOUT_CURRENT.ds Epoch number across all three NSX-T Managers shows identical output on all nodes.
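As a quick comparison, the Epoch value can be read from the layout file on each Manager with a command such as the following (this assumes the JSON-style 'epoch' field found in Corfu layout files; refer to the article above for the full procedure):
grep -oE '"epoch": *[0-9]+' /config/corfu/LAYOUT_CURRENT.ds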
On the NSX Manager, as the root user, running df -h shows the /config partition usage is above 1%, and in most cases well above it, often more than 10%.
Historical disk space statistics can be found in /var/log/stats/sys_disk.stats on the NSX Manager(s); these can be used to track the usage of /config and see when it grew above 1%.
The % usage can be listed in descending order by running the command below as the root user on the NSX Manager(s):
grep -iE "\/dev\/mapper\/nsx-config .*%.*\/config" /var/log/stats/sys_disk.stats | sort -r -k 5
Due to a known JDK issue (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow), if there is no Corfu compactor leader change for a long time, the compactor leader may become unresponsive and stop triggering compaction cycles unless the Corfu server is restarted. This leads to the CBM service running out of memory and crashing.
For resolution, see NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow.
To work around this issue, perform a rolling reboot of the NSX Managers, as per the KB NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow.
Then monitor the /config partition using the df -h command as the root user. The compaction process may take some time; allow it to run for 24 hours.
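As an optional aid, usage can be recorded at intervals with a simple loop such as the one below (a sketch; the 10-minute interval and the /tmp/config_usage.log output file are arbitrary choices):
while true; do date >> /tmp/config_usage.log; df -h /config >> /tmp/config_usage.log; sleep 600; done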
If /config usage drops to around 1% across all nodes and the UI is accessible again, run 'get cluster status' as the admin user and confirm all clustered services are in an UP state.
If after 24 hours /config usage has not decreased, or has increased, open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.