NSX Manager UI Error Code 101 "Some appliance components are not functioning properly" due to JDK bug impacting Corfu compactor

Products

VMware NSX

Issue/Introduction

Users cannot access the NSX UI or log in. One or all of the NSX Manager nodes may be unreachable through the web interface.

The following or similar alerts are displayed on the browser when trying to access NSX Manager UI or NSX Global Manager UI:

Some appliance components are not functioning properly. 
Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP.
Error code: 101

The HTTP return code and browser tab may indicate HTTP 503.
The following error is returned when the 'get cluster status' command is run on the manager CLI:
```
<Manager Name> get cluster status
% An error occurred while getting the cluster status
```
An Application Crashed alarm may be present for all three NSX Managers:

Application on NSX Node <node name> has crashed. The number of core files found is #. Collect the Support Bundle including core dump files and contact VMware Support Team.

cbm_oom.hprof dump files can be found in the /image/core directory on all three NSX Managers:
-rw------- 1 nsx-cbm nsx-cbm 250M Jan 1 hr:mn cbm_oom.hprof
The following log entries, which show the CBM service running out of memory, can be seen in the NSX Manager log /var/log/cbm/cbm.log:
WARN DistributedLockMonitorThread DistributedLockMonitorImpl 74346 - [nsx@6876 comp="nsx-manager" level="WARNING" s2comp="distributed-lock-monitor" subcomp="cbm"] Exception in distributed lock monitor java.lang.OutOfMemoryError: Java heap space
WARN GmleRpcService:worker-0 SingleThreadEventExecutor 2721271 Unexpected exception from an event executor:java.lang.OutOfMemoryError: Java heap space
ERROR ClusteringRpcServer-Heartbeat-Thread1 HeartbeatServiceImpl 2682901 - [nsx@6876 comp="nsx-manager" errorCode="HBS101" level="ERROR" s2comp="heartbeat-service" subcomp="cbm"] RPC failed on method UpdateHeartbeat.
java.lang.OutOfMemoryError: Java heap space
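As a quick check, the heap-space errors shown above can be counted with a small shell helper. This is a minimal sketch: the function name count_cbm_oom is hypothetical, and the default path is simply the cbm.log location given above.

```shell
# Hypothetical helper: count the lines in a CBM log that contain the
# heap-space OutOfMemoryError message (fixed-string match, not a regex).
count_cbm_oom() {
    grep -F -c 'java.lang.OutOfMemoryError: Java heap space' "${1:-/var/log/cbm/cbm.log}"
}

# Example (as root on an NSX Manager):
# count_cbm_oom /var/log/cbm/cbm.log
```

A non-zero count on each manager is consistent with the symptom described here, but it does not by itself confirm the root cause.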
The Corfu compactor service, which typically runs every 15 minutes, has not run for a long time. To confirm when it last completed:
- Log in to the NSX Manager as the root user and review the last messages in /var/log/corfu/corfu-compactor-audit.log. Check the timestamps to see when the Corfu compactor last finished:
  <Timestamp> | INFO | Cmpt-chkpter-9000 | o.c.c.CompactorCheckpointer | Exiting CorfuStoreCompactor
  <Timestamp> INFO Runner - Finished running corfu compactor tool.

- The command below lists the timestamps of the log entries that match the messages above. This makes it easier to identify when the compactor service stopped running:

grep -ihE "Exiting CorfuStoreCompactor|Finished running corfu compactor tool" /var/log/corfu/corfu-compactor-audit* | cut -d':' -f 1,2 | sort| uniq
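To see only the most recent completion, the same pipeline can be wrapped in a helper that keeps the last entry. This is a sketch under two assumptions: the function name last_compaction is hypothetical, and the log timestamps are assumed to sort chronologically as text (which the sort step above already relies on).

```shell
# Hypothetical helper: print the latest timestamp prefix at which the
# compactor reported completion, using the same match and cut as above.
last_compaction() {
    grep -ihE "Exiting CorfuStoreCompactor|Finished running corfu compactor tool" "$@" \
        | cut -d':' -f 1,2 | sort | uniq | tail -n 1
}

# Example (as root on an NSX Manager):
# last_compaction /var/log/corfu/corfu-compactor-audit*
```

If the value printed is much older than the expected 15-minute cycle, the compactor has likely stopped running.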

- Log lines in /var/log/corfu/tanuki.log on the NSX Manager indicate, in the Corfu thread dump, that the compactor leader is hung at 'Fork Join Pool'.

Following the steps in Troubleshooting NSX Datastore (CorfuDB) Issues to check the Epoch number in /config/corfu/LAYOUT_CURRENT.ds on all three NSX Managers shows identical values on every node.
Alarms for high and very high /config partition disk usage might also be seen.
Running df -h as the root user on the NSX Manager shows /config partition usage above 1% and, in most cases, well above it (more than 10%).
This issue is also seen in Federation environments, where both Global and Local Managers can be affected.
Historical disk space stats can be found by reading /var/log/stats/sys_disk.stats on the NSX Manager(s). You can track the usage of /config and see when it has grown above 1%.
- The % usage can be listed in descending order by running the below command as the root user on the NSX manager(s):
```
grep -iE "\/dev\/mapper\/nsx-config .*%.*\/config" /var/log/stats/sys_disk.stats | sort -rn -k 5,5
```
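To get just the peak /config usage from the historical stats, a small awk filter can be used. This is a sketch only: the function name max_config_pct is hypothetical, and it assumes each matching line contains a field of the form N% (as the grep pattern above implies) without assuming which column it is in.

```shell
# Hypothetical helper: read disk-stats lines on stdin and print the highest
# percentage found on lines for the nsx-config volume mounted at /config.
max_config_pct() {
    awk '/\/dev\/mapper\/nsx-config .*%.*\/config/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+%$/) { v = $i + 0; if (v > max) max = v }
    } END { print max + 0 }'
}

# Example (as root on an NSX Manager):
# max_config_pct < /var/log/stats/sys_disk.stats
```

The helper prints 0 when no matching line is found, so a result of 0 should be read as "no data" rather than as an empty partition.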

Environment

VMware NSX 4.2.0.x
VMware NSX 4.2.1.3 and earlier
VMware NSX 9.0.0

Cause

Due to a known JDK issue (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow), the Corfu compactor leader may become unresponsive and stop triggering compaction cycles if there has been no compactor leader change for a long time. The leader stays in this state until the Corfu server is restarted. This leads to the CBM service running out of memory and crashing.

Resolution

For resolution details, see the KB article: NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow.

Workaround Procedure

Perform a rolling reboot of the NSX Manager nodes as outlined in the KB article above.

Note: Depending on the level of Corfu impact, the get cluster status command may fail or return incomplete information due to the Cluster Boot Manager (CBM) service not being healthy. In this scenario, proceed with rebooting all three NSX Manager VMs regardless of the get cluster status output.

After the rolling reboot is complete, monitor the /config partition usage on each NSX Manager node using one of the following commands:
- As root user:
```
df -h
```
- As admin user:
```
get filesystem-stats
```
Allow the compaction process to run for up to 24 hours. During this time, monitor whether the /config partition utilization decreases.
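The monitoring step can be sketched as a simple sampling loop run as the root user. Everything named here is an assumption for illustration: the function name monitor_usage is hypothetical, and the 15-minute interval and 96 samples (roughly 24 hours) are example values, not a documented requirement.

```shell
# Hypothetical helper: print "<UTC timestamp> <Use%>" for a mount point a
# fixed number of times, sleeping between samples.
monitor_usage() {
    mount_point=$1
    interval=$2
    samples=$3
    i=0
    while [ "$i" -lt "$samples" ]; do
        # df -P keeps each filesystem on one line; field 5 is the usage percentage.
        pct=$(df -P "$mount_point" | awk 'NR == 2 { print $5 }')
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$pct"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}

# Example: sample /config every 15 minutes, 96 times (roughly 24 hours).
# monitor_usage /config 900 96
```

A steadily falling percentage in the output suggests compaction has resumed. A flat or rising value over the full period matches the escalation condition described under Additional Guidance.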
Once the /config partition usage decreases to approximately 1% across all NSX Manager nodes and UI access has been restored, verify cluster health by running the following command as the admin user:
```
get cluster status
```
Confirm that all clustered services report an UP status.

Additional Guidance

If, after 24 hours, the /config partition usage has not decreased or continues increasing, open a support case with Broadcom Support and reference this KB article.

Additional Information

Troubleshooting NSX Datastore (CorfuDB) Issues

After the managers have been restarted, an Application Crashed alarm may be present. The application crashed alarm(s) can be resolved by removing the core dump files from the respective nodes following the resolution steps in the following KB: Application on NSX node has crashed alarm

Note:

If the rolling reboot does not resolve the issue, it may be necessary to perform an additional one or two rolling reboot cycles. If UI access is still unavailable and services continue failing to start, open a case with Broadcom Support for further assistance and troubleshooting.