NSX is Impacted by JDK-8330017: ForkJoinPool Stops Executing Tasks Due to ctl Field Release Count (RC) Overflow

Article ID: 396719

Updated On:

Products

VMware NSX

Issue/Introduction

VMware NSX 4.2.0.x and 4.2.1.x are affected by a critical JDK bug (JDK-8330017) in which the Java ForkJoinPool incorrectly determines that its total thread count is over the limit, causing requests for new threads to be blocked. As a result, the transaction-processing threads of NSX services become unresponsive.

When this occurs, one or more of the following symptoms appear:

  • NSX Upgrade JDK pre-check warning - NSX Manager reboot required
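
A quick way to tell whether a Java service's ForkJoinPool is still executing tasks is a timed probe. The sketch below is illustrative only (it is not an NSX diagnostic tool, and PoolLivenessProbe is a hypothetical name): a healthy pool completes a trivial task almost immediately, while a pool stuck in the state described by JDK-8330017 lets the timed wait expire.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.ForkJoinTask;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Illustrative probe, not an NSX tool: a healthy ForkJoinPool runs a
    // trivial task almost immediately; a pool whose thread accounting is
    // stuck at the maximum never schedules it, so the timed wait expires.
    public class PoolLivenessProbe {
        public static void main(String[] args) throws Exception {
            ForkJoinPool pool = ForkJoinPool.commonPool();
            ForkJoinTask<?> probe = pool.submit(() -> { /* no-op */ });
            try {
                probe.get(5, TimeUnit.SECONDS);
                System.out.println("Pool is executing tasks");
            } catch (TimeoutException e) {
                System.out.println("Trivial task did not run within 5s - "
                        + "the pool may have stopped creating worker threads");
            }
        }
    }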

Environment

VMware NSX 4.2.0.x 
VMware NSX 4.2.1.x 

Cause

The issue occurs due to a JDK bug (JDK-8330017) where the Release Count (RC) field in ForkJoinPool's 64-bit ctl control word overflows. The RC value keeps decreasing until it reaches -32768, then wraps around to +32767 (ForkJoinPool.MAX_CAP). The pool then believes it is already at its maximum thread capacity and stops executing tasks.
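
For illustration, the following minimal Java sketch (not the JDK source) reproduces the arithmetic: a signed 16-bit counter, like the RC field packed into the high bits of the ctl word, wraps from -32768 to +32767 when decremented one more time.

    // Minimal sketch of the wraparound behind JDK-8330017. A Java short
    // models the 16-bit RC field; the JDK actually stores RC inside a
    // 64-bit long (ctl), but the two's-complement arithmetic is the same.
    public class RcOverflowDemo {
        public static void main(String[] args) {
            short rc = 0;
            // Each unbalanced release decrements RC; after 32768
            // decrements the counter sits at its minimum value...
            for (int i = 0; i < 32768; i++) {
                rc--;
            }
            System.out.println(rc);   // -32768
            // ...and one more decrement wraps it to +32767, which equals
            // ForkJoinPool.MAX_CAP, so the pool concludes it is already
            // at its maximum thread count and stops starting workers.
            rc--;
            System.out.println(rc);   // 32767
        }
    }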

This affects different NSX services:

  • Controller service - impacts network provisioning, firewall rules, and vMotion operations
  • Upgrade Coordinator service - affects upgrade operations and causes OOM errors
  • Corfu service - impacts data storage and retrieval operations

The issue accumulates over time and becomes apparent during configuration changes (upgrades, VM migrations) or when memory limits are reached.

Resolution

This issue is resolved in VMware NSX 4.2.1.4 and in 4.2.2 and later, available at Broadcom downloads. If you have difficulty finding or downloading the software, please review the Download Broadcom products and software KB.

Broadcom recommends a rolling reboot of NSX Managers prior to upgrading to a fixed release version to avoid potential problems associated with this issue.

For environments running affected versions (4.2.0.x or 4.2.1.x), implement a preventative monthly rolling reboot schedule:

  1. Reboot the first NSX Manager.
  2. SSH to a Manager as the admin user and check cluster health: get cluster status (a scripted alternative is sketched after this list)
  3. When all services report up on all 3 NSX Manager nodes, reboot the next Manager.
  4. Repeat steps 2-3 for the third Manager.
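
As an alternative to checking interactively, cluster health can also be read over the NSX Manager REST API between reboots. The sketch below is a hypothetical helper, assuming the GET /api/v1/cluster/status endpoint and a STABLE status marker in its response; verify both against the API guide for your NSX version. It also assumes the Manager's TLS certificate is trusted by the JVM.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    // Hypothetical helper for the rolling-reboot procedure: reads the NSX
    // Manager cluster status between reboots. The endpoint path and the
    // "STABLE" marker are assumptions to verify against your API guide.
    public class ClusterStatusCheck {
        public static void main(String[] args) throws Exception {
            String manager = args[0];                      // an NSX Manager FQDN or IP
            String credentials = args[1] + ":" + args[2];  // admin user and password
            String auth = Base64.getEncoder()
                    .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://" + manager + "/api/v1/cluster/status"))
                    .header("Authorization", "Basic " + auth)
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // Crude check: only proceed with the next Manager reboot once
            // the cluster reports a stable status.
            if (response.statusCode() == 200 && response.body().contains("\"STABLE\"")) {
                System.out.println("Cluster stable - safe to reboot the next Manager");
            } else {
                System.out.println("Cluster not stable yet - wait and re-check");
            }
        }
    }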

Note: If experiencing this issue currently, restarting the affected service or rebooting the affected NSX Manager node resolves the immediate symptoms. However, without upgrading to a fixed release (NSX 4.2.1.4, or 4.2.2 and later), the problem will recur over time.