Java ForkJoinPool issue affecting multiple NSX services, causing Cluster Degradation, NSX UI inaccessibility, or loss of VM network connectivity

Products

VMware NSX

Issue/Introduction

The NSX UI may display a Cluster Degraded alarm.
Virtual machines may lose network connectivity when vMotioned to an ESXi host that has exited maintenance mode.
Newly created VMs, or recently migrated/vMotioned VMs do not connect to the network.
Only some of the VMs on the same segment lose data plane connectivity; a rolling reboot of the NSX Managers restores it.
On the vSphere Client, Networking -> vDS Name -> Ports, the impacted VM port is in a "Blocked" state.

Log lines similar to the below are encountered on the ESXi host in /var/run/log/vmkernel.log

In(182) vmkernel: cpu81:126647407 opID=########)kcp: KCP_DeletePort:958: [nsx@6876 comp="nsx-esx" subcomp="kcp"]Port ###### is cleared and blocked

Output similar to the below is seen on the ESXi host in the output of net-dvs -l

        port ########-####-####-####-############:
                com.vmware.common.port.volatile.status = inUse linkUp blocked portID=###### Port blocked by admin propType = RUNTIME
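
When many ports are listed, the blocked ones can be picked out of saved net-dvs -l output with a short script. The following is a minimal sketch, assuming the output has been captured to a string and that each port section follows the layout shown above (a "port <id>:" header followed by a com.vmware.common.port.volatile.status line). The function name find_blocked_ports is hypothetical and not part of any NSX or ESXi tooling.

```python
import re

# Matches a port header line such as "port 1234abcd-...:" (layout taken from the excerpt above).
PORT_HEADER = re.compile(r"^\s*port\s+(\S+):\s*$")
STATUS_KEY = "com.vmware.common.port.volatile.status"


def find_blocked_ports(net_dvs_output):
    """Return the IDs of ports whose volatile status contains the 'blocked' flag.

    Hypothetical helper: it only parses text previously captured from
    'net-dvs -l'; it does not run any command on the host.
    """
    blocked = []
    current_port = None
    for line in net_dvs_output.splitlines():
        header = PORT_HEADER.match(line)
        if header:
            current_port = header.group(1)
            continue
        if current_port and STATUS_KEY in line:
            # Everything after the first "=" is the space-separated status value.
            _, _, value = line.partition("=")
            if "blocked" in value.split() and current_port not in blocked:
                blocked.append(current_port)
    return blocked
```

The returned port IDs can then be compared with the ports shown as "Blocked" in the vSphere Client.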

New configuration changes, such as segment updates and policy updates, are delayed or blocked.

vMotion of a VM may be blocked with an error on the vSphere client:

"Currently connected network interface 'Network adapter X' uses network 'DVSwitch[## ## ## ## ## ## ## ##-## ## ## ## ## ## ## ##] NSX port group [dvportgroup-#####](nsxa down)', which is not accessible."

The connection from ESXi hosts to the Central Control Plane may briefly flap between two NSX Manager nodes during resharding, as part of the automatic recovery.

Log lines similar to the below are encountered on the ESXi host in /var/run/log/vmkernel.log

In(182) vmkernel: cpu1:2176110)vdl2: VDL2CPProcessLinkChange:6889: [nsx@6876 comp="nsx-esx" subcomp="vdl2-####"]Control plane link down[IP: ###.###.##.##] for VNI[####]

Alarms may be present in the NSX UI showing:

Control Channel To Transport Node Down

Control Channel To Transport Node Down Long

This condition may occur in two ways.

Scenario #1: The Controller service transaction processing thread is blocked but automatically self-recovers after 2 hours (7200 seconds).
Log lines similar to the below are encountered on the NSX Manager in /var/log/cloudnet/nsx-ccp.log

ERROR FalconThread-0 AbstractDependencyBasedDataDiscoverer 74130 - [nsx@6876 comp="nsx-controller" errorCode="CCP1310211" level="ERROR" subcomp="magpie"] Parallel invocation of features encountered error with concurrent listeners: {}
java.util.concurrent.TimeoutException: Shutdown timer hit after 7200 seconds

Log lines similar to the below are encountered on the NSX Manager in /var/log/cloudnet/nsx-ccp-events.log

EVENT WrapperSimpleAppMain Main 2200380 - [nsx@6876 comp="nsx-controller" level="EVENT" subcomp="main"] CCP process started
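
The two log signatures for Scenario #1 can be checked together. The following is a minimal sketch, assuming the contents of nsx-ccp.log and nsx-ccp-events.log have already been read into lists of lines. The function name scenario1_evidence is hypothetical. Note that "CCP process started" is also logged on any ordinary Controller restart, so the sketch treats it as supporting evidence only when the shutdown timer exception is also present.

```python
# Marker strings copied from the log excerpts above.
TIMEOUT_MARKER = "java.util.concurrent.TimeoutException: Shutdown timer hit after"
RESTART_MARKER = "CCP process started"


def scenario1_evidence(ccp_log_lines, ccp_events_lines):
    """Count Scenario #1 signatures in already-loaded log lines.

    Hypothetical helper: ccp_log_lines is expected to come from nsx-ccp.log
    and ccp_events_lines from nsx-ccp-events.log.
    """
    timeout_hits = sum(1 for line in ccp_log_lines if TIMEOUT_MARKER in line)
    restarts = sum(1 for line in ccp_events_lines if RESTART_MARKER in line)
    return {
        "timeout_hits": timeout_hits,
        "restarts": restarts,
        # Both signatures must be present to match the pattern described above.
        "matches_scenario_1": timeout_hits > 0 and restarts > 0,
    }
```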

Scenario #2: The Controller service transaction processing thread is blocked indefinitely and does not self-recover.
No log lines containing "ForkJoinPool.commonPool" are seen for 30 minutes or longer on the NSX Manager in /var/log/cloudnet/nsx-ccp.log. An example of such a log line:
```
INFO ForkJoinPool.commonPool-worker-63 ShardingManagerImpl 87947 - [nsx@4413 comp="nsx-controller" level="INFO" subcomp="magpie"] Notify listeners for sharding update with revision ########
```
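
The 30-minute silence check for Scenario #2 can be sketched as follows. This is a minimal illustration that assumes each line in nsx-ccp.log begins with an ISO 8601 timestamp (for example 2024-08-01T10:00:00.000Z); the excerpts above omit timestamps, so this format is an assumption and may need adjusting. The function names are hypothetical.

```python
from datetime import datetime, timedelta

MARKER = "ForkJoinPool.commonPool"


def parse_timestamp(line):
    """Parse a leading ISO 8601 timestamp (assumed format); return None if absent."""
    token = line.split(" ", 1)[0]
    if token.endswith("Z"):
        # Rewrite the trailing "Z" so older Python versions can parse it.
        token = token[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(token)
    except ValueError:
        return None


def last_common_pool_activity(lines):
    """Return the newest timestamp among lines mentioning ForkJoinPool.commonPool."""
    last_seen = None
    for line in lines:
        if MARKER not in line:
            continue
        ts = parse_timestamp(line)
        if ts is not None and (last_seen is None or ts > last_seen):
            last_seen = ts
    return last_seen


def common_pool_silent(lines, now, threshold=timedelta(minutes=30)):
    """True if the newest marker line is at least 'threshold' older than 'now'.

    Returns None when no timestamped marker line is found, because the
    supplied lines may simply not cover a period of normal activity.
    'now' must be timezone-aware if the log timestamps are.
    """
    last_seen = last_common_pool_activity(lines)
    if last_seen is None:
        return None
    return now - last_seen >= threshold
```

A result of None means the supplied lines are inconclusive, for example after log rotation, and an older log file should be checked before drawing a conclusion.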

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

Environment

VMware NSX 4.2.0.x
VMware NSX 4.2.1.x
VMware NSX 9.0.0

Cause

Due to an issue in the JDK (JDK-8330017), Java ForkJoinPool may incorrectly determine that the total number of ForkJoinPool threads is over the limit and block new thread requests. As a result, the NSX Controller transaction processing thread becomes blocked.
There are two possible scenarios:

Scenario #1: The Controller service is impacted, but it restarts automatically after 2 hours to self-recover. VMs requesting a new network connection after a vMotion or power-on are impacted during this 2-hour window.
Scenario #2: ForkJoinPool.commonPool may become blocked, and the Controller service cannot recover without a manual restart. VMs requesting a new network connection after a vMotion or power-on are impacted until the issue is manually resolved.

Note: This issue is expected to recur based on the uptime of the Controller service. Medium form factor Managers can experience the issue after 6 weeks of uptime, and Large/Extra Large form factor Managers after more than 3 months.

Resolution

For resolution and workaround, refer to the parent article that consolidates guidance regarding this issue: NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow.