DFW vMotion Failure Alarm due to NSX Controller being in a bad state

Article ID: 390660


Updated On:

Products

  • VMware vDefend Firewall
  • VMware NSX

Issue/Introduction

  • 'DFW vMotion Failure' NSX alarms are observed when attempting to vMotion a VM to another NSX-prepared host:
    • The DFW vMotion for DFW filter nic-######-eth0-vmware-sfw.2 on destination host <hostname> has failed and the port for the entity has been disconnected.
  • Controller, Manager, and NSX agent connectivity and status appear normal.
  • The data dump is missing from the NSX Manager support bundle.
  • Manually generating the data dump also fails.
  • DFW rules are not applied to the VM(s) because the vMotion fails.

The following messages/errors were observed in the ESXi host logs:

/var/run/log/nsx-syslog:

2025-03-06T20:37:57.086Z In(182) nestdb-server[2100043]: NSX 2100043 - [nsx@6876 comp="nsx-esx" subcomp="nsx-nestdb" tid="2100043" level="INFO"] ########-####-####-####-############ disconnected, Unsubscribing corresponding subscriptions <<<==== (Disconnect between CCP and nestDB)

....


2025-03-06T20:44:00.319Z In(182) cfgAgent[2099972]: NSX 2099972 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="97444700" level="info"] dfw: DfwMsgCache: msg to update: sect(0, 0), rule(0, 0), cont(0, 0), group(0, 0)
.....


2025-03-06T20:44:00.319Z In(182) cfgAgent[2099972]: NSX 2099972 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="97444700" level="info"] dfw: DfwMsgCache: Update done. 0 sections, 0 rules, 0 containers, 0 lsp group relations <<<==== (No configuration received from the controller)

...

2025-03-06T20:50:00.326Z In(182) cfgAgent[2099972]: NSX 2099972 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="96C34700" level="info"] dfw: No matched appliedto for filter nic-######-eth0-vmware-sfw.2 of vif ########-####-####-####-###########, timeout reached. Clearing import flag
2025-03-06T20:50:00.326Z Er(179) cfgAgent[2099972]: NSX 2099972 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="96C34700" level="error" errorCode="LCP01106"] dfw: Failed to apply rule config to filter nic-######-eth0-vmware-sfw.2 of vif ########-####-####-####-############: invalid argument
2025-03-06T20:50:00.653Z Er(179) cfgAgent[2099972]: NSX 2099972 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-monitoring" entId="########-####-####-####-########" tid="97DD7700" level="fatal" eventState="On" eventFeatureName="distributed_firewall" eventSev="critical" eventType="dfw_vmotion_failure"] The DFW vMotion for DFW filter nic-######-eth0-vmware-sfw.2 on destination host <hostname> has failed and the port for the entity has been disconnected.

/var/run/log/vmkernel.log:

2025-03-06T20:49:18.958Z In(182) vmkernel: cpu4:2098032)Restore state called for filter nic-######-eth0-vmware-sfw.2

....
2025-03-06T20:49:18.958Z In(182) vmkernel: cpu4:2098032)pfr_attach_table: nic-######-eth0-vmware-sfw.2: ERROR ***************** local root table ########-####-####-####-############ not found
2025-03-06T20:49:18.958Z In(182) vmkernel: cpu4:2098032)pfr_attach_table: nic-######-eth0-vmware-sfw.2: ERROR ***************** local root table ########-####-####-####-############ not found
2025-03-06T20:49:18.958Z In(182) vmkernel: cpu4:2098032)pfioctl: DIOCADDRULE failed with error 2

.....

2025-03-06T20:49:18.958Z In(182) vmkernel: cpu4:2098032)failed to import single ruleset: Failure
2025-03-06T20:49:18.958Z In(182) vmkernel: cpu4:2098032)IMPORTTLVRULES failed: 195887105
2025-03-06T20:49:18.958Z In(182) vmkernel: cpu4:2098032)Sending message to cfgAgent to raising alarm for filter import failure
2025-03-06T20:49:18.958Z In(182) vmkernel: cpu54:2105351)VMotion: 6709: 2805539076395193744 D: Received all changed pages.
2025-03-06T20:49:18.958Z In(182) vmkernel: cpu4:2098032)Failed to restore datapath state : Failure

.....

2025-03-06T20:49:18.958Z In(182) vmkernel: cpu4:2098032)DVFilter: 1756: Bringing down port due to failed DVFilter state restoration and failPolicy of FAIL_CLOSED.

The following messages were observed in the NSX Manager logs:

/var/log/cloudnet/nsx-ccp-wrapper.errput:

INFO   | jvm 1    | 2025/03/11 01:52:31 | "ForkJoinPool.commonPool-worker-23" #101 daemon prio=5 os_prio=0 cpu=184.72ms elapsed=272925.10s tid=0x000071c######800 nid=0x15f0 waiting on condition  [0x0000#######4a000] <<<=== "elapsed=272925.10s" indicates that several ForkJoinPool.commonPool threads have been in a waiting or blocked state for more than 3 days, so the Controller service could not function normally.
INFO   | jvm 1    | 2025/03/11 01:52:31 |    java.lang.Thread.State: TIMED_WAITING (parking)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       - parking to wait for  <0x0000#####b65298> (a java.util.concurrent.ForkJoinPool)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at java.util.concurrent.locks.LockSupport.parkUntil([email protected]/Unknown Source)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at java.util.concurrent.ForkJoinPool.runWorker([email protected]/Unknown Source)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at java.util.concurrent.ForkJoinWorkerThread.run([email protected]/Unknown Source)
INFO   | jvm 1    | 2025/03/11 01:52:31 |
INFO   | jvm 1    | 2025/03/11 01:52:31 | "ConnectionKeeperImpl-1-3" #102 daemon prio=5 os_prio=0 cpu=3.53ms elapsed=272925.09s tid=0x000071c620010800 nid=0x15f1 waiting on condition  [0x0000######a1e000]
INFO   | jvm 1    | 2025/03/11 01:52:31 |    java.lang.Thread.State: WAITING (parking)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       - parking to wait for  <0x000######d52e00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at java.util.concurrent.locks.LockSupport.park([email protected]/Unknown Source)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/Unknown Source)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take([email protected]/Unknown Source)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take([email protected]/Unknown Source)
INFO   | jvm 1    | 2025/03/11 01:52:31 |       at java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/Unknown Source)

Environment

  • VMware NSX 4.2.0.x
  • VMware NSX 4.2.1.x
  • VMware NSX 9.0.0

Cause

Due to a JDK bug (JDK-8330017), the Java ForkJoinPool may incorrectly determine that the total number of ForkJoinPool threads is over the limit and block new thread requests, which results in the NSX Controller transaction processing thread being blocked. In other scenarios, ForkJoinPool.commonPool may become blocked and the Controller service cannot recover without a manual restart. For more information, refer to JDK-8330017.
Note: This issue is expected to recur based on the uptime of the Controller service. Medium form factor Managers can experience the issue after 6 weeks of uptime, and Large/Extra Large form factor Managers after more than 3 months.
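
For illustration only, the following sketch is not NSX code and not a reproduction of JDK-8330017; it shows in generic Java how a ForkJoinPool.commonPool() whose workers are all blocked leaves newly submitted work queued indefinitely, which is the same end state the Controller reaches when its pool threads are blocked. The class name CommonPoolStarvationSketch and all variable names are hypothetical, and the example assumes a JVM with more than one available processor.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only: not NSX code and not a reproduction of JDK-8330017.
public class CommonPoolStarvationSketch {
    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        int workers = Math.max(1, pool.getParallelism());
        CountDownLatch release = new CountDownLatch(1);

        // Occupy every common-pool worker with a task that parks indefinitely,
        // analogous to the parked "ForkJoinPool.commonPool-worker" threads in the
        // thread dump above.
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    release.await();          // worker thread parks here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Work submitted afterwards just sits in the queue: no worker is free,
        // and plain blocking (unlike ForkJoinPool.managedBlock) does not trigger
        // a compensation thread.
        ForkJoinTask<?> newWork = pool.submit(() -> System.out.println("new work ran"));
        try {
            newWork.get(2, TimeUnit.SECONDS);
            System.out.println("new work completed; pool was not saturated");
        } catch (TimeoutException e) {
            System.out.println("new work is still queued after 2s; pool is starved");
        }

        release.countDown();                  // unblock the workers
        newWork.get(10, TimeUnit.SECONDS);    // the queued task can now run
    }
}

Restarting the Controller service replaces the starved pool with a fresh one, which is consistent with the manual restart described above.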

 

Resolution