NSX-T VM connectivity issues and Transport node Controller Connectivity status is UNKNOWN

Products

VMware NSX

Issue/Introduction

Symptoms:

The cluster status shows as up and stable when you run: get cluster status
The Transport nodes show as connected in the Fabric screen.
In the Overview screen for System -> Fabric -> Nodes -> Edge or Host Transport nodes, the Controller Connectivity shows as UNKNOWN.
Tunnels to these Transport nodes also show as DOWN.
DFW rule publishing may fail due to this issue.
Other CLI commands, such as get nodes and get services, may fail.
You have NSX Intelligence installed.
The proton-tomcat-wrapper.log on the NSX-T manager shows:

Exception in thread "ForkJoinPool.commonPool-worker-4" java.lang.OutOfMemoryError: unable to create new native thread
The JVM has run out of memory. Requesting thread dump.
Dumping JVM state.
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at java.util.concurrent.ForkJoinPool.createWorker(ForkJoinPool.java:1486)
at java.util.concurrent.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1517)
at java.util.concurrent.ForkJoinPool.deregisterWorker(ForkJoinPool.java:1609)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:167)
Exception in thread "ForkJoinPool.commonPool-worker-11" java.lang.OutOfMemoryError: unable to create new native thread
The JVM has run out of memory. Requesting thread dump.
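To check a copy of proton-tomcat-wrapper.log for this signature, a short script can help. The following is a minimal sketch, not a supported tool: the function name `find_native_thread_oom` is hypothetical, and the pattern assumes the error lines look exactly like the excerpt above.

```python
import re

# Pattern based on the log excerpt above; other builds may format it differently.
OOM_PATTERN = re.compile(
    r'Exception in thread "(?P<thread>[^"]+)" java\.lang\.OutOfMemoryError: '
    r"unable to create new native thread"
)


def find_native_thread_oom(lines):
    """Return the names of threads that reported the native-thread OutOfMemoryError."""
    names = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            names.append(match.group("thread"))
    return names


if __name__ == "__main__":
    sample = [
        'Exception in thread "ForkJoinPool.commonPool-worker-4" '
        "java.lang.OutOfMemoryError: unable to create new native thread",
        "The JVM has run out of memory. Requesting thread dump.",
    ]
    print(find_native_thread_oom(sample))
```

Any iterable of lines works as input, so an open file handle for the log can be passed directly.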

The nsxapi log on the NSX-T manager shows many events like the following, for example two within 3 seconds:

INFO intelligence-alarm-start-stop EventSource 8004 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Starting EventSource
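To get a rough sense of how often this event appears, the matching lines can be counted. The following is a minimal sketch: the function name `count_event_source_starts` is hypothetical, and the two marker strings are taken from the sample line above. The excerpt does not include a timestamp, so the sketch only counts occurrences and does not compute a rate.

```python
def count_event_source_starts(
    lines,
    source="intelligence-alarm-start-stop",
    marker="Starting EventSource",
):
    """Count log lines that contain both the event source name and the marker text."""
    return sum(1 for line in lines if source in line and marker in line)


if __name__ == "__main__":
    sample = [
        'INFO intelligence-alarm-start-stop EventSource 8004 MONITORING '
        '[nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] '
        "Starting EventSource",
    ]
    print(count_event_source_starts(sample))
```

A count that keeps growing quickly between two runs on the same log is consistent with the symptom described here, but it is not proof of the issue on its own.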

If we take a thread dump, the proton-tomcat-wrapper.log shows a very large number of threads waiting in EventReportProcessor.java, like the following:

INFO | jvm 1 | 2021/03/17 12:55:21 | "pool-9971-thread-1" #83259 prio=5 os_prio=0 tid=0x0000725d04fb2800 nid=0x514 waiting on condition [0x0000725b6177d000]
INFO | jvm 1 | 2021/03/17 12:55:21 | java.lang.Thread.State: WAITING (parking)
INFO | jvm 1 | 2021/03/17 12:55:21 | at sun.misc.Unsafe.park(Native Method)
INFO | jvm 1 | 2021/03/17 12:55:21 | - parking to wait for <0x0000725d3bf01140> (a java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.FutureTask.get(FutureTask.java:191)
INFO | jvm 1 | 2021/03/17 12:55:21 | at com.vmware.nsx.monitoring.clientlibrary.core.EventReportProcessor$1.run(EventReportProcessor.java:94)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.FutureTask.run(FutureTask.java:266)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.lang.Thread.run(Thread.java:748)
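Counting these threads by hand is impractical in a large dump. The following is a minimal sketch of one way to count them, assuming the wrapper prefix format shown above (`INFO | jvm 1 | <timestamp> | <payload>`) and that each thread entry begins with a quoted thread name. The function name `count_threads_with_frame` is hypothetical.

```python
def count_threads_with_frame(lines, needle="EventReportProcessor"):
    """Count thread-dump entries whose stack contains a frame mentioning `needle`."""
    count = 0
    counted = True  # no thread entry is open yet, so ignore lines before the first header
    for line in lines:
        # Drop the "INFO | jvm 1 | <timestamp> | " wrapper prefix if it is present.
        payload = line.split(" | ", 3)[-1].strip()
        if payload.startswith('"'):
            # A quoted name marks the start of a new thread entry.
            counted = False
        elif not counted and needle in payload:
            # Count each thread once, even if several of its frames match.
            count += 1
            counted = True
    return count


if __name__ == "__main__":
    prefix = "INFO | jvm 1 | 2021/03/17 12:55:21 | "
    sample = [
        prefix + '"pool-9971-thread-1" #83259 prio=5 os_prio=0 waiting on condition',
        prefix + "java.lang.Thread.State: WAITING (parking)",
        prefix + "at com.vmware.nsx.monitoring.clientlibrary.core."
        "EventReportProcessor$1.run(EventReportProcessor.java:94)",
        prefix + "at java.lang.Thread.run(Thread.java:748)",
    ]
    print(count_threads_with_frame(sample))
```

The sketch treats the dump as plain text and does not validate it, so the result is only an estimate if the log is truncated or interleaved with other output.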

Environment

VMware NSX-T Data Center

Cause

A memory leak occurs when certain events are called and not closed correctly.
This leak causes the proton service on an NSX-T manager to run out of memory and crash.
Transport nodes connected to that manager show a Controller Connectivity status of UNKNOWN. These hosts cannot receive any further updates, which can lead to VM connectivity issues.

Resolution

This issue is resolved in NSX-T 3.1.2, available at VMware Downloads.

Workaround:
Restart the proton service on the impacted NSX manager.
Alternatively, uninstall NSX Intelligence.