NSX-T VM connectivity issues and Transport node Controller Connectivity status is UNKNOWN
Article ID: 322497
Updated On:
Products
VMware NSX
Issue/Introduction
The cluster status shows as up and stable when you run: get cluster status
The Transport nodes show as connected in the Fabric screen.
In the Overview screen for System -> Fabric -> Nodes -> Edge or Host Transport nodes, the Controller Connectivity shows as UNKNOWN.
Tunnels to these Transport nodes also show as DOWN.
DFW rule publishing may fail due to this issue.
Other CLI commands, such as get nodes and get services, may fail (see the verification sketch after the log excerpts below).
You have NSX Intelligence installed.
In the NSX-T Manager proton-tomcat-wrapper.log, we see entries similar to:
Exception in thread "ForkJoinPool.commonPool-worker-4" java.lang.OutOfMemoryError: unable to create new native thread
The JVM has run out of memory. Requesting thread dump.
Dumping JVM state.
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:717)
        at java.util.concurrent.ForkJoinPool.createWorker(ForkJoinPool.java:1486)
        at java.util.concurrent.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1517)
        at java.util.concurrent.ForkJoinPool.deregisterWorker(ForkJoinPool.java:1609)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:167)
Exception in thread "ForkJoinPool.commonPool-worker-11" java.lang.OutOfMemoryError: unable to create new native thread
The JVM has run out of memory. Requesting thread dump.
In the NSX-T Manager nsxapi log, we see a large number of events like the following, for example 2 in 3 seconds:
If we take a thread dump, we see a very large number of threads referencing EventReportProcessor.java in the proton-tomcat-wrapper.log, like the following:
INFO | jvm 1 | 2021/03/17 12:55:21 | "pool-9971-thread-1" #83259 prio=5 os_prio=0 tid=0x0000725d04fb2800 nid=0x514 waiting on condition [0x0000725b6177d000]
INFO | jvm 1 | 2021/03/17 12:55:21 | java.lang.Thread.State: WAITING (parking)
INFO | jvm 1 | 2021/03/17 12:55:21 | at sun.misc.Unsafe.park(Native Method)
INFO | jvm 1 | 2021/03/17 12:55:21 | - parking to wait for <0x0000725d3bf01140> (a java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.FutureTask.get(FutureTask.java:191)
INFO | jvm 1 | 2021/03/17 12:55:21 | at com.vmware.nsx.monitoring.clientlibrary.core.EventReportProcessor$1.run(EventReportProcessor.java:94)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.FutureTask.run(FutureTask.java:266)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
INFO | jvm 1 | 2021/03/17 12:55:21 | at java.lang.Thread.run(Thread.java:748)
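To confirm these symptoms, the checks below can be run against the impacted NSX-T Manager. This is a minimal sketch, assuming admin access to the NSX CLI and root shell access to the appliance; the log path /var/log/proton/ used for the wrapper log is an assumption and may differ in your environment.

# NSX CLI (admin): the commands referenced above
get cluster status
get nodes
get services

# Root shell: approximate the number of leaked threads by counting stack frames
# that reference EventReportProcessor in the thread dump written to the wrapper log
# (each leaked thread contributes one matching frame; the path is an assumption)
grep -c "EventReportProcessor" /var/log/proton/proton-tomcat-wrapper.log

A count in the thousands, or one that keeps climbing between thread dumps, is consistent with this issue.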
Environment
VMware NSX-T Data Center 3.1.1
Cause
A memory leak occurs when certain events are raised and not closed correctly. The leak causes the proton service on an NSX-T Manager to run out of memory and crash. The affected manager is the one to which the Transport nodes with UNKNOWN Controller Connectivity are connected; those hosts can no longer receive configuration updates, which can lead to VM connectivity issues.
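Before the proton service actually crashes, the leak can be observed as a steadily growing native thread count on the proton JVM. The following is a hedged sketch, assuming root shell access on the NSX-T Manager appliance; the process pattern proton-tomcat is an assumption, so verify the correct PID with ps -ef before relying on it.

# Identify the proton JVM (pattern is an assumption; confirm with: ps -ef | grep proton)
PROTON_PID=$(pgrep -f proton-tomcat | head -1)
# Print the kernel's thread count for that process; repeat periodically.
# A value that keeps climbing toward the OS limit matches the
# "unable to create new native thread" errors shown above.
grep Threads /proc/${PROTON_PID}/status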
Resolution
This issue is resolved in VMware NSX-T Data Center 3.1.2 and later, available at Broadcom Downloads.
Workaround: Restart the proton service on the impacted NSX Manager, or uninstall NSX Intelligence.
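As a sketch of the workaround, the proton service can be restarted from a root shell on the impacted NSX Manager, or the manager service can be restarted from the admin NSX CLI. The service names shown here (proton for the init service, manager in the NSX CLI) are assumptions for illustration; confirm them with get services before restarting, and expect a brief management-plane interruption on that node.

# Root shell on the impacted NSX Manager (init service name assumed to be proton)
service proton restart

# Or from the NSX CLI as admin (service name assumed to be manager; verify with: get services)
restart service manager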