Application has crashed on NSX manager due to upgrade coordinator out of memory

Products

VMware NSX

Issue/Introduction

NSX Manager reports an "Application has crashed" alarm.
This issue can also occur on a Local Manager or a Global Manager in a Federation setup.
A heap dump file named "uc_oom.hprof" is present on the NSX Manager in /image/core.
API calls to the Upgrade Coordinator service may fail because of high memory usage or because the service has run out of memory.

The upgrade-coordinator log file shows multiple long-running transactions (identified by "seconds=xxxxxxx"):

/var/log/upgrade-coordinator/upgrade-coordinator.log
2025-02-15T09:19:59.215Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########0ead]: long running tx has been running for seconds=7317265, numTxnAccess=1
2025-02-15T09:19:59.727Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########7748]: long running tx has been running for seconds=7478925, numTxnAccess=1
2025-02-15T09:19:59.728Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########99ae]: long running tx has been running for seconds=8263421, numTxnAccess=1
2025-02-15T09:19:59.728Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########568d]: long running tx has been running for seconds=7316784, numTxnAccess=1
2025-02-15T09:19:59.729Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########6187]: long running tx has been running for seconds=7402426, numTxnAccess=1

Counting the distinct transaction IDs in the same log shows more than 200 long-running transactions, e.g.:

awk -F 'id=' '{print $2}' upgrade-coordinator*.log | awk -F ']' '{print $1}' | tr -s \\n | sort | uniq | wc -l
257
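
The "seconds=" values in these warnings are easier to judge when converted to days. The following is a minimal sketch, not part of the original procedure: it assumes root shell access to the NSX Manager and the log line format shown above, and the function name long_tx_ages is hypothetical. It keeps the largest "seconds=" value seen for each transaction ID and prints the age in days, oldest first.

```shell
# Hypothetical helper: print "<transaction id> <age in days>" for every
# long-running transaction found in the given log files, oldest first.
# Assumes the UfoTxnTracingService line format shown in this article.
long_tx_ages() {
  sed -n 's/.*UfoTxnTracingService\[id=\([^]]*\)]: long running tx has been running for seconds=\([0-9][0-9]*\).*/\1 \2/p' "$@" |
    awk '$2 > max[$1] { max[$1] = $2 }   # keep the largest value per id
         END { for (id in max) printf "%s %.1f\n", id, max[id] / 86400 }' |
    sort -k2,2nr
}
```

Example usage: long_tx_ages /var/log/upgrade-coordinator/upgrade-coordinator.log | head. Ages of several weeks or months are consistent with the uptime-related behaviour described in this article.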

An out-of-memory error can be observed in the Upgrade Coordinator tomcat-wrapper log:

/var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log
INFO | jvm 1 | 2025/02/05 01:30:41 | java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper | 2025/02/05 01:30:41 | The JVM has run out of memory. Requesting thread dump.
STATUS | wrapper | 2025/02/05 01:30:41 | Dumping JVM state.
STATUS | wrapper | 2025/02/05 01:30:41 | The JVM has run out of memory. Restarting JVM.
INFO | jvm 1 | 2025/02/05 01:30:41 | Dumping heap to /image/core/uc_oom.hprof ...
INFO | jvm 1 | 2025/02/05 01:30:42 | 2025-02-05 01:30:42
INFO | jvm 1 | 2025/02/05 01:30:42 | Full thread dump OpenJDK 64-Bit Server VM (11.0.23+10-LTS mixed mode):
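
To see how often the service has run out of memory, the timestamps of the OutOfMemoryError entries can be extracted from the wrapper log. This is a sketch under stated assumptions: the pipe-separated wrapper log layout shown above, root shell access, and a hypothetical function name (oom_events).

```shell
# Hypothetical helper: print the timestamp of every OutOfMemoryError entry
# in the given tomcat-wrapper log files.
# Assumes the "LEVEL | source | timestamp | message" layout shown above.
oom_events() {
  grep -hF 'java.lang.OutOfMemoryError' "$@" |
    awk -F ' *[|] *' '{ print $3 }'   # third pipe-separated field is the timestamp
}
```

Example usage: oom_events /var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log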

In the same log file, you may observe a high number of "http-nio-" threads that are blocked waiting on a ForkJoinPool task, e.g.:

INFO | jvm 1 | 2025/03/19 14:35:31 | "http-nio-127.0.0.1-7442-exec-151" #93999 daemon prio=5 os_prio=0 cpu=3.78ms elapsed=1244108.54s tid=0x00006e0cec08e000 nid=0x2d833e in Object.wait() [0x00006e0c8eda3000]
INFO | jvm 1 | 2025/03/19 14:35:31 | java.lang.Thread.State: WAITING (on object monitor)
INFO | jvm 1 | 2025/03/19 14:35:31 | at java.lang.Object.wait(java.base@11.0.23/Native Method)
INFO | jvm 1 | 2025/03/19 14:35:31 | - waiting on <no object reference available>
INFO | jvm 1 | 2025/03/19 14:35:31 | at java.util.concurrent.ForkJoinTask.externalAwaitDone(java.base@11.0.23/Unknown Source)
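
The number of affected threads in a thread dump can be counted rather than estimated by eye. The sketch below is illustrative only: it assumes the wrapper log layout shown above, where each thread header begins with a quoted thread name after the last pipe separator, and the function name count_blocked_http_threads is hypothetical.

```shell
# Hypothetical helper: count "http-nio-" threads whose stack contains
# ForkJoinTask.externalAwaitDone in the given wrapper log files.
# Assumes each thread header line has the quoted thread name right after "| ".
count_blocked_http_threads() {
  awk '
    # A new thread header: remember whether it is an http-nio thread.
    /[|] *"/ { in_http = ($0 ~ /"http-nio-/) }
    # Count each http-nio thread at most once.
    in_http && /ForkJoinTask\.externalAwaitDone/ { n++; in_http = 0 }
    END { print n + 0 }
  ' "$@"
}
```

Example usage: count_blocked_http_threads /var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log. If the log holds several thread dumps, the result is the total across all of them.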

The same behaviour may be observed on all NSX Managers, but not necessarily at the same time.
This issue is likely to manifest on Large/Extra Large Managers with an uptime greater than 3 months and on Medium Managers with an uptime greater than 6 weeks.
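
As a rough illustration of these uptime thresholds, the check below compares a manager's uptime with the approximate limits stated above. It is a sketch only: the form factor labels (medium, large, xlarge), the function names, and the reading of "3 months" as 90 days are assumptions, and exceeding the threshold indicates elevated likelihood, not a confirmed fault.

```shell
# Hypothetical helper: succeeds (exit 0) when the uptime in whole days
# exceeds the approximate threshold for the given form factor.
uptime_exceeds_threshold() {
  size=$1
  days=$2
  case "$size" in
    medium) limit=42 ;;           # 6 weeks
    large|xlarge) limit=90 ;;     # "3 months", assumed here to mean 90 days
    *) echo "unknown form factor: $size" >&2; return 2 ;;
  esac
  [ "$days" -gt "$limit" ]
}

# Whole days of uptime, read from /proc/uptime (Linux).
uptime_days() {
  awk '{ print int($1 / 86400) }' /proc/uptime
}
```

Example usage: uptime_exceeds_threshold medium "$(uptime_days)" && echo "uptime is past the threshold"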

Environment

VMware NSX 4.2.0.x
VMware NSX 4.2.1.3 and earlier
VMware NSX 9.0.0

Cause

The Upgrade Coordinator service runs out of memory because a large number of threads accumulate in the "WAITING" state and fill up the heap. This is caused by a known JDK issue (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow).

Resolution

For the resolution and workaround, see the article "NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow".

Article ID: 389094
