Application has crashed on NSX manager due to upgrade coordinator out of memory

Article ID: 389094

Products

VMware NSX

Issue/Introduction

  • NSX Manager reports an "Application has crashed" alarm.
  • This issue can also occur on a Local Manager or a Global Manager in a Federation setup.
  • A heap dump file named "uc_oom.hprof" is present in /image/core on the NSX Manager.
  • API calls to the Upgrade Coordinator service may fail because of high memory usage or because the service has run out of memory.
  • The upgrade-coordinator log file shows multiple long-running transactions (indicated by "seconds=xxxxxxx"):
    /var/log/upgrade-coordinator/upgrade-coordinator.log
    2025-02-15T09:19:59.215Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########0ead]: long running tx has been running for seconds=7317265, numTxnAccess=1
    2025-02-15T09:19:59.727Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########7748]: long running tx has been running for seconds=7478925, numTxnAccess=1
    2025-02-15T09:19:59.728Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########99ae]: long running tx has been running for seconds=8263421, numTxnAccess=1
    2025-02-15T09:19:59.728Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########568d]: long running tx has been running for seconds=7316784, numTxnAccess=1
    2025-02-15T09:19:59.729Z WARN tx-tracer-poller UfoTxnTracingService 121432 SYSTEM [nsx@6876 comp="global-manager" level="WARNING" subcomp="upgrade-coordinator"] UfoTxnTracingService[id=########-####-####-####-########6187]: long running tx has been running for seconds=7402426, numTxnAccess=1
  • More than 200 distinct long-running transactions can be observed in the same log file. For example, the following command counts the unique transaction IDs:
    awk -F 'id=' '{print $2}' upgrade-coordinator*.log | awk -F ']' '{print $1}' | tr -s \\n | sort | uniq | wc -l
    257
  • An out-of-memory error can be observed in the Upgrade Coordinator tomcat-wrapper log:
    /var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log
    INFO | jvm 1 | 2025/02/05 01:30:41 | java.lang.OutOfMemoryError: Java heap space
    STATUS | wrapper | 2025/02/05 01:30:41 | The JVM has run out of memory. Requesting thread dump.
    STATUS | wrapper | 2025/02/05 01:30:41 | Dumping JVM state.
    STATUS | wrapper | 2025/02/05 01:30:41 | The JVM has run out of memory. Restarting JVM.
    INFO | jvm 1 | 2025/02/05 01:30:41 | Dumping heap to /image/core/uc_oom.hprof ...
    INFO | jvm 1 | 2025/02/05 01:30:42 | 2025-02-05 01:30:42
    INFO | jvm 1 | 2025/02/05 01:30:42 | Full thread dump OpenJDK 64-Bit Server VM (11.0.23+10-LTS mixed mode):
  • In the same log file, you may observe a high number of "http-nio-" threads that are blocked on a ForkJoinPool task, for example:
    INFO | jvm 1 | 2025/03/19 14:35:31 | "http-nio-127.0.0.1-7442-exec-151" #93999 daemon prio=5 os_prio=0 cpu=3.78ms elapsed=1244108.54s tid=0x00006e0cec08e000 nid=0x2d833e in Object.wait() [0x00006e0c8eda3000]
    INFO | jvm 1 | 2025/03/19 14:35:31 | java.lang.Thread.State: WAITING (on object monitor)
    INFO | jvm 1 | 2025/03/19 14:35:31 | at java.lang.Object.wait(java.base@11.0.23/Native Method)
    INFO | jvm 1 | 2025/03/19 14:35:31 | - waiting on <no object reference available>
    INFO | jvm 1 | 2025/03/19 14:35:31 | at java.util.concurrent.ForkJoinTask.externalAwaitDone(java.base@11.0.23/Unknown Source)
  • The same behaviour may be observed on all NSX Managers, but not necessarily at the same time.
  • This issue is likely to manifest on Large/Extra Large Managers with an uptime greater than 3 months and on Medium Managers with an uptime greater than 6 weeks.
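
The "seconds=" values in the long-running transaction warnings can be related to the uptime thresholds above: for instance, seconds=7317265 corresponds to roughly 84.7 days. The following is a minimal sketch, not an official NSX tool, that counts the distinct transaction IDs and converts the longest duration to days. The function name summarize_long_tx is hypothetical, and the sketch assumes the log line format shown in the excerpts above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NSX): summarize "long running tx" warnings.
# Usage: summarize_long_tx /var/log/upgrade-coordinator/upgrade-coordinator*.log
summarize_long_tx() {
  # Keep only the warning lines, then reduce each one to "<id> <seconds>".
  grep -h 'long running tx has been running for seconds=' "$@" |
    sed -E 's/.*UfoTxnTracingService\[id=([^]]+)\].*seconds=([0-9]+).*/\1 \2/' |
    LC_ALL=C awk '
      # Remember the longest duration reported for each transaction id.
      { if (!($1 in max) || $2 + 0 > max[$1]) max[$1] = $2 + 0 }
      END {
        n = 0; longest = 0
        for (id in max) { n++; if (max[id] > longest) longest = max[id] }
        # 86400 seconds per day.
        printf "distinct_tx=%d longest_days=%.1f\n", n, longest / 86400
      }'
}
```

A large distinct_tx value combined with a longest_days value close to the Manager uptime is consistent with the symptoms described in this article, but it is not proof of the issue on its own.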

Environment

  • VMware NSX 4.2.0.x
  • VMware NSX 4.2.1.3 and earlier
  • VMware NSX 9.0.0

Cause

The Upgrade Coordinator service fails with an out-of-memory error because a large number of threads accumulate in the "WAITING" state and fill up the heap. This is caused by a known JDK issue (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow).
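
To gauge how many request threads are stuck in this way, the thread dump in the tomcat-wrapper log can be scanned for "http-nio-" threads whose stack contains ForkJoinTask.externalAwaitDone. The following is a rough sketch under the assumption that the thread dump has the format shown in the excerpts in this article. The function name count_blocked_http_nio is hypothetical and is not part of NSX. When the file contains several thread dumps, only the most recent one is counted.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NSX): count "http-nio-" threads that are
# waiting in ForkJoinTask.externalAwaitDone in the last thread dump of a file.
# Usage: count_blocked_http_nio /var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log
count_blocked_http_nio() {
  LC_ALL=C awk '
    # A new thread dump starts: discard the count from any earlier dump.
    /Full thread dump/ { blocked = 0; in_http = 0; next }
    # Thread header line, e.g. | "http-nio-127.0.0.1-7442-exec-151" #93999 ...
    /\| "[^"]+" #[0-9]+/ { in_http = ($0 ~ /\| "http-nio-/); next }
    # Count each http-nio thread once when its stack shows the ForkJoinTask wait.
    in_http && /ForkJoinTask\.externalAwaitDone/ { blocked++; in_http = 0 }
    END { print blocked + 0 }
  ' "$1"
}
```

A count in the hundreds would match the pattern described above, where blocked threads accumulate until the heap is exhausted.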

Resolution

Additional Information