NSX upgrade precheck fails due to OOM error and Upgrade Coordinator service crash

Article ID: 389145

Products

VMware NSX

Issue/Introduction

When upgrading from VMware NSX 4.1.x to NSX 4.2.x, the upgrade precheck fails with the error "upstream connect error or disconnect/reset before headers. reset reason connection failure, transport failure reason: delayed connect error:111"
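The trailing "error:111" in this message is an operating-system error number. The short Python sketch below translates it, assuming the NSX Manager appliance runs Linux. On Linux, 111 is ECONNREFUSED ("Connection refused"): the connection to the upstream service, presumably the Upgrade Coordinator here, was actively refused. That would be consistent with its JVM having crashed or restarting. The helper name describe_errno is hypothetical.

```python
import errno
import os


def describe_errno(code: int) -> tuple[str, str]:
    """Return the symbolic name and message for an OS error number.

    The name comes from errno.errorcode; "UNKNOWN" is returned when this
    platform has no symbolic name for the code.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # Error number taken from the precheck message "delayed connect error:111".
    name, message = describe_errno(111)
    print(f"111 -> {name}: {message}")
```

Error numbers are platform specific, so run this on a Linux host (such as the NSX Manager itself) to get the meaning that applies there.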

The Upgrade Coordinator log on the NSX Manager, /var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log, shows the JVM running out of heap memory and restarting:

INFO   | jvm 1    | 2025/01/27 14:54:43 | java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper  | 2025/01/27 14:54:43 | The JVM has run out of memory.  Requesting thread dump.
STATUS | wrapper  | 2025/01/27 14:54:43 | Dumping JVM state.
STATUS | wrapper  | 2025/01/27 14:54:43 | The JVM has run out of memory.  Restarting JVM.
INFO   | jvm 1    | 2025/01/27 14:54:43 | Dumping heap to /image/core/uc_oom.hprof ...
INFO   | jvm 1    | 2025/01/27 14:54:43 | 2025-01-27 14:54:43
INFO   | jvm 1    | 2025/01/27 14:54:43 | Full thread dump OpenJDK 64-Bit Server VM (11.0.23+10-LTS mixed mode):
INFO   | jvm 1    | 2025/01/28 14:05:05 | Heap
INFO   | jvm 1    | 2025/01/28 14:05:05 |  par new generation   total 157248K, used 157173K [0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX)
INFO   | jvm 1    | 2025/01/28 14:05:05 |   eden space 139776K,  99% used [0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX)
INFO   | jvm 1    | 2025/01/28 14:05:05 |   from space 17472K,  99% used [0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX)
INFO   | jvm 1    | 2025/01/28 14:05:05 |   to   space 17472K,   0% used [0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX)
INFO   | jvm 1    | 2025/01/28 14:05:05 |  concurrent mark-sweep generation total 349568K, used 349568K [0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX, 0x0000XXXXXXXXXXXX)
INFO   | jvm 1    | 2025/01/28 14:05:05 |  Metaspace       used 222489K, capacity 225485K, committed 228392K, reserved 1247232K
INFO   | jvm 1    | 2025/01/28 14:05:05 |   class space    used 28316K, capacity 29406K, committed 30304K, reserved 1048576K
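To confirm that a given upgrade-coordinator-tomcat-wrapper.log shows this failure, the minimal Python sketch below scans for the out-of-memory messages and heap-summary lines quoted above. It assumes the pipe-delimited "LEVEL | source | timestamp | message" layout seen in this excerpt. The function names and the embedded sample are illustrative only and are not part of any NSX tool.

```python
import re

# Matches heap-summary generation lines such as:
#   " par new generation   total 157248K, used 157173K [...]"
#   " concurrent mark-sweep generation total 349568K, used 349568K [...]"
GEN_RE = re.compile(
    r"^\s*(?P<name>.+?generation)\s+total\s+(?P<total>\d+)K,\s+used\s+(?P<used>\d+)K"
)


def split_wrapper_line(line: str) -> tuple[str, str]:
    """Split 'LEVEL | source | timestamp | message' into (level, message)."""
    parts = line.split("|", 3)
    if len(parts) < 4:
        return "", line
    return parts[0].strip(), parts[3]


def scan_wrapper_log(lines):
    """Return (oom_events, generations) found in wrapper log lines.

    oom_events: messages that mention java.lang.OutOfMemoryError or the
    wrapper's "run out of memory" status.
    generations: {name: (used_k, total_k, percent_used)} from heap summaries.
    """
    oom_events = []
    generations = {}
    for raw in lines:
        _level, message = split_wrapper_line(raw.rstrip("\n"))
        if "java.lang.OutOfMemoryError" in message or "run out of memory" in message:
            oom_events.append(message.strip())
        match = GEN_RE.match(message)
        if match:
            total = int(match["total"])
            used = int(match["used"])
            percent = 100.0 * used / total if total else 0.0
            generations[match["name"].strip()] = (used, total, percent)
    return oom_events, generations


# A few lines copied from the excerpt (addresses shortened).
SAMPLE = """\
INFO   | jvm 1    | 2025/01/27 14:54:43 | java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper  | 2025/01/27 14:54:43 | The JVM has run out of memory.  Restarting JVM.
INFO   | jvm 1    | 2025/01/28 14:05:05 |  par new generation   total 157248K, used 157173K [0x0]
INFO   | jvm 1    | 2025/01/28 14:05:05 |  concurrent mark-sweep generation total 349568K, used 349568K [0x0]
"""

if __name__ == "__main__":
    events, gens = scan_wrapper_log(SAMPLE.splitlines())
    for event in events:
        print("OOM:", event)
    for name, (used, total, pct) in gens.items():
        print(f"{name}: {used}K / {total}K ({pct:.1f}% used)")
```

To run it against the real file, pass the open log instead of the sample, for example scan_wrapper_log(open(path)). In the excerpt above, the concurrent mark-sweep (old) generation is completely full (349568K used of 349568K) and the young generation is within 75K of full, which is the pattern this sketch surfaces.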

New heap dump files named core.uc_oom.<random_numbers>.hprof may also be generated under /image/cores on the Upgrade Coordinator node.
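The following Python sketch lists such heap dumps, newest first, using the directory and naming pattern given above. The directory is a parameter because the wrapper log excerpt reports dumping to /image/core/uc_oom.hprof, so the exact location may differ on your appliance. The function names are hypothetical.

```python
from pathlib import Path

# Naming pattern described in this article: core.uc_oom.<random_numbers>.hprof
DUMP_PATTERN = "core.uc_oom.*.hprof"


def is_uc_oom_dump(name: str) -> bool:
    """Return True if the file name matches the Upgrade Coordinator OOM dump pattern."""
    return Path(name).match(DUMP_PATTERN)


def find_uc_oom_dumps(directory: str = "/image/cores") -> list[Path]:
    """List matching heap dumps in directory, newest first.

    A missing directory yields an empty list rather than an error.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    dumps = [p for p in root.glob(DUMP_PATTERN) if p.is_file()]
    return sorted(dumps, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for dump in find_uc_oom_dumps():
        size_mib = dump.stat().st_size / (1024 * 1024)
        print(f"{dump}  {size_mib:.1f} MiB")
```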

Dumping the BaseHostSwitchProfile Corfu table with the following command may also fail with an out-of-memory error:

corfu_tool_runner.py -n nsx -t BaseHostSwitchProfile -o showTable

Environment

VMware NSX

Cause

This issue is caused by a large number of UplinkHSProfile entries in the Corfu database. These uplink profiles are unused; they are created automatically by a security-only workflow. When the upgrade precheck reads this table, the large number of entries contributes to a memory leak that exhausts the Upgrade Coordinator's Java heap.

Resolution

This issue will be fixed in a future NSX version.

If you believe you have encountered this issue and are unable to upgrade, please open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.