INFO | jvm 3 | <DATE/TIME> | #
INFO | jvm 3 | <DATE/TIME> | # java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper | <DATE/TIME> | The JVM has run out of memory. Requesting thread dump.
STATUS | wrapper | <DATE/TIME> | Dumping JVM state.
STATUS | wrapper | <DATE/TIME> | The JVM has run out of memory. Restart JVM (Ignoring, already restarting).
INFO | jvm 3 | <DATE/TIME> | # -XX:OnOutOfMemoryError="gzip -f /image/core/uc_oom.hprof"
INFO | jvm 3 | <DATE/TIME> | # Executing /bin/sh -c "gzip -f /image/core/uc_oom.hprof"...
An Upgrade Coordinator core dump is located on the NSX-T Manager partition /image/core/:
-rw------- 1 uuc uuc 112M <DATE/TIME> uc_oom.hprof.gz
GET https://<NSX_IP>/api/v1/edge-clusters
"result_count": 46,<<<<<<<<<<<<
GET https://<NSX_IP>/api/v1/fabric/compute-collections
"result_count": 2441,<<<<<<<<
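To gauge whether an environment is at a comparable scale, the two list responses above can be reduced to their counts. The following is a minimal sketch, not an official tool: the helper names `result_count` and `scale_summary` are hypothetical, and the sample bodies are trimmed to the single `result_count` field shown in this article.

```python
import json


def result_count(response_body: str) -> int:
    """Return the result_count field of an NSX list response body."""
    return int(json.loads(response_body)["result_count"])


def scale_summary(edge_clusters_body: str, compute_collections_body: str) -> dict:
    """Summarize how many Edge clusters and compute collections exist.

    Inputs are the raw JSON bodies returned by
    GET /api/v1/edge-clusters and GET /api/v1/fabric/compute-collections.
    """
    return {
        "edge_clusters": result_count(edge_clusters_body),
        "compute_collections": result_count(compute_collections_body),
    }


# Sample bodies (assumption: trimmed to the one field quoted in the article).
summary = scale_summary('{"result_count": 46}', '{"result_count": 2441}')
print(summary)
```

The bodies would normally come from an authenticated API client; fetching them is left out so the sketch stays independent of any particular HTTP library.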
VMware NSX-T 3.2.3, 4.0.x and 4.1.x and below.
In a scaled environment, during the NSX-T Edge pre-check, the Upgrade Coordinator (UC) loads all compute collections and matches them against the compute on which each Edge is deployed. This workflow runs in parallel for all Edge clusters at the same time. As a result, the UC runs out of memory and the Edge pre-check fails.
This issue is resolved in VMware NSX 3.2.4.0 and 4.2.0, available at Broadcom downloads.
If you are having difficulty finding and downloading the software, please review the Download Broadcom products and software KB.
Workaround
Because the Edge prechecks are not running, you need to confirm manually that there are no issues with the Edge nodes. Complete steps 1 to 5 below:
GET https://<mgrIp>/api/v1/transport-nodes/<nodeid>/status?source=realtime
GET https://<mgrIp>/api/v1/transport-nodes/<nodeid>/state
GET https://<mgrIp>/api/v1/edge-clusters/<edgeclusterId>/status?source=realtime
GET https://<mgrIp>/api/v1/alarms?status=OPEN&node_id=<nodeId>
POST https://<mgrIp>/api/v1/upgrade/plan?action=reset&component_type=EDGE
POST https://<mgrIp>/api/v1/upgrade/plan?component_type=EDGE&action=start
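When there are many Edge nodes, the calls above can be generated per node rather than typed by hand. The sketch below only builds the method and URL pairs listed in this article; it does not send anything. The function names and the sample manager address and IDs are hypothetical placeholders.

```python
from urllib.parse import urlencode


def edge_check_requests(mgr_ip: str, node_id: str, edge_cluster_id: str) -> list:
    """Return (method, url) pairs for the manual Edge health checks."""
    base = f"https://{mgr_ip}/api/v1"
    realtime = urlencode({"source": "realtime"})
    alarms = urlencode({"status": "OPEN", "node_id": node_id})
    return [
        ("GET", f"{base}/transport-nodes/{node_id}/status?{realtime}"),
        ("GET", f"{base}/transport-nodes/{node_id}/state"),
        ("GET", f"{base}/edge-clusters/{edge_cluster_id}/status?{realtime}"),
        ("GET", f"{base}/alarms?{alarms}"),
    ]


def edge_plan_requests(mgr_ip: str) -> list:
    """Return (method, url) pairs that reset and then start the Edge upgrade plan."""
    base = f"https://{mgr_ip}/api/v1/upgrade/plan"
    reset = urlencode({"action": "reset", "component_type": "EDGE"})
    start = urlencode({"component_type": "EDGE", "action": "start"})
    return [("POST", f"{base}?{reset}"), ("POST", f"{base}?{start}")]


# Placeholder values for illustration only.
for method, url in edge_check_requests("nsx.example.local", "node-1", "cluster-1"):
    print(method, url)
for method, url in edge_plan_requests("nsx.example.local"):
    print(method, url)
```

Keeping the health checks and the plan requests in separate functions makes it easy to review the check results for every node before issuing the reset and start calls.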
After the Edge nodes are upgraded using the workaround provided in the resolution section, the NSX Manager prechecks fail. The NSX Manager precheck depends on the completion of the Edge prechecks, which were bypassed, and this prevents the NSX Manager upgrade from starting.
Workaround for Manager prechecks failing:
You can manually run the precheck for the NSX Manager nodes using the API and then trigger the upgrade using either the API or the NSX UI.
1. Run the NSX Manager prechecks only, bypassing the Edge and Host prechecks, using the API provided below:
POST https://<nsx-mgr-IP>/api/v1/upgrade?component_type=MP&action=execute_pre_upgrade_checks
2. After confirming that all checks have passed, trigger the NSX Manager upgrade from the NSX UI or using the API provided below:
POST https://<nsx-mgr-IP>/api/v1/upgrade/plan?action=upgrade&component_type=MP
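The two Manager calls above can be prepared as request objects ahead of time. This is a rough sketch under stated assumptions: it uses HTTP basic authentication, the function name `manager_upgrade_requests` is hypothetical, and the address and credentials are placeholders. The requests are only constructed, not sent, so the precheck results can be reviewed before the upgrade call is issued.

```python
import base64
import urllib.request


def manager_upgrade_requests(mgr_ip: str, user: str, password: str) -> list:
    """Build the Manager precheck and upgrade POST requests without sending them."""
    # Assumption: basic authentication; adapt to the auth method in use.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}"}
    base = f"https://{mgr_ip}/api/v1"
    urls = [
        # Step 1: run only the Manager (MP) prechecks.
        f"{base}/upgrade?component_type=MP&action=execute_pre_upgrade_checks",
        # Step 2: start the Manager upgrade once the prechecks have passed.
        f"{base}/upgrade/plan?action=upgrade&component_type=MP",
    ]
    return [urllib.request.Request(u, headers=headers, method="POST") for u in urls]


# Placeholder values for illustration only.
precheck, upgrade = manager_upgrade_requests("nsx.example.local", "admin", "changeme")
print(precheck.get_method(), precheck.full_url)
print(upgrade.get_method(), upgrade.full_url)
```

Each request could then be passed to `urllib.request.urlopen` in turn, with the second call made only after the precheck status has been confirmed as passed.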