"Unexpected error while upgrading upgrade unit. Failed to exit node <UUID> from maintenance mode" while Upgrading NSX Manager (Local or Global)

Article ID: 382149

Products

VMware NSX

Issue/Introduction

NSX management node upgrade fails with an error: "Unexpected error while upgrading upgrade unit. Failed to exit node <UUID> from maintenance mode. Please retry the operation."
In the appliance CLI, logged in as admin, running get cluster status shows Group Type: MONITORING with status DOWN.
This issue can also be observed during a new NSX Manager deployment via VCF Cloud Builder with the Small form factor.

Log lines similar to the below are encountered in /var/log/phonehome-coordinator/phonehome-coordinator-tomcat-wrapper.log:

INFO   | jvm 5    | Unable to create /image/core/phc_oom.hprof: File exists
INFO   | jvm 5    | Terminating due to java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper  | The JVM has run out of memory.  Requesting thread dump.
STATUS | wrapper  | Dumping JVM state.
ERROR  | wrapper  | JVM exited unexpectedly.
STATUS | wrapper  | JVM process is gone.
STATUS | wrapper  | Launching a JVM.
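
As a rough aid for spotting this signature, the following Python sketch filters log text for the out-of-memory lines shown above. The function name find_oom_lines and the marker list are illustrative assumptions, not part of any NSX tooling, and the sample text is copied from the excerpt above.

```python
# Minimal sketch: find out-of-memory markers in phonehome-coordinator wrapper log text.
# The marker strings come from the log excerpt in this article; the helper name is hypothetical.
OOM_MARKERS = (
    "java.lang.OutOfMemoryError",
    "The JVM has run out of memory",
)


def find_oom_lines(log_text):
    """Return the log lines that contain any of the OOM markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in OOM_MARKERS)
    ]


sample = """\
INFO   | jvm 5    | Unable to create /image/core/phc_oom.hprof: File exists
INFO   | jvm 5    | Terminating due to java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper  | The JVM has run out of memory.  Requesting thread dump.
STATUS | wrapper  | Launching a JVM.
"""

for line in find_oom_lines(sample):
    print(line)
```

In practice, the text passed to the function would be the contents of the wrapper log copied off the appliance; an empty result only means these particular markers were not found.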

A heap dump file is present at /image/core/phc_oom.hprof.

Running get group maintenance-mode status returns output similar to the following:

Group Type: MANAGER
Members:
    UUID                                    Leadership Work Completed    Group Update Ack Received    Maintenance Mode Status
    ########-####-####-####-############    False                        True                         MAINTENANCE_MODE_FAILED
    ########-####-####-####-############    True                         True                         MAINTENANCE_MODE_OFF
    ########-####-####-####-############    True                         True                         MAINTENANCE_MODE_OFF

Group Type: ASYNC_REPLICATOR
Members:
    UUID                                    Leadership Work Completed    Group Update Ack Received    Maintenance Mode Status
    ########-####-####-####-############    False                        True                         MAINTENANCE_MODE_FAILED
    ########-####-####-####-############    True                         True                         MAINTENANCE_MODE_OFF
    ########-####-####-####-############    True                         True                         MAINTENANCE_MODE_OFF

Note: The log excerpts above are examples only. Dates, times, and environment-specific values may vary depending on your environment.
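
When the maintenance-mode output has been saved as text, the affected members can be picked out with a short script. The Python sketch below assumes the layout shown in the example output (a Group Type: line followed by four-column member rows); the function name failed_members and the UUIDs in the sample are made up for illustration.

```python
# Minimal sketch: list members reporting MAINTENANCE_MODE_FAILED in saved CLI output.
# Assumes each member row has exactly four whitespace-separated fields, as in the example
# output in this article. The helper name and the sample UUIDs are hypothetical.
def failed_members(output):
    """Return (group_type, uuid) pairs whose status is MAINTENANCE_MODE_FAILED."""
    failed = []
    group = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Group Type:"):
            group = line.split(":", 1)[1].strip()
            continue
        fields = line.split()
        # The header row splits into more than four fields, so it is skipped here.
        if len(fields) == 4 and fields[3] == "MAINTENANCE_MODE_FAILED":
            failed.append((group, fields[0]))
    return failed


sample = """\
Group Type: MANAGER
Members:
    UUID    Leadership Work Completed    Group Update Ack Received    Maintenance Mode Status
    11111111-1111-1111-1111-111111111111    False    True    MAINTENANCE_MODE_FAILED
    22222222-2222-2222-2222-222222222222    True     True    MAINTENANCE_MODE_OFF

Group Type: ASYNC_REPLICATOR
Members:
    UUID    Leadership Work Completed    Group Update Ack Received    Maintenance Mode Status
    11111111-1111-1111-1111-111111111111    False    True    MAINTENANCE_MODE_FAILED
    22222222-2222-2222-2222-222222222222    True     True    MAINTENANCE_MODE_OFF
"""

for group, uuid in failed_members(sample):
    print(group, uuid)
```

The same UUID appearing as failed in more than one group points at a single impacted manager node, which is the node to act on in the Resolution section.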

Environment

VMware NSX

Cause

A race condition causes the phonehome-coordinator (Monitoring) service to run out of memory during initialization. The service crashes and fails to start, so the upgrade cannot continue.

Resolution

Apply the workaround that matches the form factor of the NSX Manager VMs:

If the NSX Manager VMs have been deployed with the small form factor (which is not supported in production environments):
1. Power off the NSX Manager VM.
2. Increase the VM resources to match the medium form factor or higher.
3. Power on the VM.
4. Wait until the NSX Manager is back online.
5. Repeat steps 1 through 4 on the other two NSX Managers until all three have been resized.

For more information, see NSX Manager VM and Host Transport Node System Requirements.

If the NSX Manager VMs are deployed with the medium form factor or higher:
1. Reboot the impacted NSX Manager VM to clear the "out of memory" condition.

Note: This issue is fixed in NSX 4.2.1.1 and later.

Additional Information

For additional information, see Troubleshooting NSX Manager Upgrade Failures.
