Unexpected error while upgrading upgrade unit. Failed to Exit node <manager uuid> from maintenance mode. Please retry the operation

Article ID: 372936

Updated On:

Products

VMware NSX

VMware NSX-T Data Center

Issue/Introduction

You are upgrading NSX-T from a 3.2.x or 4.1.x version to a later 3.2.x or 4.1.x version.
During the upgrade, the host and edge upgrades complete successfully.
During the manager upgrade, one or more manager nodes fail.
The upgrade status in the NSX-T UI remains stuck at "In Progress".
If you expand sequence number 3, Node OS Upgrade, under the Upgrade tab in the NSX-T UI, you can see that the manager node reaches 90% but eventually fails.
The upgrade details tab shows a "Failed" status with the following error:

"Failed to exit node <Manager UUID> from maintenance mode. Please retry the operation"

The NSX Managers upgrade stops with a Failed status and the error "NSX Managers upgrade has failed, check error details to determine if manual resolution is needed and 'Retry Upgrade'."

Environment

VMware NSX

VMware NSX-T Data Center

Cause

SSH into the failing manager node and run the following command in admin mode:

nsxmgr1> get group maintenance-mode status

Group Type: <name of the service>
Members:
UUID              Leadership Work Completed  Group Update Ack Received  Maintenance Mode Status
<Manager 1 UUID>  True                       False                      MAINTENANCE_MODE_FAILED
<Manager 2 UUID>  False                      False                      MAINTENANCE_MODE_OFF
<Manager 3 UUID>  True                       True                       MAINTENANCE_MODE_OFF

Note: The command "get group maintenance-mode status" must be typed in full, because it does not auto-complete.

All fields must be True and the Maintenance Mode Status must be "MAINTENANCE_MODE_OFF" for all three managers.
If any node shows a status of "MAINTENANCE_MODE_FAILED", check its "Group Update Ack Received" field, which will be False.
This happens when the CCP messages are not received in time.
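
The health check described above can be sketched as a small parser. This is an illustration only, not an NSX tool: it assumes the command output has been captured as text in the layout shown in this article (a UUID followed by three whitespace-separated columns), and the names `Member`, `parse_members`, and `unhealthy_members` are hypothetical.

```python
from typing import NamedTuple


class Member(NamedTuple):
    uuid: str
    leadership_work_completed: bool
    group_update_ack_received: bool
    maintenance_mode_status: str


def parse_members(output):
    """Parse member rows from captured 'get group maintenance-mode status' text.

    Assumption: every member row ends with
    <True|False> <True|False> MAINTENANCE_MODE_<state>, as in this article.
    """
    members = []
    for line in output.splitlines():
        tokens = line.split()
        # Skip headers, blank lines, and anything that is not a member row.
        if len(tokens) < 4 or not tokens[-1].startswith("MAINTENANCE_MODE_"):
            continue
        if tokens[-3] not in ("True", "False") or tokens[-2] not in ("True", "False"):
            continue
        members.append(Member(
            uuid=" ".join(tokens[:-3]),
            leadership_work_completed=tokens[-3] == "True",
            group_update_ack_received=tokens[-2] == "True",
            maintenance_mode_status=tokens[-1],
        ))
    return members


def unhealthy_members(members):
    """Return members that do not have both flags True and status MAINTENANCE_MODE_OFF."""
    return [
        m for m in members
        if not (m.leadership_work_completed
                and m.group_update_ack_received
                and m.maintenance_mode_status == "MAINTENANCE_MODE_OFF")
    ]


# Sample text copied from the example output in this article.
SAMPLE = """\
Group Type: <name of the service>
Members:
UUID Leadership Work Completed Group Update Ack Received Maintenance Mode Status
<Manager 1 UUID> True False MAINTENANCE_MODE_FAILED
<Manager 2 UUID> False False MAINTENANCE_MODE_OFF
<Manager 3 UUID> True True MAINTENANCE_MODE_OFF
"""

for member in unhealthy_members(parse_members(SAMPLE)):
    print("Needs attention:", member.uuid, member.maintenance_mode_status)
```

Applied to the sample output, the sketch flags every node that still has a False field, not only the node reporting "MAINTENANCE_MODE_FAILED", which matches the rule that all fields must be True.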

Resolution

Workaround 1

1. SSH into all three manager nodes as the root user.
2. Restart the NSX Central Control Plane (nsx-ccp) service on all three manager nodes:
   /etc/init.d/nsx-ccp restart
3. Wait 10 minutes and check the output of "get group maintenance-mode status". If it shows "True" for all parameters, go to step 8.
4. Otherwise, call the following APIs (for example, with Postman):
   POST https://<nsx-mgr>/api/v1/cluster-manager/nodes/<nsx-mgr1-uuid>?action=maintenance_mode_off
   POST https://<nsx-mgr>/api/v1/cluster-manager/nodes/<nsx-mgr2-uuid>?action=maintenance_mode_off
   POST https://<nsx-mgr>/api/v1/cluster-manager/nodes/<nsx-mgr3-uuid>?action=maintenance_mode_off
5. Wait 10 minutes and check the output of "get group maintenance-mode status". If it shows "True" for all parameters, go to step 8.
6. Otherwise, restart the NSX Central Control Plane (nsx-ccp) service on all three manager nodes again:
   /etc/init.d/nsx-ccp restart
7. Wait 10 minutes and check the output of "get group maintenance-mode status". If it shows "True" for all parameters, go to step 8. Otherwise, log a Broadcom Support ticket and involve the support team.
8. Go to the Upgrade UI and continue the MP upgrade.
NOTE: In this case the upgrade must be resumed explicitly from the NSX UI. Do not use any CLI command to resume it.
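
The three maintenance_mode_off POST calls in the workaround can also be issued from a script instead of Postman. The following is a minimal sketch rather than a supported tool: the endpoint path is the one given in this article, while HTTP basic authentication, the credentials passed in, and the helper names are assumptions made for illustration. It also assumes the manager certificate is trusted by the machine running the script.

```python
import base64
import urllib.request


def maintenance_mode_off_url(manager, node_uuid):
    """Build the URL from this article for one manager node."""
    return (f"https://{manager}/api/v1/cluster-manager/nodes/"
            f"{node_uuid}?action=maintenance_mode_off")


def build_request(manager, node_uuid, user, password):
    """Prepare (but do not send) the POST request.

    Assumption: the NSX Manager accepts HTTP basic authentication.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        maintenance_mode_off_url(manager, node_uuid),
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


def exit_maintenance_mode(manager, node_uuids, user, password):
    """Send one POST per node UUID and return the HTTP status for each.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    failure on any node stops the loop and surfaces the error.
    """
    statuses = {}
    for node_uuid in node_uuids:
        request = build_request(manager, node_uuid, user, password)
        with urllib.request.urlopen(request) as response:
            statuses[node_uuid] = response.status
    return statuses
```

A call such as `exit_maintenance_mode("<nsx-mgr>", ["<nsx-mgr1-uuid>", "<nsx-mgr2-uuid>", "<nsx-mgr3-uuid>"], user, password)` would cover all three nodes. Afterwards, wait and re-check "get group maintenance-mode status" as the workaround describes.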

Workaround 2

If the manager node is stuck during reboot, the following log messages appear in /var/log/syslog:

---snip---

reboot.target: Job reboot.target/start timed out.
<Date>T<Time>Z <hostname> systemd 1 - - Timed out starting Reboot.
<Date>T<Time>Z <hostname> systemd 1 - - reboot.target: Job reboot.target/start failed with result

---snip---

In such scenarios, reboot the manager node manually and then perform Workaround 1 above.
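
To check a saved copy of the log for these messages, a short scan along the following lines can help. It is a sketch under stated assumptions: the marker strings are taken from the excerpt in this article, and the function names are hypothetical.

```python
# Substrings taken from the syslog excerpt in this article.
REBOOT_TIMEOUT_MARKERS = (
    "reboot.target: Job reboot.target/start timed out",
    "Timed out starting Reboot",
    "reboot.target: Job reboot.target/start failed with result",
)


def find_reboot_timeouts(lines):
    """Return the lines that contain any of the reboot-timeout markers."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in REBOOT_TIMEOUT_MARKERS)
    ]


def scan_file(path):
    """Scan a log file, tolerating undecodable bytes in the log."""
    with open(path, errors="replace") as handle:
        return find_reboot_timeouts(handle)
```

Running `scan_file("/var/log/syslog")` on the affected manager node, or on a copy of its log, returns the matching lines. An empty result suggests the node is not stuck in this reboot timeout.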

Resolution:

This is a known issue impacting VMware NSX.

Additional Information

If you instead see the error "Management Plane node failed to enter maintenance mode", refer to the following KB article:
During NSX upgrade the Management Plane node failed to enter maintenance mode
