Unexpected error while upgrading upgrade unit. Failed to Exit node <manager uuid> from maintenance mode. Please retry the operation

Article ID: 372936

Products

VMware NSX

VMware NSX-T Data Center

Issue/Introduction

  • You are upgrading NSX-T from version 3.2.x or 4.1.x to version 3.2.x or 4.1.x.
  • The host and edge upgrades complete successfully.
  • During the manager upgrade, one or more manager nodes fail.
  • The upgrade status in the NSX-T UI is stuck at "In Progress".
  • If you expand sequence number 3, "Node OS Upgrade", under the Upgrade tab of the NSX-T UI, you can see that the manager node reaches 90% but eventually fails.
  • The upgrade details tab shows a "Failed" status with the following error:

"Failed to exit node <Manager UUID> from maintenance mode. Please retry the operation"

  • The NSX Managers upgrade stops with a Failed status and the error "NSX Managers upgrade has failed, check error details to determine if manual resolution is needed and 'Retry Upgrade'."

 

Environment

VMware NSX

VMware NSX-T Data Center

Cause

 

  • SSH into the failing manager node and run the following command in admin mode:

nsxmgr1> get group maintenance-mode status

Group Type: <name of the service>
Members:
    UUID                Leadership Work Completed    Group Update Ack Received    Maintenance Mode Status
    <Manager 1 UUID>    True                         False                        MAINTENANCE_MODE_FAILED
    <Manager 2 UUID>    False                        False                        MAINTENANCE_MODE_OFF
    <Manager 3 UUID>    True                         True                         MAINTENANCE_MODE_OFF

Note: The command "get group maintenance-mode status" must be typed in full because it does not auto-complete.

  • For all three managers, every field must be True and the Maintenance Mode Status must be "MAINTENANCE_MODE_OFF".
  • If any node shows a status of "MAINTENANCE_MODE_FAILED", check "Group Update Ack Received" for that node; it will be False.
  • This happens when the CCP (Central Control Plane) messages are not received in time.
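
The health condition above can be checked mechanically. The following is a minimal Python sketch, not an official tool: it assumes the command output has been captured as text and that each member row ends with the two True/False columns followed by the status, as in the sample above. The function names parse_members and unhealthy_members are hypothetical.

```python
import re

# Matches the status token that ends each member row, e.g. MAINTENANCE_MODE_OFF.
STATUS_RE = re.compile(r"^MAINTENANCE_MODE_[A-Z_]+$")
BOOLS = ("True", "False")


def parse_members(cli_output):
    """Return one dict per member row found in the captured CLI output."""
    members = []
    for line in cli_output.splitlines():
        fields = line.split()
        # Assumed row layout: <uuid> <leadership> <ack> <status>
        if len(fields) < 4 or not STATUS_RE.match(fields[-1]):
            continue
        if fields[-2] not in BOOLS or fields[-3] not in BOOLS:
            continue
        members.append({
            "uuid": " ".join(fields[:-3]),
            "leadership_work_completed": fields[-3] == "True",
            "group_update_ack_received": fields[-2] == "True",
            "status": fields[-1],
        })
    return members


def unhealthy_members(members):
    """Return members that are not 'all True and MAINTENANCE_MODE_OFF'."""
    return [
        m for m in members
        if not (m["leadership_work_completed"]
                and m["group_update_ack_received"]
                and m["status"] == "MAINTENANCE_MODE_OFF")
    ]
```

An empty list from unhealthy_members(parse_members(output)) corresponds to the healthy state described above; any returned entry names a node that still needs attention.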

Resolution

Workaround 1

SSH into all three manager nodes as the root user, then proceed as follows:

  1. Restart the NSX Central Control Plane (nsx-ccp) service on all three manager nodes.
    Command: /etc/init.d/nsx-ccp restart
  2. Wait 10 minutes, then check the output of "get group maintenance-mode status".
    If the output shows "True" for all the parameters, go to step 8.
  3. Otherwise, call the following APIs (for example, with Postman):
    POST https://<nsx-mgr>/api/v1/cluster-manager/nodes/<nsx-mgr1-uuid>?action=maintenance_mode_off
    POST https://<nsx-mgr>/api/v1/cluster-manager/nodes/<nsx-mgr2-uuid>?action=maintenance_mode_off
    POST https://<nsx-mgr>/api/v1/cluster-manager/nodes/<nsx-mgr3-uuid>?action=maintenance_mode_off
  4. Wait 10 minutes, then check the output of "get group maintenance-mode status".
    If the output shows "True" for all the parameters, go to step 8.
  5. Otherwise, restart the NSX Central Control Plane (nsx-ccp) service on all three manager nodes again.
    Command: /etc/init.d/nsx-ccp restart
  6. Wait 10 minutes, then check the output of "get group maintenance-mode status".
    If the output shows "True" for all the parameters, go to step 8.
  7. Otherwise, open a Broadcom Support ticket and involve the support team.
  8. Go to the Upgrade UI and continue the MP (Management Plane) upgrade.
    NOTE: In this case the upgrade must be resumed explicitly from the NSX UI; do not use any CLI command to resume it.
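
For those who prefer a script to Postman, the three POST calls in step 3 could be sketched in Python as below. This is an illustration under stated assumptions, not a supported procedure: it assumes HTTP basic authentication with an NSX admin account, an empty request body, and a manager certificate that the client already trusts. The function names are hypothetical; only the URL pattern comes from the steps above.

```python
import base64
import urllib.request


def maintenance_mode_off_url(manager, node_uuid):
    """Build the maintenance_mode_off URL for one manager node (step 3)."""
    return (f"https://{manager}/api/v1/cluster-manager/nodes/"
            f"{node_uuid}?action=maintenance_mode_off")


def build_request(manager, node_uuid, user, password):
    """Prepare, but do not send, the POST request. Basic auth is an assumption."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        maintenance_mode_off_url(manager, node_uuid),
        data=b"",  # assumed: the action takes no request body
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


def exit_maintenance_mode(manager, node_uuids, user, password):
    """Send the POST for each node UUID and return the HTTP status per node."""
    statuses = {}
    for node_uuid in node_uuids:
        req = build_request(manager, node_uuid, user, password)
        # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
        with urllib.request.urlopen(req, timeout=30) as resp:
            statuses[node_uuid] = resp.status
    return statuses
```

Calling exit_maintenance_mode("<nsx-mgr>", ["<nsx-mgr1-uuid>", "<nsx-mgr2-uuid>", "<nsx-mgr3-uuid>"], user, password) would issue the same three requests as step 3; afterwards, continue with step 4 as written.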

Workaround 2

If the manager node is stuck during reboot, the following log messages appear in /var/log/syslog:

---snip---

reboot.target: Job reboot.target/start timed out.
<Date>T<Time>Z <hostname> systemd 1 - - Timed out starting Reboot.
<Date>T<Time>Z <hostname> systemd 1 - - reboot.target: Job reboot.target/start failed with result 

---snip---

In such scenarios, reboot the manager manually and then perform Workaround 1.
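
To check a manager's syslog for these messages without reading it by eye, a small Python sketch such as the following could be used. It simply searches for the message fragments quoted in the snippet above; the function name find_reboot_timeouts is hypothetical, and the script is an illustration rather than a supported diagnostic.

```python
# Message fragments taken from the log snippet in this article.
REBOOT_TIMEOUT_MARKERS = (
    "reboot.target: Job reboot.target/start timed out.",
    "Timed out starting Reboot.",
    "reboot.target: Job reboot.target/start failed with result",
)


def find_reboot_timeouts(lines):
    """Return the log lines that contain any of the reboot-timeout messages."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in REBOOT_TIMEOUT_MARKERS)
    ]


if __name__ == "__main__":
    # Assumed log location, matching the path named in this article.
    with open("/var/log/syslog", errors="replace") as log:
        for hit in find_reboot_timeouts(log):
            print(hit)
```

Any printed line suggests the node timed out while rebooting, which is the situation this workaround addresses.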

Resolution:

This is a known issue impacting VMware NSX.

Additional Information

If you instead see the error "Management Plane node failed to enter maintenance mode", refer to the following KB article:
During NSX upgrade the Management Plane node failed to enter maintenance mode