"Failed to put node <UUID> in maintenance mode. Please retry" NSX Manager upgrade failure

Article ID: 396371

Products

VMware NSX

Issue/Introduction

  • The NSX Manager upgrade fails with the error:
    "Failed to put node <UUID> in maintenance mode. Please retry the operation after checking 'get group maintenance-mode status' CLI."
  • The NSX Manager log /var/log/upgrade-coordinator/upgrade-coordinator.log contains an entry similar to this example:
    "moduleName":"upgrade-coordinator","errorCode":30062,"errorMessage":"Unexpected error while upgrading upgrade unit: Failed to put node <UUID> in maintenance mode. Please retry the operation after checking 'get group maintenance-mode status' CLI."
  • On the NSX Manager admin CLI, one member reports MAINTENANCE_MODE_FAILED:

    > get group maintenance-mode status

    Group Type: CONTROLLER
    Members:
        UUID           Leadership Work Completed              Group Update Ack Received              Maintenance Mode Status
        <UUID1>               True                                   False                           MAINTENANCE_MODE_FAILED
        <UUID2>               True                                   True                            MAINTENANCE_MODE_OFF
        <UUID3>               True                                   True                            MAINTENANCE_MODE_OFF
     
  • The NSX Manager log /var/log/cloudnet/nsx-ccp.log shows the following:

    • A timeout exception is logged repeatedly:

      <Date>T00:26:46.828Z  WARN CommonDelayedScheduler CorfuClusterService 72670 - [nsx@6876 comp="nsx-controller" level="WARNING" subcomp="corfu-cluster"] Ack not sent to GMLE due to java.util.concurrent.TimeoutException
       
    • The log line "Voting algorithm finished for revision" is present, but "Notify listeners for sharding update with revision" is not logged after it:

      <DATE>T00:15:23.706Z  INFO DeltaSyncSubscriber-6-1 VotingAlgorithm 73873 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="cluster"] Voting algorithm finished for revision 549755813889
      <DATE>T00:15:23.706Z  INFO ForkJoinPool.commonPool-worker-17 ShardingManagerImpl 73873 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="magpie"] Notify listeners for sharding update with revision <ID>
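
The symptom checks above can be sketched as a small offline script. This is a minimal illustration, not a supported tool: the function names `failed_members` and `scan_ccp_log` are hypothetical, the row format is assumed to match the sample CLI output shown above, and the script assumes that any "Notify listeners" line clears the votes logged before it (the source does not say whether the revisions must match).

```python
import re

# Member row of 'get group maintenance-mode status', as in the sample above:
# <uuid> <True|False> <True|False> <MAINTENANCE_MODE_*>
MEMBER_ROW = re.compile(
    r"^\s*(\S+)\s+(True|False)\s+(True|False)\s+(MAINTENANCE_MODE_\w+)\s*$"
)
VOTE = re.compile(r"Voting algorithm finished for revision (\S+)")
NOTIFY = re.compile(r"Notify listeners for sharding update with revision")
ACK_TIMEOUT = "Ack not sent to GMLE due to java.util.concurrent.TimeoutException"


def failed_members(status_output):
    """Return UUIDs whose Maintenance Mode Status is MAINTENANCE_MODE_FAILED."""
    failed = []
    for line in status_output.splitlines():
        match = MEMBER_ROW.match(line)
        if match and match.group(4) == "MAINTENANCE_MODE_FAILED":
            failed.append(match.group(1))
    return failed


def scan_ccp_log(lines):
    """Return (ack timeout count, voted revisions with no later notify line)."""
    timeouts = 0
    pending = []  # revisions voted on but not yet followed by a notify line
    for line in lines:
        if ACK_TIMEOUT in line:
            timeouts += 1
            continue
        match = VOTE.search(line)
        if match:
            pending.append(match.group(1))
            continue
        if NOTIFY.search(line):
            # Assumption: a notify line clears every vote logged before it.
            pending.clear()
    return timeouts, pending
```

To use it, paste the CLI output into `failed_members` and pass the lines of a copy of /var/log/cloudnet/nsx-ccp.log to `scan_ccp_log`. A non-zero timeout count together with a non-empty list of pending revisions is consistent with the symptoms described above, but it does not confirm the cause on its own.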

Environment

  • VMware NSX 4.2.0.x
  • VMware NSX 4.2.1.x
  • VMware NSX 9.0.0

Cause

Due to a JDK issue (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow), the Java ForkJoinPool may incorrectly determine that the total number of ForkJoinPool threads exceeds the limit and block requests for new threads. As a result, the NSX Controller transaction processing thread is blocked.
There are two possible scenarios:

  • Scenario #1: The Controller service is impacted, but it restarts automatically after 2 hours and recovers on its own.
  • Scenario #2: ForkJoinPool.commonPool may become blocked, and the Controller service cannot recover without a manual restart.

Note: This issue is expected to recur depending on the uptime of the Controller service. Medium form factor Managers can experience the issue after 6 weeks of uptime, and Large/Extra Large form factor Managers after more than 3 months.

Resolution

Additional Information