"Failed to put node <UUID> in maintenance mode. Please retry" NSX Manager upgrade failure

Article ID: 396371

Products

VMware NSX

Issue/Introduction

The NSX Manager upgrade fails with the following error:
"Failed to put node <UUID> in maintenance mode. Please retry the operation after checking 'get group maintenance-mode status' CLI."
The NSX Manager log /var/log/upgrade-coordinator/upgrade-coordinator.log contains an entry similar to this example:
"moduleName":"upgrade-coordinator","errorCode":30062,"errorMessage":"Unexpected error while upgrading upgrade unit: Failed to put node <UUID> in maintenance mode. Please retry the operation after checking 'get group maintenance-mode status' CLI."
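To find these entries quickly, a small script can filter the upgrade coordinator log for the error shown above. This is a minimal sketch, assuming each entry keeps the `"errorCode":30062,"errorMessage":"..."` layout from the excerpt; the function name `maintenance_mode_failures` and the shortened sample line are illustrative only.

```python
import re

# Matches the errorCode/errorMessage pair as it appears in the excerpt above.
# Assumption: the real log line keeps these two fields adjacent, in this order.
ERROR_RE = re.compile(r'"errorCode":30062,"errorMessage":"([^"]*)"')


def maintenance_mode_failures(lines):
    """Return the error messages that report a maintenance mode failure."""
    hits = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match and "in maintenance mode" in match.group(1):
            hits.append(match.group(1))
    return hits


# Shortened, made-up sample line based on the excerpt above.
sample = [
    '"moduleName":"upgrade-coordinator","errorCode":30062,'
    '"errorMessage":"Unexpected error while upgrading upgrade unit: '
    'Failed to put node <UUID> in maintenance mode. Please retry"',
    '"moduleName":"upgrade-coordinator","message":"unrelated entry"',
]
print(len(maintenance_mode_failures(sample)))
```

To run it against the real file, pass an open file handle for /var/log/upgrade-coordinator/upgrade-coordinator.log instead of `sample`.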
On the NSX Manager admin CLI, the maintenance mode status of one member is MAINTENANCE_MODE_FAILED:
> get group maintenance-mode status
Group Type: CONTROLLER
Members:
UUID      Leadership Work Completed   Group Update Ack Received   Maintenance Mode Status
<UUID1>   True                        False                       MAINTENANCE_MODE_FAILED
<UUID2>   True                        True                        MAINTENANCE_MODE_OFF
<UUID3>   True                        True                        MAINTENANCE_MODE_OFF
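If the CLI output is saved to a file or captured by a script, the failed member can be picked out programmatically. The following is a minimal sketch, assuming each member row has exactly four whitespace-separated fields as in the example output; the function name `failed_members` is hypothetical.

```python
def failed_members(cli_output: str) -> list[str]:
    """Return the UUIDs whose status is MAINTENANCE_MODE_FAILED."""
    failed = []
    for line in cli_output.splitlines():
        fields = line.split()
        # Member rows carry four fields: UUID, leadership work completed,
        # group update ack received, maintenance mode status.
        # Header and title lines have a different field count and are skipped.
        if len(fields) == 4 and fields[3] == "MAINTENANCE_MODE_FAILED":
            failed.append(fields[0])
    return failed


# Sample text copied from the example output above.
sample = """\
Group Type: CONTROLLER
Members:
UUID Leadership Work Completed Group Update Ack Received Maintenance Mode Status
<UUID1> True False MAINTENANCE_MODE_FAILED
<UUID2> True True MAINTENANCE_MODE_OFF
<UUID3> True True MAINTENANCE_MODE_OFF
"""
print(failed_members(sample))
```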
In the NSX Manager log /var/log/cloudnet/nsx-ccp.log:
- A timeout exception is logged repeatedly
  
  <DATE>T00:26:46.828Z WARN CommonDelayedScheduler CorfuClusterService 72670 - [nsx@6876 comp="nsx-controller" level="WARNING" subcomp="corfu-cluster"] Ack not sent to GMLE due to java.util.concurrent.TimeoutException
- The log line "Voting algorithm finished for revision" is also present; however, "Notify listeners for sharding update with revision" is not logged after it
  
  <DATE>T00:15:23.706Z INFO DeltaSyncSubscriber-6-1 VotingAlgorithm 73873 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="cluster"] Voting algorithm finished for revision 549755813889
  <DATE>T00:15:23.706Z INFO ForkJoinPool.commonPool-worker-17 ShardingManagerImpl 73873 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="magpie"] Notify listeners for sharding update with revision <ID>
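Both symptoms can be checked in one pass over nsx-ccp.log. The sketch below counts the timeout warnings and the "Voting algorithm finished" lines that are not followed by a "Notify listeners" line. It assumes that a notify line, when present, appears before the next voting line; that ordering is an assumption for illustration, not documented behavior. The function name `scan_ccp_log` and the abbreviated sample lines are made up.

```python
VOTE = "Voting algorithm finished for revision"
NOTIFY = "Notify listeners for sharding update with revision"
TIMEOUT = "Ack not sent to GMLE due to java.util.concurrent.TimeoutException"


def scan_ccp_log(lines):
    """Count timeout warnings and voting lines with no notify line after them."""
    timeouts = 0
    unnotified = 0
    pending = False  # True while a voting line is waiting for its notify line
    for line in lines:
        if TIMEOUT in line:
            timeouts += 1
        elif VOTE in line:
            if pending:
                unnotified += 1  # previous vote never got a notify line
            pending = True
        elif NOTIFY in line:
            pending = False
    if pending:
        unnotified += 1  # last vote in the input has no notify line
    return {"timeouts": timeouts, "unnotified_votes": unnotified}


# Abbreviated, made-up lines that only keep the message text.
sample = [
    "... Voting algorithm finished for revision 1",
    "... Notify listeners for sharding update with revision 1",
    "... Ack not sent to GMLE due to java.util.concurrent.TimeoutException",
    "... Voting algorithm finished for revision 2",
]
print(scan_ccp_log(sample))
```

A non-zero `unnotified_votes` together with repeated timeouts is consistent with the symptoms described above, but it is only a hint and does not confirm the root cause on its own.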

Environment

VMware NSX 4.2.0.x
VMware NSX 4.2.1.3 and earlier
VMware NSX 9.0.0

Cause

Due to a JDK issue (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow), the Java ForkJoinPool may incorrectly determine that the total number of ForkJoinPool threads exceeds the limit and block new thread requests. As a result, the NSX Controller transaction processing thread is blocked.
There are two possible scenarios:

Scenario #1: The Controller service is impacted; however, it restarts automatically after 2 hours and recovers on its own.
Scenario #2: ForkJoinPool.commonPool may become blocked and the Controller service cannot recover without a manual restart.

Note: This issue is expected to recur based on the uptime of the Controller service. Medium form factor Managers can experience the issue after 6 weeks of uptime, and Large/Extra Large form factor Managers after more than 3 months.
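The uptime guidance above can be turned into a rough check. This is a sketch only: the thresholds are the approximate figures from the note, "3 months" is approximated as 90 days, and the form factor keys and the function name `in_risk_window` are hypothetical. How the Controller service uptime is obtained is outside the scope of the sketch.

```python
from datetime import timedelta

# Approximate uptime after which the issue can appear, per the note above.
# Assumption: "3 months" is treated as 90 days.
RISK_THRESHOLD = {
    "medium": timedelta(weeks=6),
    "large": timedelta(days=90),
    "extra_large": timedelta(days=90),
}


def in_risk_window(form_factor: str, uptime: timedelta) -> bool:
    """Return True if the Controller uptime has reached the rough threshold."""
    return uptime >= RISK_THRESHOLD[form_factor]


print(in_risk_window("medium", timedelta(weeks=7)))
print(in_risk_window("large", timedelta(weeks=7)))
```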

Resolution

For the resolution and workaround, see the article "NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow".
