Failed to put node <UUID> in maintenance mode. Please retry the operation after checking 'get group maintenance-mode status' CLI.
"moduleName":"upgrade-coordinator","errorCode":30062,"errorMessage":"Unexpected error while upgrading upgrade unit: Failed to put node <UUID> in maintenance mode. Please retry the operation after checking 'get group maintenance-mode status' CLI."
> get group maintenance-mode status
Group Type: CONTROLLER
Members:
UUID       Leadership Work Completed    Group Update Ack Received    Maintenance Mode Status
<UUID1>    True                         False                        MAINTENANCE_MODE_FAILED
<UUID2>    True                         True                         MAINTENANCE_MODE_OFF
<UUID3>    True                         True                         MAINTENANCE_MODE_OFF
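When the status output is captured as text, the member that blocked the upgrade can be picked out programmatically. The following is a minimal sketch, not an official tool: the function names `parse_members` and `failed_members` are hypothetical, the row layout is assumed to match the sample above (UUID, two True/False flags, then a status), and only the two status values shown in this article are known to occur.

```python
import re

# Sample text copied from the CLI output shown in this article.
SAMPLE = """\
Group Type: CONTROLLER
Members:
UUID Leadership Work Completed Group Update Ack Received Maintenance Mode Status
<UUID1> True False MAINTENANCE_MODE_FAILED
<UUID2> True True MAINTENANCE_MODE_OFF
<UUID3> True True MAINTENANCE_MODE_OFF
"""

# Assumed row shape: UUID, work-completed flag, ack-received flag, status.
# The header line does not match because its second token is not True/False.
ROW = re.compile(
    r"^(\S+)\s+(True|False)\s+(True|False)\s+(MAINTENANCE_MODE_\w+)$"
)


def parse_members(text):
    """Return one dict per member row found in the CLI output text."""
    members = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            uuid, work_completed, ack_received, status = match.groups()
            members.append(
                {
                    "uuid": uuid,
                    "leadership_work_completed": work_completed == "True",
                    "group_update_ack_received": ack_received == "True",
                    "status": status,
                }
            )
    return members


def failed_members(text):
    """Return the UUIDs whose status is MAINTENANCE_MODE_FAILED."""
    return [
        m["uuid"]
        for m in parse_members(text)
        if m["status"] == "MAINTENANCE_MODE_FAILED"
    ]


if __name__ == "__main__":
    print(failed_members(SAMPLE))
```

In the sample, the failed member is also the only one whose Group Update Ack Received flag is False, which is consistent with the "Ack not sent to GMLE" warning in the controller log.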
<Date>T00:26:46.828Z WARN CommonDelayedScheduler CorfuClusterService 72670 - [nsx@6876 comp="nsx-controller" level="WARNING" subcomp="corfu-cluster"] Ack not sent to GMLE due to java.util.concurrent.TimeoutException
<DATE>T00:15:23.706Z INFO DeltaSyncSubscriber-6-1 VotingAlgorithm 73873 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="cluster"] Voting algorithm finished for revision 549755813889
<DATE>T00:15:23.706Z INFO ForkJoinPool.commonPool-worker-17 ShardingManagerImpl 73873 - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="magpie"] Notify listeners for sharding update with revision <ID>
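To check whether a controller log contains the acknowledgement timeout shown above, the log can be scanned for that warning. This is only a sketch under stated assumptions: the helper name `find_ack_timeouts` is hypothetical, the pattern is derived solely from the sample line in this article, and the log file location is not given here, so the path must be supplied by the caller.

```python
import re
import sys

# Pattern built from the WARN line quoted in this article; other builds
# may word the message differently (assumption).
ACK_TIMEOUT = re.compile(
    r'subcomp="corfu-cluster"\].*Ack not sent to GMLE due to '
    r"java\.util\.concurrent\.TimeoutException"
)


def find_ack_timeouts(lines):
    """Return the log lines that report the GMLE ack timeout."""
    return [line.rstrip("\n") for line in lines if ACK_TIMEOUT.search(line)]


if __name__ == "__main__":
    # Usage: python find_ack_timeouts.py <path-to-controller-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for hit in find_ack_timeouts(handle):
            print(hit)
```

A match on its own does not confirm JDK-8330017; it only shows that the node failed to send its acknowledgement, which is one of the symptoms described here.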
Due to an issue in the JDK (JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow), the Java ForkJoinPool may incorrectly determine that its total thread count is over the limit and block new thread requests. As a result, the NSX Controller transaction processing thread is blocked.
There are two possible scenarios:
Note: This issue is expected to recur, depending on the uptime of the Controller service. Medium form factor Managers can experience the issue after 6 weeks, and Large/Extra Large form factor Managers after more than 3 months.
For the resolution and workaround, see "NSX is impacted by JDK-8330017: ForkJoinPool stops executing tasks due to ctl field Release Count (RC) overflow".