All NES servers appear as unreachable, even though they are up and running


Article ID: 409246


Updated On:

Products

CA Release Automation - Release Operations Center (Nolio)
CA Release Automation - DataManagement Server (Nolio)

Issue/Introduction

When checking the state of the NES machines, all of them may appear as unreachable. The NAC logs contain the following messages, which confirm that state for every NES:

 

[ExecutionServerStatusTask-31056] DEBUG (com.nolio.platform.server.dataservices.services.execmng.ExecutionServerStatusMonitor$ExecutionServerStatusTask:92) - Update execution servers status

[ExecutionServerStatusTask-31056] INFO  (com.nolio.platform.server.dataservices.services.execmng.CheckExecutionConnectivityImpl:128) - ES [es_nesServer1] is unreachable.

[ExecutionServerStatusTask-31056] INFO  (com.nolio.platform.server.dataservices.services.execmng.CheckExecutionConnectivityImpl:128) - ES [es_nesServer2] is unreachable.

[ExecutionServerStatusTask-31056] INFO  (com.nolio.platform.server.dataservices.services.execmng.CheckExecutionConnectivityImpl:128) - ES [es_nesServer3] is unreachable.

[ExecutionServerStatusTask-31056] INFO  (com.nolio.platform.server.dataservices.services.execmng.CheckExecutionConnectivityImpl:128) - ES [es_nesServer4] is unreachable.

[ExecutionServerStatusTask-31056] INFO  (com.nolio.platform.server.dataservices.services.execmng.CheckExecutionConnectivityImpl:128) - ES [es_nesServer5] is unreachable.

[ExecutionServerStatusTask-31056] INFO  (com.nolio.platform.server.dataservices.services.execmng.CheckExecutionConnectivityImpl:128) - ES [es_nesServer6] is unreachable.

[ExecutionServerStatusTask-31056] INFO  (com.nolio.platform.server.dataservices.services.execmng.CheckExecutionConnectivityImpl:128) - ES [es_nesServer7] is unreachable.

.

.
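When the NAC log is long, it can help to pull out just the affected server names. The following is a minimal Python sketch, assuming the log lines follow the "ES [name] is unreachable." format shown above; the function name `unreachable_servers` is hypothetical and not part of the product.

```python
import re

# Matches the INFO line format shown above; the server name sits inside "ES [...]".
UNREACHABLE = re.compile(r"ES \[([^\]]+)\] is unreachable\.")


def unreachable_servers(log_lines):
    """Return the distinct execution server names reported as unreachable,
    in the order they first appear in the log."""
    seen = []
    for line in log_lines:
        match = UNREACHABLE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

If the returned list covers every NES in the environment, the problem is more likely on the NAC side than on the individual NES machines.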

Environment

Release Automation 6.8 and above.

Resolution

When checking the ActiveMQ log on the NAC, the following warning message can be seen:

[ActiveMQ VMTransport: vm://brokerNacServer#64509-1] WARN  (org.apache.activemq.broker.region.BaseDestination:718) - Usage(default:store) percentUsage=102%, usage=2202009600, limit=2147483648, percentUsageMinDelta=1%: Persistent store is Full, 100% of 2147483648. Stopping producer (brokerNacServer->brokerNesServer-58450-1756076910722-16634:2:1:1) to prevent flooding queue://flow_eventsQueue. See http://activemq.apache.org/producer-flow-control.html for more info (blocking for: 187s)

This means that the NAC is flooded with ActiveMQ messages from the NES servers, which are stored on disk until they are processed or discarded. The default disk space limit is 2 GB (2147483648 bytes); once the store is full, the NAC's ActiveMQ broker stops accepting new messages. This explains why all NES servers appear as "offline" in the UI: they cannot deliver KeepAlive messages to the NAC, so the NAC concludes that all NES servers are disconnected.
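To confirm that a given log contains this condition, the usage figures can be extracted from the warning line. This is a minimal Python sketch, assuming the warning keeps the "Usage(default:store) percentUsage=..., usage=..., limit=..." layout shown above; `parse_store_usage` is a hypothetical helper name.

```python
import re

# Captures the three numeric fields of the store-usage warning shown above.
STORE_WARNING = re.compile(
    r"Usage\(default:store\) percentUsage=(\d+)%, usage=(\d+), limit=(\d+)"
)


def parse_store_usage(line):
    """Return the store usage figures from an ActiveMQ warning line,
    or None if the line is not a store-usage warning."""
    match = STORE_WARNING.search(line)
    if match is None:
        return None
    percent, usage, limit = (int(group) for group in match.groups())
    return {
        "percent": percent,
        "usage": usage,
        "limit": limit,
        # True when the persistent store has reached or exceeded its limit.
        "over_limit": usage >= limit,
    }
```

For the sample line above, the parsed usage (2202009600 bytes) exceeds the limit (2147483648 bytes), which matches the reported 102%.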

As a temporary measure, this limit can be increased by raising the `storeUsage` value in the `webapps/datamanagement/WEB-INF/activemq-broker-nac.xml` file, in this section:

    <amq:systemUsage>
        <amq:systemUsage>
            <amq:memoryUsage>
                <amq:memoryUsage limit="256 mb"/> <!--  Lower memory limits might lead to hanging consumers. See Jira AMQ-5202 -->
            </amq:memoryUsage>
            <amq:storeUsage>
                <amq:storeUsage limit="2 gb"/>
            </amq:storeUsage>
            <amq:tempUsage>
                <amq:tempUsage limit="500 mb"/>
            </amq:tempUsage>
        </amq:systemUsage>
    </amq:systemUsage>

 

This change requires a restart of the NAC server.
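Before and after editing the file, the configured value can be checked against the limit reported in the log. The following is a rough Python sketch that reads the `storeUsage` limit from the file's text and converts it to bytes. It assumes the limit is written as a number with an optional `kb`, `mb` or `gb` suffix interpreted in powers of 1024, which is consistent with "2 gb" appearing as 2147483648 in the warning; `store_limit_bytes` is a hypothetical helper name.

```python
import re

# Matches the self-closing inner element, e.g. <amq:storeUsage limit="2 gb"/>.
STORE_LIMIT = re.compile(r'<amq:storeUsage\s+limit="([^"]+)"\s*/>')

# Assumed unit multipliers (powers of 1024).
UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def store_limit_bytes(xml_text):
    """Return the configured store limit in bytes, or None if the
    storeUsage element is not present in the given text."""
    match = STORE_LIMIT.search(xml_text)
    if match is None:
        return None
    value = match.group(1).strip().lower()
    parts = re.fullmatch(r"(\d+)\s*(kb|mb|gb)?", value)
    if parts is None:
        raise ValueError(f"Unrecognized limit value: {value!r}")
    # A missing suffix is treated as a plain byte count.
    return int(parts.group(1)) * UNITS.get(parts.group(2), 1)
```

A text-based check like this only reads the file; the edit itself is still made by hand as described above.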

The other option is to use the JMX console on the NAC: under the MBean "org.apache.activemq:type=Broker,brokerName=brokerNacServer", the "StoreLimit" attribute has the default value "2147483648". It can be increased to 4 GB (4294967296).
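Because "StoreLimit" is expressed in bytes, the value to enter can be worked out as follows. This is a small Python sketch of the arithmetic only; the names `GIB` and `gib_to_bytes` are illustrative.

```python
# One gibibyte in bytes (1024 * 1024 * 1024).
GIB = 1024 ** 3


def gib_to_bytes(gib):
    """Convert a size in GiB to the byte count used by StoreLimit."""
    return gib * GIB


default_limit = gib_to_bytes(2)  # the default StoreLimit
raised_limit = gib_to_bytes(4)   # the increased value suggested above
```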

Additional Information

This behavior can occur when too many tasks perform daily activities on the NES servers (for example, a local restart of the service), each of which generates messages to the NAC server.