From time to time, unexpected scenarios arise in the environment. When they occur, they can often cause a connectivity problem between the NAC (management server) and the NES (execution server). Good examples of these unexpected scenarios include:
Here, we will discuss some easy methods to discern if any of the scenarios just mentioned, and others, may have caused any corruption within the vital framework used for JMS (messaging) between the NAC and NES components.
So what is this framework, and how exactly do we determine whether a problem resides within this subsystem of Release Automation?
Operating System: N/A
Database: N/A
Release Automation Version: 5.0.X -> 6.8.X
The framework we are referring to is ActiveMQ (AMQ), a high-performance message broker used only by the NAC and NES (management and execution) server components, for all communication between them. ActiveMQ keeps a persistence store on disk with predefined parameters. From time to time, for example during a disk outage, AMQ is unable to write messages to the store, and the resulting corruption can show up in a variety of behaviors.
The most common symptom is that the execution server cannot connect to the management server, even though the execution server context appears to have started. In almost every case, two log files are the place to look:
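Because the persistence store lives on disk, a quick look at the free space on the volume that holds it can rule out the simplest cause. The following is a minimal sketch, not part of the product: the function names and the threshold are assumptions for illustration, and the directory to check depends on your installation.

```python
import shutil


def usable_mb(path):
    """Return the free space on the volume holding `path`, in whole megabytes."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def has_room(path, required_mb):
    """Return True if at least `required_mb` megabytes are free at `path`.

    `required_mb` is a value you choose (an assumption of this sketch),
    not a setting read from ActiveMQ.
    """
    return usable_mb(path) >= required_mb
```

For example, `has_room(r"C:\path\to\installdir", 500)` reports whether the volume holding that (hypothetical) directory has at least 500 MB free.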
NAC: active_mq_nac.log (found in %installdir%\logs)
NES: active_mq_nes.log (found in %installdir%\logs)
Look these over carefully for WARN and ERROR priority log entries, specifically errors concerning a missing or corrupt index and IO exceptions, such as this example:
2016-10-31 11:20:11,167 [LevelDB IOException handler.] INFO (org.apache.activemq.util.DefaultIOExceptionHandler:155) - Stopping BrokerService[brokerNacServer] due to exception, java.io.IOException: Short write
The above is typical when disk space has been exhausted, as well as:
2016-10-31 11:02:29,541 [ActiveApplicationContextManager-1] ERROR (org.apache.activemq.broker.BrokerService:1985) - Temporary Store limit is 500 mb, whilst the temporary data directory: /opt/ca_lisa/LISAReleaseAutomationServer/activemq-data/brokerNacServer/tmp_storage only has 0 mb of usable space - resetting to maximum available 0 mb.
Interrupted connections between the NAC and NES (for example, a forced reboot after a patch) can make it difficult to establish a healthy connection after the NES has been rebooted. These log entries might be seen at such times:
These are followed by a number of attempts to establish a connection, each of which is refused, like so:
2018-02-27 11:35:51,275 [ActiveMQ Task-2] INFO (org.apache.activemq.network.DiscoveryNetworkConnector:120) - Establishing network connection from vm://brokerNacServer?network=true to ssl://x.x.x.x:61616
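When connection attempts like the one above keep failing, it can help to confirm that the broker port on the other server is accepting TCP connections at all. The sketch below is an illustration only: the function name is hypothetical, and the default port 61616 is taken from the log entry above. It tests plain TCP reachability and does not perform the SSL handshake, so a True result does not prove the broker itself is healthy.

```python
import socket


def broker_port_reachable(host, port=61616, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Only checks that something is listening; no SSL handshake and no
    ActiveMQ protocol exchange takes place.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Run it from the server that initiates the connection, passing the address of the other server as `host`.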
The above examples can cause improper startup or subsequent NAC/NES connection problems.
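Reading through large log files by eye is error-prone, so a small filter can help surface the entries described above. This is a minimal sketch that assumes every entry follows the layout of the examples in this article (timestamp, thread in brackets, level, source in parentheses, message). The keyword list is an assumption chosen to catch entries such as the INFO-level "IOException: Short write" line; adjust it to your needs.

```python
import re

# Layout assumed from the log examples in this article:
# 2016-10-31 11:20:11,167 [thread] LEVEL (class:line) - message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>.*?)\] (?P<level>[A-Z]+)\s+"
    r"\((?P<source>.*?)\) - (?P<message>.*)$"
)

# Hypothetical keyword list; matched case-insensitively against the message.
KEYWORDS = ("ioexception", "corrupt")


def suspicious_entries(lines):
    """Return (timestamp, level, message) for WARN/ERROR or keyword hits."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if match is None:
            continue  # continuation lines such as stack traces are skipped
        message = match.group("message")
        flagged_level = match.group("level") in ("WARN", "ERROR")
        flagged_text = any(k in message.lower() for k in KEYWORDS)
        if flagged_level or flagged_text:
            hits.append((match.group("ts"), match.group("level"), message))
    return hits
```

A typical call would be `suspicious_entries(open(path_to_log, encoding="utf-8", errors="replace"))`, where `path_to_log` points at active_mq_nac.log or active_mq_nes.log.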
The steps to clear out corruption in this scenario, and in any other that hints at persistence store corruption, are as follows:
Assuming this was the issue, it should now (hopefully) be resolved.
Please contact support, available 24/7, if you require any assistance.
A symptom of the NAC not starting correctly could be that the ROC fails to load the login page in the browser.
In some cases, resolving the issue requires stopping both the NAC service and all NES services and then removing all LevelDB folders.
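Before removing anything, it may be worth listing the candidate folders first. The sketch below only reports directories and never deletes them. It assumes, without confirmation from the product documentation, that the relevant folders have "leveldb" somewhere in their name; the function name is hypothetical, and the root to search depends on your installation.

```python
from pathlib import Path


def find_leveldb_dirs(root):
    """Return directories under `root` whose name contains 'leveldb' (any case).

    Dry run only: nothing is modified or deleted. The name match is an
    assumption of this sketch, so review the result before acting on it.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_dir() and "leveldb" in p.name.lower()
    )
```

Review the returned paths with the services stopped, and remove folders only after confirming they belong to the ActiveMQ persistence store.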