The CA Release Automation management servers (also known as NAC or datamanagement) have been installed and set up for high availability (see the Additional Information section below). Every so often, an unexplained failover occurs.
There are two conditions under which a CA Release Automation management server can fail over: time synchronization problems and login requests.
Both scenarios are described in more detail below:
Time Synchronization
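It is extremely important for the date/time on the two management servers to be in sync with each other, down to the same second. The log message shown below indicates why: the passive management server takes over when the active one has not reported aliveness for more than 15000 ms, so a clock difference between the two servers can make a healthy active server appear to have missed that window.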
When this condition occurs you will find a message similar to the following in the nolio_dm_all.log file:
2016-10-04 16:07:06,788 [periodicTasksMasterMonitor-1] INFO (com.nolio.platform.server.dataservices.services.ha.MasterNacService:287) - current master [MasterNac[id=<id_val>, nacNode=NacNode[id=<id_val>, hostname='<masterNacHostname>', ip='<masterNacIpAddress>'], lastIAmAlive=2016-10-04 16:06:51.0, firstIAmAlive=2016-10-04 16:01:07.0, upgradeState=null]] has not reported aliveness for more than 15000 ms.
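The log message above suggests a staleness check of the master's last reported aliveness against a 15000 ms threshold. The following is a minimal sketch of how clock skew alone could trip such a check. It assumes the passive node compares `lastIAmAlive` with its own clock; the function name and the comparison logic are hypothetical illustrations, not the product's actual code.

```python
from datetime import datetime, timedelta

# Threshold taken from the sample log message ("more than 15000 ms").
ALIVENESS_TIMEOUT = timedelta(milliseconds=15000)


def master_looks_dead(last_i_am_alive, passive_now, timeout=ALIVENESS_TIMEOUT):
    """Hypothetical check: is the master's last heartbeat older than the timeout
    when measured against the passive node's own clock?"""
    return passive_now - last_i_am_alive > timeout


# The master reported aliveness 2 seconds ago in real time.
last_alive = datetime(2016, 10, 4, 16, 6, 51)
real_now = datetime(2016, 10, 4, 16, 6, 53)

# Passive clock in sync: the master is considered alive.
in_sync = master_looks_dead(last_alive, real_now)

# Passive clock 20 seconds ahead: the same heartbeat now looks stale.
skewed = master_looks_dead(last_alive, real_now + timedelta(seconds=20))

print(in_sync, skewed)
```

Under this assumed model, any skew that pushes the apparent heartbeat age past the timeout is enough to start a takeover, even though the active server is healthy.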
Login Requests
For purposes of failover, a "login request" is a simple HTTP GET made against: http://yourNolioRaServer:8080/datamanagement/login.jsp
If the passive management server receives a login request, it concludes that it must become the active management server. Login requests are typically handled by a frontend load balancer, which must be configured to send 100% of its traffic to the active node. The load balancer is typically configured to switch which management server it considers active based on the results of HTTP GET requests. In an environment that does not have time synchronization problems, this is the failover method that one should typically see. However, the failover can also happen if someone mistakenly attempts to log in directly to the passive management server.
When this condition occurs you will find a message similar to the following in the nolio_dm_all.log file:
2016-10-04 16:01:07,316 [http-nio-8080-exec-2] INFO (com.nolio.platform.server.dataservices.services.ha.MasterNacIdentifierInterceptor:129) - received new incoming request when I'm not master. Trying to become master before handling request...
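To tell the two causes apart in a large nolio_dm_all.log file, the two sample messages can be searched for by their distinctive text. The following is a rough sketch; the marker substrings are copied from the sample messages in this article, while the function names and cause labels are made up for illustration. The first sample line is abbreviated here.

```python
# Substrings copied from the two sample log messages in this article.
ALIVENESS_MARKER = "has not reported aliveness for more than"
LOGIN_MARKER = "received new incoming request when I'm not master"


def classify_failover(line):
    """Return a label for the failover cause a log line points to, or None."""
    if ALIVENESS_MARKER in line:
        return "aliveness_timeout"  # the message associated with time sync problems
    if LOGIN_MARKER in line:
        return "login_request"
    return None


def scan_log(lines):
    """Yield (timestamp, cause) for every failover-related line."""
    for line in lines:
        cause = classify_failover(line)
        if cause is not None:
            # The timestamp "YYYY-MM-DD HH:MM:SS,mmm" is the first 23 characters.
            yield line[:23], cause


sample = [
    "2016-10-04 16:01:07,316 [http-nio-8080-exec-2] INFO "
    "(com.nolio.platform.server.dataservices.services.ha.MasterNacIdentifierInterceptor:129) "
    "- received new incoming request when I'm not master. "
    "Trying to become master before handling request...",
    "2016-10-04 16:07:06,788 [periodicTasksMasterMonitor-1] INFO "
    "(com.nolio.platform.server.dataservices.services.ha.MasterNacService:287) "
    "- current master [...] has not reported aliveness for more than 15000 ms.",
]

results = list(scan_log(sample))
for timestamp, cause in results:
    print(timestamp, cause)
```

To scan a real file, pass an open file object instead of `sample`, for example `scan_log(open("nolio_dm_all.log"))`.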
Identify the cause for the failover based on the information provided (in the Cause section above) and review any information related to installation and configuration in the Additional Information section below. Adjust the settings and/or behavior to ensure reliable failover.
Note: Time synchronization problems may result in multiple failovers. This is indicated by time sync messages in the logs of the passive (becoming active) management server together with login request messages in the logs of the active (becoming passive) management server. This may happen when the date/time on the passive (becoming active) management server is later than the date/time on the active (becoming passive) management server. Because the load balancer's HTTP GET requests against the management server it considers active do not fail, it continues to forward all of its traffic to the newly passive management server, and those requests in turn cause that server to try to become active again.
We have also seen cases where a server configured to use NTP suddenly has its time changed to something very unexpected, which could trigger a time sync failover. This might reveal itself in the nolio_dm_all.log file as messages with out-of-sequence timestamps.
Example:
Message 1: 2016-10-04 16:01:07,316
Message 2: 2016-10-04 15:26:57,320
Message 3: 2016-10-04 16:01:07,320
Notice how message 2 has a timestamp that is earlier than that of the message before it. This is an indication that either someone manually changed the time on the server or NTP synchronization had a glitch.
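Backward jumps like the one in the example can be found automatically. The following is a small sketch that assumes every log entry starts with a timestamp in the `YYYY-MM-DD HH:MM:SS,mmm` form seen in the sample messages; the function name is hypothetical, and lines that do not start with a timestamp (such as stack trace continuations) are skipped.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def find_backward_jumps(lines):
    """Return (line_number, previous_time, current_time) for each line whose
    timestamp is earlier than the timestamp of the preceding log entry."""
    jumps = []
    previous = None
    for number, line in enumerate(lines, start=1):
        try:
            current = datetime.strptime(line[:23], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # not a timestamped entry, e.g. a continuation line
        if previous is not None and current < previous:
            jumps.append((number, previous, current))
        previous = current
    return jumps


# Timestamps from the example above; the message text is a placeholder.
lines = [
    "2016-10-04 16:01:07,316 message 1",
    "2016-10-04 15:26:57,320 message 2",
    "2016-10-04 16:01:07,320 message 3",
]

jumps = find_backward_jumps(lines)
for number, before, after in jumps:
    print(f"line {number}: clock moved back from {before} to {after}")
```

Small reorderings of a few milliseconds can also come from several threads writing to the same log, so only large jumps such as the one in the example are a likely sign of a clock change.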
More information related to setting up High Availability can be found here:
Install to Provide High Availability: https://docops.ca.com/ca-release-automation/6-3/en/installation/install-to-provide-high-availability/
Architecture and Implementation: https://communities.ca.com/docs/DOC-231165900
CA Release Automation HA Configuration: https://communities.ca.com/docs/DOC-231172193
CA-Release-Automation-Artifactory-HA-Best-PracticesV2.5: https://communities.ca.com/docs/DOC-231153988
Apply Patches to a High Availability Installation: https://docops.ca.com/ca-release-automation/6-3/en/installation/install-to-provide-high-availability/apply-patches-to-a-high-availability-installation
Execution Server High Availability Installation and Scalability: https://docops.ca.com/ca-release-automation/6-3/en/installation/install-to-provide-high-availability/execution-server-high-availability-and-scalability
Upgrade a High Availability Deployment: https://docops.ca.com/ca-release-automation/6-3/en/installation/install-to-provide-high-availability/upgrade-a-high-availability-deployment