CARA Failover Issue: Two NACs Active At Same Time

Article ID: 205431

Products

CA Release Automation - Release Operations Center (Nolio)

CA Release Automation - DataManagement Server (Nolio)

Issue/Introduction

The Highly Available (HA) NACs experienced a failover. The secondary NAC became the Master NAC but failed to connect to the execution servers. Why? 

To determine whether this scenario matches what you are seeing, see the section "Two NACs Active At Same Time" in the "Additional Information" area at the bottom of the article.

Cause

The root cause is still unclear. Review the Resolution section for recommendations. 

While investigating this issue we found that:

  • The NAC that normally acts as the primary did not acknowledge the change in its role from active/master NAC to passive. This happened while the secondary NAC started the master application context dm, so both NACs were effectively acting as the master/active NAC.
  • Right before this condition occurred, the following SQL-related errors were observed in the nolio_dm_all.log on both the primary and secondary NACs:
    • SQL Error: 983, SQLState: S0001
    • Unable to access database '<db_name>' because its replica role is RESOLVING which does not allow connections. Try the operation again later. ClientConnectionId:d1fde8a2-c208-4ebc-b0fc-52f88b8028f3
    • org.springframework.messaging.MessagingException: Failed to invoke method
      • Caused by: org.springframework.transaction.CannotCreateTransactionException: Could not open JPA EntityManager for transaction; nested exception is javax.persistence.PersistenceException: org.hibernate.exception.SQLGrammarException: Could not open connection
      • Caused by: javax.persistence.PersistenceException: org.hibernate.exception.SQLGrammarException: Could not open connection
      • Caused by: org.hibernate.exception.SQLGrammarException: Could not open connection
      • Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Unable to access database '<db_name>' because its replica role is RESOLVING which does not allow connections. Try the operation again later. ClientConnectionId:d1fde8a2-c208-4ebc-b0fc-52f88b8028f3
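
To check whether a NAC logged the same database errors, the log can be scanned for the messages quoted above. The following is a minimal sketch, not a supported tool: the function name `find_db_errors` is hypothetical, and the patterns are only the strings shown in this article.

```python
import re
from typing import Iterable, List, Tuple

# Patterns copied from the error lines quoted in this article.
DB_ERROR_PATTERNS = [
    re.compile(r"SQL Error: 983\b"),
    re.compile(r"replica role is RESOLVING"),
    re.compile(r"Could not open connection"),
]


def find_db_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for each line matching a known pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in DB_ERROR_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

For example, open `logs/nolio_dm_all.log` on each NAC (the path relative to the NAC install folder is an assumption), pass the file object to `find_db_errors`, and compare the results from the two NACs.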

 

Environment

Release : 6.7.0.b124

Component : CA RELEASE AUTOMATION RELEASE OPERATIONS CENTER

Database: MSSQL 2019 with Always On Availability Group

Resolution

Diagnosing and resolving the issue are two separate activities. If you need to diagnose the root cause, first review and capture the data outlined in the "Diagnosing" section of the Additional Information area (below), and only then apply the resolution steps that follow.

To resolve the problem:

  • Shut down both NACs and delete the folder: NAC_HOME/activemq-data/nac/LevelDB
  • Shut down your Execution Servers and delete the folder: NES_HOME/activemq-data/nes/LevelDB
  • Start one of the NACs. The NAC you start should not have any connectivity issues. If it does, you may need to repeat this process and start the other NAC instead.
  • Start the NES.
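
The folder cleanup in the first two steps could be scripted along the following lines. This is only a sketch: the function names are hypothetical, the home folder is whatever NAC_HOME or NES_HOME resolves to on your server, and it assumes the NAC or NES service has already been shut down. It defaults to a dry run so nothing is deleted unless you ask for it.

```python
import shutil
from pathlib import Path


def leveldb_dir(home: str, component: str) -> Path:
    """Build HOME/activemq-data/<component>/LevelDB ('nac' or 'nes')."""
    return Path(home) / "activemq-data" / component / "LevelDB"


def remove_leveldb(home: str, component: str, dry_run: bool = True) -> bool:
    """Delete the LevelDB folder; return True if the folder was found.

    With dry_run=True (the default) the folder is only located, not deleted.
    Run this only after the corresponding service has been shut down.
    """
    target = leveldb_dir(home, component)
    if not target.is_dir():
        return False
    if not dry_run:
        shutil.rmtree(target)
    return True
```

Calling `remove_leveldb(nac_home, "nac", dry_run=False)` on each NAC and `remove_leveldb(nes_home, "nes", dry_run=False)` on each Execution Server corresponds to the two delete steps; starting the services remains a manual step.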

Additional Information

Diagnosing

Information from each of the MSSQL Servers participating in Always On Availability Group:

  • SQL error log files
  • WSFC log
  • CLUSTER.LOG
  • sp_who2
  • select * from sys.dm_exec_connections
  • select * from master_nac;
  • select * from nac_nodes;
  • netstat -aon

 

Information from both NACs participating in HA setup:

  • logs folder
  • conf/version
  • conf/nacNodeId
  • webapps/datamanagement/WEB-INF/distributed.properties
  • webapps/datamanagement/WEB-INF/database.properties
  • values from the following JMX calls:
    • http://NolioServerName:20203/mbean?objectname=noliocenter:type=HighAvailability -> currentMasterNac
    • http://NolioServerName:20203/mbean?objectname=noliocenter:type=HighAvailability -> getImAliveGracePeriod
    • http://NolioServerName:20203/mbean?objectname=noliocenter:type=HighAvailability -> getContextManagerState
    • http://NolioServerName:20203/mbean?objectname=noliocenter:type=HighAvailability -> getMonitoringDelay
    • http://NolioServerName:20203/mbean?objectname=noliocenter:type=DataSource -> screenshot of attribute values
    • http://NolioServerName:20203/mbean?objectname=noliocenter:type=DataSource -> getNumActive

      Note: The DataSource information may not be available from the secondary NAC. Try it anyway: it is available from a primary NAC, and in the problem described in this KB article both NACs behave as if they are the primary NAC. If it is available from both NACs, provide the info from both. If it is not, do not get hung up on this item; move on to collecting the next piece of info.

  • netstat -aon from the database servers 
  • date/time output from command prompt
    Note: Run this command on each system to check whether the systems' date/time are out of sync and, if so, by how much. A failover can occur if the two systems' clocks are out of sync by more than 15 seconds.
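
The 15-second comparison in the note above can be expressed as a small check. This is an illustrative sketch with hypothetical names; it assumes the two timestamps were captured at the same moment on the two systems and are already parsed into `datetime` values in the same time zone.

```python
from datetime import datetime, timedelta

# Threshold stated in the note above.
SKEW_LIMIT = timedelta(seconds=15)


def clock_skew(time_a: datetime, time_b: datetime) -> timedelta:
    """Absolute difference between two clock readings."""
    return abs(time_a - time_b)


def skew_may_trigger_failover(
    time_a: datetime, time_b: datetime, limit: timedelta = SKEW_LIMIT
) -> bool:
    """True when the readings differ by more than the limit."""
    return clock_skew(time_a, time_b) > limit
```

Because the readings are taken by hand, any delay between running the command on each system adds to the apparent skew, so treat a result close to the limit with caution.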

 

Information from Load Balancer monitoring/testing the datamanagement/availability URL of the mgmt servers:

  • Logs showing:
    • the URL test results
    • where the requests were sent

       

Two NACs Active At Same Time

Nolio supports Active/Passive High Availability NACs, not Active/Active.

In this specific scenario, the information from the database and the information from the NACs are equally important. Having only one will likely be insufficient because of the nature of the problem: both NACs running as the active/master NAC. Both matter because the primary and secondary NACs each query the database (every 1 second) to determine whether they are the master NAC. Each NAC does this by comparing the id in the master_nac table to the conf/nacNodeId value on its local NAC server. If the values are the same, the NAC continues running as the active NAC. If they are different, the NAC either shuts down or starts the "master application context dm", depending on whether it was the master NAC or not. This is how each NAC determines which role it is supposed to play in the active/passive HA setup that Nolio offers.
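
The comparison described above can be pictured with a simplified model. This is one reading of the paragraph, not the product's implementation: the names are hypothetical, the node ids are passed in as plain integers, and the conditions under which a passive NAC forces itself to become master (for example, grace-period handling) are not described in this article and are not modeled.

```python
from enum import Enum


class RoleAction(Enum):
    CONTINUE_AS_MASTER = "continue running as the active NAC"
    START_MASTER_CONTEXT = "start the master application context"
    STOP_MASTER_CONTEXT = "stop the master application context"
    REMAIN_PASSIVE = "remain passive"


def decide_role_action(
    local_node_id: int, master_node_id: int, was_master: bool
) -> RoleAction:
    """Compare the local conf/nacNodeId value with the id from master_nac.

    Simplified assumption: the NAC is master exactly when the ids match.
    """
    is_master_now = local_node_id == master_node_id
    if is_master_now and was_master:
        return RoleAction.CONTINUE_AS_MASTER
    if is_master_now:
        return RoleAction.START_MASTER_CONTEXT
    if was_master:
        return RoleAction.STOP_MASTER_CONTEXT
    return RoleAction.REMAIN_PASSIVE
```

In this model, two NACs can both act as master only if they see different values for the master id, which is why the database-side data matters as much as the NAC-side data.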

If the NAC was the master and detects that its role has changed to passive, it logs these messages:

  • first I Am Alive report time of new master is [2020-12-18 15:21:19.267].
  • last I Am Alive report time of new master is [2020-12-18 15:21:19.267].
  • I am no longer the master NAC. Master is MasterNac[id=1, nacNode=NacNode[id=10008, hostname='<hostname of New NAC>', ip='<IP Address Of New NAC>'], lastIAmAlive=2020-12-18 15:21:19.267, firstIAmAlive=2020-12-18 15:21:19.267, upgradeState=DONE]
  • stopping master application context...

If the NAC was passive and its role has changed to master, it logs these messages:

  • an attempt is made to force this NAC to be master.
  • I became the master NAC.
  • forced this NAC to be master successfully.
  • starting master application context...

Since both NACs should be using the same DB, there should never be a time when the NACs are out of sync about which role each is playing. Any time one NAC logs one of these sets of messages, the other NAC should log the other set. If it does not, that indicates the NACs are getting different results from the database and that both are trying to behave as the active/master NAC at the same time (which is unsupported).
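
The pairing rule above (a takeover on one NAC should be matched by a step-down on the other) can be checked roughly by counting the two marker messages in each NAC's log. This is a sketch with hypothetical function names; it matches only the message text quoted in this article and assumes both logs have been trimmed to the same failover window, since a message logged outside that window would have no counterpart.

```python
from typing import Iterable, Tuple

# Marker messages quoted in this article.
BECAME_MASTER = "I became the master NAC."
NO_LONGER_MASTER = "I am no longer the master NAC."


def count_role_events(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (takeovers, step_downs) found in one NAC's log lines."""
    takeovers = 0
    step_downs = 0
    for line in lines:
        if BECAME_MASTER in line:
            takeovers += 1
        elif NO_LONGER_MASTER in line:
            step_downs += 1
    return takeovers, step_downs


def roles_out_of_sync(log_a: Iterable[str], log_b: Iterable[str]) -> bool:
    """True when a takeover on one NAC has no matching step-down on the other."""
    a_takeovers, a_step_downs = count_role_events(log_a)
    b_takeovers, b_step_downs = count_role_events(log_b)
    return a_takeovers != b_step_downs or b_takeovers != a_step_downs
```

A `True` result is only a hint that the two NACs disagreed about the master role during the window; confirm it against the master_nac table contents and the database logs collected in the "Diagnosing" section.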