How to set up a dedicated secondary hub (failover hub) for the Primary Hub using the HA probe
When the primary goes down, the secondary system brings up the services the HA probe has defined in its cfg file. Those probes have no idea of the 'state' of the primary; they are not HA-aware, so to speak. If the cfg on the secondary is the same as on the primary (it may well not be, for various reasons), the same alarms will be generated. How the nas deals with those alarms depends on how it has been set up.
The HA probe does not have a synchronization process, only a heartbeat to the primary. If the heartbeat fails based on the configured intervals, it starts activating probes and queues. If you have an HA pair of hubs, we recommend that no 'monitoring' probes be present on either hub, only the core infrastructure probes.
Important: Note that the HA probe does not synchronize cfg files. It simply starts and stops queues and probes. Alarm synchronization is done by the nas probes internally; the HA probe plays no part in that process.
Note that any preprocessing (filters) needs to take place at the originating nas. If the secondary hub is a pure HA hub (passive), it does no pre-processing while in the passive state: all robots reporting directly to one of the HA pair report to the primary/active hub up until that hub fails. The primary/active hub applies its active pre-processing rules/filters to the incoming alarms from the robots and probes reporting directly to it.
On failover, the HA probe activates the nas AO (Auto Operator) with the same profiles, filters, triggers, and scripts as on the primary; the robots switch over to the secondary, and alarm processing continues as before. If the secondary is actually an active node that is also there to load-balance robots, the scenario is more complex. Here are some reminders and pointers on what you will need to do to configure your HA environment:
Instructions
1. Deploy all the probes that you will want running on the secondary hub during a failover. Some examples are...
- data_engine for inserting QoS into the database
- nas for alarm access and processing
- distsrv for access to archive
- emailgtw for the ability to email alarms
- sla_engine for the ability to continue calculating your SLAs
Here are some considerations...
- create an ATTACH queue called data_engine. Select Type=attach and Subject=QOS_MESSAGE, QOS_DEFINITION. This queue is used by the data_engine probe to pull QoS messages and insert them into the database.
- create a queue called nas. Select Type=attach and Subject=alarm. This queue is used by the nas probe to collect and process alarms.
- If you have other hubs that send QoS/alarms to the primary, you will need to make the appropriate changes here: if the remote hubs have attach queues, the primary must have a GET queue to pull the data from the remote side, so you will also need to create a matching GET queue on the HA hub.
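As an illustration, the two attach queues above end up in the hub's hub.cfg roughly as follows. This is a sketch only: key names can vary slightly by hub version, and queues are normally created through the hub GUI in Infrastructure Manager rather than by hand-editing the file.

```
<queues>
   <data_engine>
      active = yes
      type = attach
      subject = QOS_MESSAGE, QOS_DEFINITION
   </data_engine>
   <nas>
      active = yes
      type = attach
      subject = alarm
   </nas>
</queues>
```

Any GET queue you add on the HA hub would appear in the same section with type = get and the address of the remote hub it pulls from.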
Notes on 'standby' hub configuration:
- nas probe on the standby hub -> enabled
- Replication -> enabled and set to bidirectional
- AO -> enabled
- GET queue on the secondary -> disabled
- nis bridge -> disabled
HA function - technical details
The HA probe determines whether it needs to fail over by performing a "ping" of the primary hub. It uses the nametoip callback to find the IP for the hub it is a failover for.
So, one thing to check is to run a nametoip callback against the controller and/or hub of the failover/secondary hub to see what it thinks the primary's IP is. It could be a bad hubs.sds file; we have seen a bad hubs.sds file cause the HA probe to fail over because nametoip returned an incorrect IP.
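For example, the callback can be issued with the pu command-line utility. Everything in angle brackets below is a placeholder; substitute your own credentials, domain, hub, and robot names.

```
pu -u <user> -p <password> /<Domain>/<SecondaryHub>/<SecondaryRobot>/hub nametoip <PrimaryHubName>
```

The returned IP should match the primary hub's actual address; if it does not, suspect a stale hubs.sds file on the secondary.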
Also, check network connectivity. Run a ping from one hub to the other for a few hours and see whether there are any periods of dropped packets. Run a tracert. Check name resolution via nslookup.
Unexpected failover to the secondary hub is often caused by network connectivity or environmental issues.
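A minimal set of such checks run from the secondary hub might look like this on Windows (the hostname is a placeholder for your primary hub):

```
:: continuous ping; stop with Ctrl+C and review the output for drops
ping -t <primary_hub_host>
:: trace the route to the primary
tracert <primary_hub_host>
:: verify name resolution
nslookup <primary_hub_host>
```

Leaving the ping running for a few hours and redirecting its output to a file makes intermittent drops easy to spot after the fact.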
Testing HA Failover
After you configure the HA probe on the secondary hub and bring down the primary to test the failover, you may see the message 'Failover is being initiated...', but then the data_engine probe does not start and turns red. Alarms are generated, and in the alarm console you see data_engine alarm messages like:
"Data_engine failed - Probe 'data_engine' (command = data_engine.exe) returns no-restart code (42)"
"No valid SLM-QOS license was found"
As a workaround, you can log on to the secondary hub in Infrastructure Manager, click the Archive node, and copy/paste the license.
Make sure you have configured the distsrv on the primary server to forward All probes as well as All Licenses to the secondary hub as part of the configuration.
This Article assumes you are using the latest version of the HA probe or one appropriate for a currently supported version of CA UIM.
Note on UMP
We do not have any probe or script to handle failover of UMP. If the primary hub fails, you must manually point wasp to the HA hub, as there is no automated UMP failover.
For 'load balancing' of UM