Upon HA failover, secondary hubs did not switch over as expected.

Article ID: 143748

Products

NIMSOFT PROBES DX Infrastructure Management

Issue/Introduction

When failover occurs, the secondary hubs do not connect to the Standby (HA) hub. Even after the Nimsoft Robot Watcher service is back up on the Primary, the child hub does not return to the Primary and remains lost until the Nimsoft Robot Watcher service is manually restarted on that hub.

Logs in the HA hub show messages like this one:

hub: Queue 'From_Secondary_hub_QOS_Message' failed to connect to Hub addr=/Primary_domain/Secondary_hub/secondary/hub
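To see which queues and hub addresses are affected, lines like the one above can be pulled out of the hub log with a short script. The following is a minimal sketch that assumes the log lines follow the exact wording of the sample message; the function name `find_queue_failures` is hypothetical and not part of the product.

```python
import re

# Pattern modeled on the sample message above; other log formats are not covered.
QUEUE_FAILURE = re.compile(
    r"Queue '(?P<queue>[^']+)' failed to connect to Hub addr=(?P<addr>\S+)"
)


def find_queue_failures(log_lines):
    """Return (queue_name, hub_address) pairs for each matching log line."""
    failures = []
    for line in log_lines:
        match = QUEUE_FAILURE.search(line)
        if match:
            failures.append((match.group("queue"), match.group("addr")))
    return failures


# The sample line from this article, split across two string literals.
sample = [
    "hub: Queue 'From_Secondary_hub_QOS_Message' failed to connect to Hub "
    "addr=/Primary_domain/Secondary_hub/secondary/hub",
]
print(find_queue_failures(sample))
```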

Environment

Release : 8.51, 9.02, 9.2.0

Component : UIM - HA

Resolution

When a hub comes up, it sends broadcasts (real UDP) on ports 48000 and 48002. Robots that were previously registered with a Primary hub and receive this broadcast from their Primary hub should then move back to that hub.
 
That said, broadcasts are not allowed in many environments, so the hub probe has a fallback mechanism to account for this:

The hub keeps an internal list of all the robots that have been connected to it, and it starts querying for these robots once it is back online. The hub then sends a 'checkin' command to all the robots in this list, and these robots connect to their primary hub again (this of course requires that the robot was registered to that hub before the hub went down).
 
If the spooler is not able to send alarms (or QoS messages) to its primary hub, it will switch over to the secondary immediately. There is no setting to delay this.

Note that if the spooler has nothing to send, there may be a long delay before it switches over.

When the Primary is back up, it sends a broadcast to all robots. All robots that receive this broadcast switch back immediately.
 
The hub then polls the remaining robots one by one, which may take some time.

- Using Raw Configure mode, under the hub section on both the Primary hub and the Secondary (HA) hub, make sure the following parameter is set:

   broadcast_on = yes
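If a copy of the hub configuration file is available, the setting can also be checked with a script. This is a minimal sketch, assuming the file uses the sectioned `<hub> ... </hub>` layout with `key = value` lines; the helper name `read_cfg_value` and the sample text (including the `hubname` value) are illustrative only.

```python
def read_cfg_value(cfg_text, section, key):
    """Return the value of `key` directly inside top-level <section>, or None."""
    stack = []  # names of the sections currently open
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("</"):
            if stack:
                stack.pop()
        elif line.startswith("<") and ">" in line:
            stack.append(line[1:line.index(">")].strip())
        elif "=" in line and stack == [section]:
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
    return None


# Illustrative sample text, not taken from a real hub.
sample_cfg = """
<hub>
   hubname = Primary_hub
   broadcast_on = yes
</hub>
"""
print(read_cfg_value(sample_cfg, "hub", "broadcast_on"))
```

Running the same check against the configuration of both the Primary and the Secondary (HA) hub shows whether the parameter is missing or set to something other than `yes` on either side.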

- Upon failover, if the switch-over still doesn't work as expected, make sure you define a static hub entry on the Primary for the Secondary (HA) hub, and vice versa, if this wasn't done previously.

In this case, the underlying problem was that the remote hub did not remain available after the Primary was stopped and failed over to the Secondary.

Remote hubs do not "switch over" to an upstream HA secondary when the Primary fails. Instead, the queue connections are switched by the HA probe on the Secondary according to its configuration.

The only change made was to add the Primary hub and the Secondary hub (IP addresses/hostnames) to Name Services on the remote hub. After that, all nodes remained green on failover except for the Primary (as expected).
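When hostnames rather than IP addresses are entered, it may help to confirm first that the remote hub machine can resolve them. The sketch below is a generic operating-system level check, not a product feature; the function name `unresolved_hosts` is hypothetical, and `localhost` stands in for the actual Primary and Secondary hub hostnames.

```python
import socket


def unresolved_hosts(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames that cannot be resolved from this machine."""
    failed = []
    for host in hostnames:
        try:
            resolver(host, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed


# Replace "localhost" with the Primary and Secondary hub hostnames.
print(unresolved_hosts(["localhost"]))
```

An empty list means every name resolved. Any hostname that is returned would need a DNS or hosts-file fix, or could be entered as an IP address instead.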