Tunnel redundancy for failover using HA probe

Article ID: 389253


Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

  • The "HA" (High Availability) probe is traditionally used to ensure availability of the Primary Hub core services (NAS, data_engine, etc.), and its use for this purpose is well documented.

  • Less well known is that the HA probe can be used with any pair of hubs to provide a layer of redundancy/high availability.

  • A good example of this is providing redundancy for tunnel servers, so that a tunnel client can have more than one path to send data to the primary hub.

  • This document can serve as an example/guide for such a setup.

 

Environment

  • DX UIM - Any Version
  • SSL Tunnels - at least two tunnel servers
  • HA probe

Cause

  • Guidance for HA hubs with tunnels

Resolution

In this example we will focus on the following hubs:

  • ExampleHubName: the primary hub
  • Tunnel-Client-One: a remote tunnel client 
  • Tunnel-Server-Main: a tunnel server designated as the "primary" tunnel server for the remote client
  • Tunnel-Server-Backup: a tunnel server designated as the "standby" or HA hub for the remote client 


Tunnel Redundancy

In this environment, Tunnel-Client-One has two client connections defined - one for each of the tunnel servers. These tunnel servers can be located in completely different datacenters/locations as long as the client can reach both of them over the network.
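As an illustration, the client side of such a setup looks roughly like the following in the hub's raw configuration (hub.cfg). This is a sketch only: the tunnel client entries are normally created through the hub GUI, which also writes the certificate fields (omitted here), and the server names are placeholders. Port 48003 is the default tunnel server port, but yours may differ.

```
<tunnel>
   active = yes
   <client>
      <1>
         active = yes
         server = tunnel-server-main.example.com
         port = 48003
      </1>
      <2>
         active = yes
         server = tunnel-server-backup.example.com
         port = 48003
      </2>
   </client>
</tunnel>
```

With both client entries active, both tunnels stay connected at all times, which is what allows the near-immediate switchover described above.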

  • Both tunnels will remain connected at all times
  • If one tunnel server goes down, the client will automatically switch to sending data through the other; this generally happens almost immediately, but it may take up to one hour before the change is reflected in the Infrastructure Manager/Admin Console. 
  • Alarms and QoS Data will continue to flow upwards, but the hub may appear red/unavailable.

Further explanation of why this can take some time can be found in this article. 


Queue Configuration

Once the redundant tunnels have been established, you should configure queues as follows:

  • On each of the tunnel servers, configure "ATTACH" queues for alarms, qos, discovery, etc - these will be retrieved by the Primary hub (standard queue setup).

  • On the tunnel client, create the same queues - these will be retrieved by the tunnel servers (we will set that up shortly.)

  • On the Primary hub, create the corresponding "GET" queues for both the main and the standby tunnel servers:



  • Next, move to the standby tunnel server first: configure GET queues to retrieve messages from the tunnel client, then immediately deactivate the queues before saving.



  • Now repeat this on the primary/main tunnel server - but in this case, leave the queues activated.  (We had to leave them deactivated in the previous step to allow us to create them here.)



At this point, data should be flowing from the client hub to the Main tunnel server through its GET queues, and from there to the primary hub through the GET queues defined there.
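Put together, the queue layout can be sketched as raw-configure (hub.cfg) fragments. These are illustrative only: the queues are normally created through the hub GUI (which writes equivalent entries), and the queue names and the client robot name ("client-robot", "tunnel-server-backup-robot") are placeholders. Only the alarm queue is shown; qos, discovery, etc. follow the same pattern.

```
--- Tunnel client (and each tunnel server): ATTACH queue ---
<queues>
   <attach_alarm>
      active = yes
      type = attach
      subject = alarm
   </attach_alarm>
</queues>

--- Tunnel-Server-Main: GET queue pulling the client's ATTACH queue (active) ---
<queues>
   <get_client_alarm>
      active = yes
      type = get
      remote_queue_name = attach_alarm
      address = /ExampleDomain/Tunnel-Client-One/client-robot/hub
   </get_client_alarm>
</queues>

--- Tunnel-Server-Backup: identical GET queue, deactivated until HA fails over ---
<queues>
   <get_client_alarm>
      active = no
      type = get
      remote_queue_name = attach_alarm
      address = /ExampleDomain/Tunnel-Client-One/client-robot/hub
   </get_client_alarm>
</queues>

--- Primary hub: GET queues from both tunnel servers (both active) ---
<queues>
   <get_main_alarm>
      active = yes
      type = get
      remote_queue_name = attach_alarm
      address = /ExampleDomain/Tunnel-Server-Main/tunnel-server-robot/hub
   </get_main_alarm>
   <get_backup_alarm>
      active = yes
      type = get
      remote_queue_name = attach_alarm
      address = /ExampleDomain/Tunnel-Server-Backup/tunnel-server-backup-robot/hub
   </get_backup_alarm>
</queues>
```

Note that only one copy of the client-facing GET queue is active at a time; the HA probe's job, configured next, is to activate the standby copy when the Main tunnel server becomes unreachable.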


HA deployment

The next step is to deploy the HA probe to the Standby tunnel server.

The probe will deploy in a deactivated state.  You can leave it deactivated for now, and double-click on it to bring up the configuration GUI.

Go to the 'Configure' tab, and select the Main Tunnel Server as the hub to synchronize with:



Next, under "Queues to enable" add the three GET queues (the ones which we left deactivated earlier):

Next, in the "Options" tab, if you do not have NAS probes on your tunnel servers, make sure to uncheck the NAS AO option:



Otherwise, click OK and then activate the HA probe.

Upon activation, you should see a message indicating that contact was "restored" with the Main tunnel server:

Feb 26 18:24:28:404 0 HA: ****************[ Starting ]**************** 
Feb 26 18:24:29:407 0 HA: INFO: FAILBACK: Connection to '/ExampleDomain/Tunnel-Server-Main/tunnel-server-robot/hub' restored. Issuing state change.

Verification

To validate the setup, first, open the hub probe GUI on the Main tunnel server, and in the "Status" tab, verify that the three GET queues are active:



On the tunnel client the Status tab should show the Main tunnel server connected to the same queues:



Now, to simulate an outage, stop the robot (e.g. stop the underlying Service) on the Main tunnel server.

After a moment, the HA probe log should show that it has detected the outage:

Feb 26 18:29:55:717 0 HA: WARN: FAILOVER: Failed to contact primary hub '/ExampleDomain/Tunnel-Server-Main/tunnel-server-robot/hub': communication error. Issuing state change. 

And if you check the hub GUI/Status tab on the Backup tunnel server you will note that the queues there have activated:

As mentioned above, the Infrastructure Manager client or Admin Console will temporarily show the client hub as unreachable along with the primary tunnel server:

This will normally take around 40-60 minutes to correct itself and allow communication with the client - however, be assured that alarms and data continue to flow from this hub during that time.

Additional Information

Note on Data Origins

In this example, the tunnel servers themselves do not have additional robots attached, so no monitoring data is submitted from their respective origins.  It is assumed that all monitoring data comes from robots attached to the "Client" hub, in which case the Origins will not change.

In the event that you do have robots attached to the tunnel servers, you may need to update/override the Origin on the secondary/standby hub to match that of the first/main hub (assuming that the robots attached to the Main tunnel server will fail over to the standby hub at the same time the tunnels fail over.)

If there are no robots or monitoring data being submitted from the tunnel servers directly this is not necessary.
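Where an override is needed, the origin is a standard setting in the hub probe's raw configuration (hub.cfg); the value below is illustrative and should match whatever origin the Main tunnel server reports:

```
<hub>
   origin = Tunnel-Server-Main
</hub>
```

The origin can also be overridden per robot (in the robot's controller configuration), which may be preferable if only some robots attached to the standby hub need to match the Main tunnel server's origin.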

More information about Origin overrides is available here.