DX UIM Disaster Recovery - questions regarding changes and moving probe configurations

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

We are going to test DR activity. Our primary site is Location1, and all profiles and configuration were set up there. We are now going to shut down Location1 for one week, so which files and which configuration do we need to copy from Location1 to Location2?

Location1 - Primary
Location2 - Secondary
Regarding probe configuration:
- How to move net_connect from Location1 to Location2
- How to move nas from Location1 to Location2

Environment

Release: 20.3 or higher
HA probe
Disaster Recovery

Cause

Guidance

Resolution

Disclaimer: DR configuration is generally out of scope for Broadcom UIM Support, but some general guidance is listed below.

DR - specific questions raised:

Probe configurations
Move of net_connect / nas from one location to another

The HA guide documents failover to an HA setup, which includes a separate Primary Hub, Secondary/Remote Hubs, queues, a database instance, and so on. A standard approach is to place the failover site in a different region or more than 100 miles away, depending on the customer's requirements. There is no definitive or official DR guide for DX UIM at this time. Customers should follow DR principles, and their own environment will dictate the overall architecture.

UIM High Availability (HA) Guide

Note that Active/Passive Microsoft Cluster is preferable to the use of HA.

Disaster Recovery Points of Interest

As long as you create an HA/failover hub as specified in the HA guide, you can copy nas.cfg over to the DR environment, make any necessary adjustments to it, and copy over any required scripts/files.
net_connect is normally best deployed on secondary/remote hubs and not on the Primary hub (since that should be used as a dedicated monitoring server). In that case, you can copy net_connect.cfg over to the DR hub to monitor the same targets, as long as they are all still valid and reachable from the DR environment.
Any probe configurations or probe configuration 'packages,' custom, OS-specific or otherwise, can be saved and copied over to a new DR environment.
wasp instances: for the Admin Console wasp, Operator Console (OC) wasp and/or cabi wasp, make sure that the wasp can connect to the backend database without failure, error, or interference (e.g., from firewalls). For the OC wasp, also make sure that the ump_common section is updated with the correct NimBUS addresses of the required probes.
Disaster recovery relies upon the replication of data and computer processing in an off-premises location not affected by the disaster. When servers go down because of a natural disaster, equipment failure, or cyber attack, a business needs to recover lost data from a second location where the data has been backed up.
There are disaster recovery options available to customers in many types of environments including private and public cloud environments as well.
In general, active/passive strategies use an active site to host the workload and serve traffic. The passive site (e.g., in a different region) is used for recovery. The passive site does not actively serve traffic until a failover event is triggered.
With High Availability there is usually a bit of recovery 'crossover' and the High Availability guidance combined with a 'Build & Integration' guide usually suffice. A Build & Integration guide is basically screenshots of the installation and screenshots of configurations, for example, the message queues, hub tunnels, etc, which are specific for each customer.
From one perspective, Disaster Recovery (DR) is when HA has failed (maybe both the Primary and Secondary/HA are down and/or have been lost).
'DR' focus for UIM monitoring centers on rebuilding those hubs that have been lost and reconnecting to the database or recovering the database as well as the hubs.
DR is all about recovering configuration from backups and applying the configurations to a new build of the components that have failed.
If a customer needs to test failover and failback, then new virtual machines need to be provisioned, and the environment rebuilt using them (to test the 'worst-case' scenario). This is all assuming that the new servers have the same IP addresses.
It’s pretty rare to be in this situation, as we know the world of virtualization is very capable and we can recover from a recent snapshot.
A 'DR guide' would be different for every customer depending on the industry, customer requirements, security, regulations, network, environment, etc. That being stated, any sizeable environment does need some sort of plan, however brief.
DR is usually architected and delivered by companies/partners/services consultants that have a lot of experience with DR.
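
As an illustration of the nas.cfg and net_connect.cfg copy steps described above, here is a minimal Python sketch. It assumes both hub installation trees are reachable as file system paths (for example via a mounted share or a restored backup). The function name, the install roots, and the relative probe paths shown are assumptions for illustration; check the actual probe directories in your own environment before using anything like this.

```python
import shutil
from datetime import datetime
from pathlib import Path


def copy_probe_configs(source_root, target_root, relative_cfgs):
    """Copy probe .cfg files into the DR hub tree, backing up existing files first."""
    copied = []
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    for rel in relative_cfgs:
        src = Path(source_root) / rel
        dst = Path(target_root) / rel
        if not src.is_file():
            # The probe is not deployed on the source hub: nothing to copy.
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        if dst.exists():
            # Keep the DR hub's current file so the copy can be rolled back.
            shutil.copy2(dst, dst.with_name(f"{dst.name}.{stamp}.bak"))
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied


# Hypothetical relative paths; verify them against your installation.
PROBE_CFGS = [
    "probes/service/nas/nas.cfg",
    "probes/network/net_connect/net_connect.cfg",
]
```

A call such as copy_probe_configs("/mnt/location1/nimsoft", "/opt/nimsoft", PROBE_CFGS) would return the list of files written on the DR hub. The copied files still need to be reviewed by hand, since addresses, paths and targets may differ in the DR environment.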

Additional Information

DX UIM does not have any built-in DR features, so, as with most other products, you can incorporate a separate DR product. There are a few enterprise DR software packages available. We have seen VMware SRM (Site Recovery Manager) move UIM infrastructure servers to other data centers on the fly, and the servers continued working with full capabilities after a few tweaks. This needs to be paired with a DR tool for the UIM database (for example, MS clustering for MS SQL), which is also sold separately.

Keep in mind that UIM is a suite of products deployed as a single solution. When considering DR, attend to each of the pieces rather than the whole. It will make the whole solution easier to craft. If you lost your Primary data center, do you need the ability to run reports in CABI? Or Internet access to the web interface? Maybe not...

Keep in mind realistic function requirements. Does your monitoring infrastructure need a 15-minute function restoration or could it tolerate three days? The reality is probably closer to days than minutes. After all there's usually a human interacting with those monitored systems that, while maybe frustrated, will tell you about any issues they find.

We suggest avoiding tunnels and hub Name Services where possible; in our experience the hubs still find each other, provided no firewall/iptables rules get in the way. With tunnels, you could have two sets for each client, one for each of the two IPs, and toggle between them as needed. The Primary hub, OC and CABI all seem to reconnect fine when hostnames are used. Certificates can be created in keytool with both IPs included. Robots can be a pain: if they were not part of a previous DR failover and do not know that the other IP can be assigned, you need to “security verify” them. This can be done via a probe utility (pu) script.
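
To illustrate creating a certificate that covers both IPs, the following Python sketch builds a keytool command line whose Subject Alternative Name extension lists the hostname and both addresses. The -genkeypair and -ext SAN= options are standard keytool options; the keystore file name, alias, hostname and IP addresses are placeholders, and the function itself is hypothetical. The sketch only builds the argument list and does not run keytool.

```python
def build_keytool_command(keystore, alias, hostname, ip_addresses, validity_days=365):
    """Build a keytool argument list for a key pair whose SAN covers all given IPs."""
    san_entries = [f"dns:{hostname}"] + [f"ip:{ip}" for ip in ip_addresses]
    return [
        "keytool", "-genkeypair",
        "-alias", alias,
        "-keyalg", "RSA",
        "-keysize", "2048",
        "-validity", str(validity_days),
        "-dname", f"CN={hostname}",
        "-keystore", keystore,
        # One SAN extension listing the hostname and both site IPs.
        "-ext", "SAN=" + ",".join(san_entries),
    ]


# Placeholder values: substitute the real hostname and the IP used at each site.
command = build_keytool_command(
    "wasp.keystore", "wasp", "uim-oc.example.com", ["10.1.0.10", "10.2.0.10"]
)
print(" ".join(command))
```

Run interactively, keytool prompts for the keystore and key passwords, so none are embedded here.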

Revalidate probes via script when ...'FAILED to start file check determines changes in the probe'
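
A revalidation script of this kind could be sketched in Python as follows. The sketch only generates pu command lines, one per robot and probe, so that they can be reviewed before anything is run. The -u and -p options and the /domain/hub/robot/controller address format follow normal pu usage. The callback name probe_verify is an assumption here and should be confirmed against the article referenced above; the path to pu, the user, and all addresses are placeholders.

```python
def build_revalidate_commands(pu_path, user, password, controller_addresses,
                              probe_names, callback="probe_verify"):
    """Build one pu command per (robot controller, probe) pair without running it."""
    commands = []
    for address in controller_addresses:
        for probe in probe_names:
            # The callback name is an assumption; confirm it before use.
            commands.append(
                [pu_path, "-u", user, "-p", password, address, callback, probe]
            )
    return commands


# Placeholder NimBUS addresses of the robot controllers to revalidate.
controllers = [
    "/MyDomain/Hub2/robot01/controller",
    "/MyDomain/Hub2/robot02/controller",
]
commands = build_revalidate_commands(
    "/opt/nimsoft/bin/pu", "administrator", "CHANGE_ME", controllers, ["cdm"]
)
for cmd in commands:
    print(" ".join(cmd))
```

In a real script the password should come from a prompt or a protected store rather than a literal, and each generated command could be passed to subprocess.run once it has been checked.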

This is all assuming your DR maintains the same IPs on each failover, as most of our clients do.

Failover of the UIM backend database has to be completely transparent to UIM, and the database should have its own DR facility. Some of our clients use an MS SQL cluster, so that the logical name UIM connects to points to the database in either data center 1 or data center 2, wherever the failover has placed it.
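
One simple check after a database failover is to confirm that the logical database name is reachable from the UIM servers. The Python sketch below tests only TCP reachability, not authentication or database health. The hostname is a placeholder, and port 1433 is the MS SQL Server default, which may differ in your environment.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder logical name of the clustered database listener.
    db_host, db_port = "uimdb.example.com", 1433
    state = "reachable" if can_reach(db_host, db_port) else "NOT reachable"
    print(f"{db_host}:{db_port} is {state}")
```

Running the same check from the Primary hub and from each wasp server helps separate firewall problems from database problems.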

The DX UIM Support team does not provide a DR option. For DR, you need to use third-party products, for example: VMware Site Recovery Manager, SQL clustering or other clustering solutions, and now Azure and AWS.

We do provide a Fault Tolerant solution or High Availability (HA). Note that there is a vast difference between definitions for DR and FT/HA. Some organizations need DR, but also incorporate a fault-tolerant HA solution, e.g., for Primary and Secondary failover for a short period, e.g., 1-3 hours. You can easily find the DX UIM HA Guide via web search.

If you need further guidance or expert advice, please contact your Account Director and ask to be put in touch with one of our 'Residents.'