Rebuild a Fault Tolerant Data Aggregator pair when one server failed

Article ID: 410281

Updated On:

Products

Network Observability CA Performance Management

Issue/Introduction

One of the servers in a Fault Tolerant (FT) Data Aggregator (DA) cluster has crashed and needs to be rebuilt on a new host. The immediate goal is to decommission the existing DA server that is down and reinstall the DA on a new host with the same PM version.

Then reconnect it to the existing DA cluster. Given that I have backed up the DA config files and can restore all of them, will this work?

One of the Fault Tolerant (FT) Data Aggregator (DA) servers failed. We need to replace the failed FT DA host with a new host and rebuild the FT DA pair and its communications.

How do we replace a failed DA when it is one of an FT DA pair?

Environment

All supported Network Observability DX NetOps Performance Management Data Aggregator releases

Cause

The server failed in a state that requires replacing it rather than rebuilding it.

Resolution

  1. Prevent the proxy from trying to restart the active DA, and from starting the new DA after the install.
    1. Shut down the consul-ext and consul services on the working active DA.
    2. Shut down the consul services on the proxy host server.
      • Leave the proxy server's daproxy service running. Do NOT shut it down.
  2. Rebuild the failed DA by installing a new DA on the new host.
    • Ensure the new DA host meets all prerequisites, including the required port access and access to the shared data directory.
    • When installing the new DA, provide the correct answers to the installer prompts:
      • Answer Yes to "Would you like to configure Data Aggregator with fault tolerance?"
      • Provide the proxy host name for "Data Aggregator proxy host :"
      • Specify the same shared data directory that the working DA uses when asked for the path. Both FT DAs must use the same shared data directory.
      • Provide the correct Data Repository (DR) database host name(s) for "Data repository server hostname/IP :".
  3. Shut down the new DA's consul and consul-ext services only.
  4. Regenerate the ACL token via the bootstrap process, using the steps from the following article. It should be possible to continue from step 3 in that article's Solution field.
  5. Once completed, use consul commands to confirm that the correct DAs and proxy are listed.
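
The service shutdowns in steps 1 and 3 are easy to get wrong, so it can help to write the plan down before touching any host. The following is a minimal dry-run sketch, not an official tool. The host names are hypothetical placeholders, and it assumes the services are systemd units named exactly as in this article (consul, consul-ext, daproxy). It also assumes the proxy host only needs consul stopped; adjust the list if the proxy runs consul-ext as well. Nothing is executed; the commands are only printed.

```python
# Dry-run plan for the service shutdowns in steps 1 and 3.
# Host names below are hypothetical; replace them with the real hosts.
# Each entry: (step label, host, services to stop, in order).
STOP_PLAN = [
    ("1.1", "active-da.example.com", ["consul-ext", "consul"]),
    ("1.2", "da-proxy.example.com", ["consul"]),  # assumption: consul only
    ("3", "new-da.example.com", ["consul", "consul-ext"]),  # after the install
]

# The article states that daproxy must stay running on the proxy host.
KEEP_RUNNING = {"daproxy"}


def build_commands(plan):
    """Return (step, host, command) tuples without executing anything."""
    commands = []
    for step, host, services in plan:
        for service in services:
            if service in KEEP_RUNNING:
                raise ValueError(f"step {step}: {service} must not be stopped")
            # Assumption: the services are systemd units with these names.
            commands.append((step, host, f"systemctl stop {service}"))
    return commands


if __name__ == "__main__":
    for step, host, command in build_commands(STOP_PLAN):
        print(f"step {step}  {host}: {command}")
```

Keeping the plan as data makes it simple to review with a colleague, and the guard on daproxy catches the one shutdown the procedure explicitly forbids.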

Note that it can take some time after completing this process before the Data Aggregator table in the Portal System Status page shows the correct FT DA pair.
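
The confirmation in step 5 can be scripted by parsing the text output of `consul members`, which lists each node with its address and status. The sketch below is illustrative only: the node names, addresses, and build numbers in `SAMPLE` are made up, not captured from a real system. It assumes the consul binary is on the PATH; with ACLs enabled, a valid token may also be needed (for example via the `CONSUL_HTTP_TOKEN` environment variable).

```python
import subprocess

# Illustrative output only; node names, addresses, and versions are invented.
SAMPLE = """\
Node      Address          Status  Type    Build  Protocol  DC   Segment
da-1      10.0.0.11:8301   alive   server  1.9.5  2         dc1  <all>
da-2      10.0.0.12:8301   failed  server  1.9.5  2         dc1  <all>
da-proxy  10.0.0.20:8301   alive   server  1.9.5  2         dc1  <all>
"""


def parse_members(output):
    """Map node name -> status from `consul members` text output."""
    lines = [line for line in output.splitlines() if line.strip()]
    members = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3:
            members[fields[0]] = fields[2]
    return members


def missing_or_unhealthy(members, expected):
    """Return expected nodes that are absent or not in the 'alive' state."""
    return sorted(node for node in expected if members.get(node) != "alive")


def run_consul_members():
    """Run the real command; call this on a host in the cluster."""
    result = subprocess.run(
        ["consul", "members"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    members = parse_members(SAMPLE)  # swap in run_consul_members() on a host
    print(missing_or_unhealthy(members, ["da-1", "da-2", "da-proxy"]))
```

An empty list means both DAs and the proxy are listed and alive. The Portal System Status page may still lag behind this result for a while.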

Additional Information

When installing the new second FT DA to replace the failed one, no special steps are needed related to its use of the shared data directory.

  • The install or upgrade of the second DA does not overwrite the contents of the shared data directory. The installer should detect that it is the second DA, not the first.
  • A key to this working is running the DA installer from the same release as the remaining working DA.
  • After the install, the new DA will likely show problems related to ACL token validity.
    • Ignore these until a new token has been generated as part of the failed FT DA replacement process.
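
Two of the points above, the shared data directory and the matching release, can be checked on the new host before the installer is started. The following is a rough pre-install sketch under stated assumptions: the directory path and release strings are hypothetical values supplied by the operator, and the access check only reflects the user running the script, which may differ from the user the DA runs as.

```python
import os


def check_shared_dir(path):
    """Return a list of problems with the shared data directory (empty if none)."""
    if not os.path.isdir(path):
        return [f"shared data directory not found or not mounted: {path}"]
    # Checks the current user only; the DA may run as a different user.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        return [f"insufficient permissions on shared data directory: {path}"]
    return []


def check_same_release(working_release, installer_release):
    """Return a problem if the installer release differs from the working DA."""
    if working_release.strip() != installer_release.strip():
        return [
            f"release mismatch: working DA is {working_release}, "
            f"installer is {installer_release}"
        ]
    return []


def preflight(shared_dir, working_release, installer_release):
    """Combine both checks into one list of problems."""
    return check_shared_dir(shared_dir) + check_same_release(
        working_release, installer_release
    )


if __name__ == "__main__":
    # Hypothetical values; replace with the real path and release strings.
    for problem in preflight("/path/to/shared/da-data", "1.2.3", "1.2.3"):
        print("PROBLEM:", problem)
```

The release strings are compared as plain text because the article only requires that both DAs run the same release, not that one is newer than the other.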