Recovering a multi-write peer database that has been down for some time and resynchronizing its multi-write peers

Products

CA Directory, CA Security Command Center, CA Data Protection (DataMinder), CA User Activity Reporting

Issue/Introduction

Scenario

This page gives an example of how to recover a peer that has been down for a long time, long enough that the multi-write queues held for it on the master have overflowed.

Slight variations of this procedure also apply to a peer on a machine that needs to be rebuilt (the extra step is a re-install) and to bringing a new peer into a replication set (the extra steps are adding the new peer to the knowledge file and re-initializing the servers).

Topology

Example (preferred master)
Example2
Example3
Example4

Background for scenario

Example4 has been offline for quite some time. Example (preferred master) has been queuing updates for Example4, but the queue filled completely, so Example marked Example4 as OFFLINE and then deleted its Example4 multi-write queue. Example4's database must therefore be recovered manually. The following steps recover Example4 and ensure that all updates made after the re-sync process are applied to it, so that its database is synchronized with the rest of the multi-write peers.

Indications that there is a problem

Output from Example's (preferred master) alarm log reads:

20060816.104325 MW: Buffer (EXAMPLE4) greater than 60% full
20060816.104325 MW: Buffer (EXAMPLE4) greater than 70% full
20060816.104325 MW: Buffer (EXAMPLE4) greater than 80% full
20060816.104326 MW: Buffer (EXAMPLE4) greater than 90% full
20060816.104326 MW: Buffer (EXAMPLE4) greater than 100% full
20060816.104326 MW: Operation disabled for DSA 'EXAMPLE4'

This indicates that Example4 is now out of sync with the rest of the multi-write set and needs to be recovered manually.
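As a rough illustration, the alarm log can be scanned for the "Operation disabled" message to list the peers that need manual recovery. This is a minimal sketch that assumes the log line format shown above; the function name `disabled_peers` and the way the log lines are obtained are hypothetical.

```python
import re

# Matches the alarm-log line that marks a peer as disabled, e.g.
# "20060816.104326 MW: Operation disabled for DSA 'EXAMPLE4'"
# (format assumed from the sample log in this article).
DISABLED = re.compile(r"^(\d{8}\.\d{6}) MW: Operation disabled for DSA '([^']+)'")


def disabled_peers(log_lines):
    """Return {peer_name: timestamp} for every peer the log reports as disabled."""
    found = {}
    for line in log_lines:
        match = DISABLED.match(line.strip())
        if match:
            timestamp, peer = match.groups()
            found[peer] = timestamp  # a later line for the same peer overwrites
    return found


sample = [
    "20060816.104326 MW: Buffer (EXAMPLE4) greater than 90% full",
    "20060816.104326 MW: Buffer (EXAMPLE4) greater than 100% full",
    "20060816.104326 MW: Operation disabled for DSA 'EXAMPLE4'",
]
print(disabled_peers(sample))  # {'EXAMPLE4': '20060816.104326'}
```

The buffer-percentage warnings are deliberately ignored here, since only the final "Operation disabled" line indicates that the queue has been dropped.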

Environment

Release:
Component: ETRDIR

Resolution

Recovery Process

It is assumed that the failed peer (Example4) is shut down.

Init the preferred master

Issue dxserver init on Example to reset the queue status.

dxserver init example
Prior to the init, the queues read:
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE4(): **QUEUE-PURGED-OUT-OF-ORDER**, total 0, waiting remote 0, confirmed local 0
After the init, the queues read:
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE4(): RECOVERING, total 0, waiting remote 0, confirmed local 0

Initializing the preferred master before shutting down the syncing DSA (Example3) ensures that all future updates chained by Example are captured for Example4 when it is brought online.

Note that enough time must be left between the init and the shutdown of the good peer for any updates that were outstanding before the init to be processed on the good peer.

Shut down, dump and restart a good peer

Shut down Example3.

dxserver stop example3

At this point, updates are being queued for Example3 as well as Example4, as can be seen from the Example DSA's console:

dsa>get dsp;
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
EXAMPLE4(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1

Dump Example3 data with operational attributes
```
dxdumpdb -O example3 -f data.ldif
```

Restart Example3

dxserver start example3

Example3 should quickly get back into sync, as can be seen from the Example DSA's console:

dsa>get dsp;
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE4(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1

Load and restart the failed peer
1. Sort the data
```
ldifsort data.ldif data-sorted.ldif
```
2. Load the data
```
dxloaddb -p <c "AU"><o "Example"> -a 15 -n 1277 data-sorted.ldif Example4
```
3. Start Example4
  
  After the appropriate retry time, Example synchronizes the outstanding multi-write queue contents with Example4. Note that during the resynchronization process a small number of errors may be reported, because operations that were already applied on the good peer before it was shut down and dumped are replayed. These can safely be ignored.
```
dsa>get dsp;
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE4(): OK, total 0, waiting remote 0, confirmed local 0
```
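The queue status lines can also be checked programmatically rather than by eye. The following is a minimal sketch that assumes the `get dsp` output has been captured as text in the format shown in this article; the helper names `parse_queues` and `all_in_sync` are hypothetical and not part of the product.

```python
import re

# One status line per peer, e.g.
# "EXAMPLE4(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1"
LINE = re.compile(
    r"^(?P<peer>\w+)\(\): (?P<state>[^,]+), total (?P<total>\d+), "
    r"waiting remote (?P<waiting>\d+), confirmed local (?P<confirmed>\d+)$"
)


def parse_queues(text):
    """Map each peer name to (state, total); lines that do not match are skipped."""
    queues = {}
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if match:
            state = match["state"].strip("*")  # drop the ** emphasis markers
            queues[match["peer"]] = (state, int(match["total"]))
    return queues


def all_in_sync(text):
    """True when at least one peer is listed and every peer is OK with an empty queue."""
    queues = parse_queues(text)
    return bool(queues) and all(q == ("OK", 0) for q in queues.values())


during = """dsa>get dsp;
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
EXAMPLE4(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1"""

after = """dsa>get dsp;
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE4(): OK, total 0, waiting remote 0, confirmed local 0"""

print(all_in_sync(during))  # False
print(all_in_sync(after))   # True
```

A check like this could be repeated after the retry time has elapsed, until every peer reports OK with nothing queued.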

Check that the databases are in sync

Database statistics for the four databases should then be compared to ensure that they are in sync. A high-level comparison can be obtained using dxstatdb. Example statistics are shown below for reference.

dxstatdb example

Statistics:

Number of attributes types =      17
Number of entries =               1293
Number of node entries =          101
Number of leaf entries =          1192
Number of alias entries =         0
Number of level 1 entries =       15
Number of level 2 entries =       90
Number of level 3 entries =       1188
Number of level 4+ entries =      0
Number of values =                12208
Number of blob (>2K) values =     1

dxstatdb example2

Statistics:

Number of attributes types =      17
Number of entries =               1293
Number of node entries =          101
Number of leaf entries =          1192
Number of alias entries =         0
Number of level 1 entries =       15
Number of level 2 entries =       90
Number of level 3 entries =       1188
Number of level 4+ entries =      0
Number of values =                12208
Number of blob (>2K) values =     1

dxstatdb example3

Statistics:

Number of attributes types =      17
Number of entries =               1293
Number of node entries =          101
Number of leaf entries =          1192
Number of alias entries =         0
Number of level 1 entries =       15
Number of level 2 entries =       90
Number of level 3 entries =       1188
Number of level 4+ entries =      0
Number of values =                12208
Number of blob (>2K) values =     1

dxstatdb example4

Statistics:

Number of attributes types =      17
Number of entries =               1293
Number of node entries =          101
Number of leaf entries =          1192
Number of alias entries =         0
Number of level 1 entries =       15
Number of level 2 entries =       90
Number of level 3 entries =       1188
Number of level 4+ entries =      0
Number of values =                12208
Number of blob (>2K) values =     1
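
Comparing four blocks of counters by eye is error-prone, so the comparison can be scripted. The sketch below assumes the dxstatdb output has been captured as text in the "Number of ... = N" format shown above; `parse_stats` and `differences` are hypothetical helper names, and the mismatching peer values are invented purely to show what a difference looks like.

```python
import re

# Matches counter lines such as "Number of entries =               1293".
STAT = re.compile(r"^Number of (.+?)\s*=\s*(\d+)$")


def parse_stats(text):
    """Turn 'Number of X = N' lines into {X: N}; other lines are ignored."""
    stats = {}
    for raw in text.splitlines():
        match = STAT.match(raw.strip())
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def differences(reference, other):
    """Return {counter: (reference_value, other_value)} for counters that differ."""
    keys = reference.keys() | other.keys()
    return {
        k: (reference.get(k), other.get(k))
        for k in sorted(keys)
        if reference.get(k) != other.get(k)
    }


master = parse_stats("""Statistics:
Number of entries =               1293
Number of values =                12208
Number of blob (>2K) values =     1""")

# Illustrative mismatch only; not taken from a real system.
peer = parse_stats("""Statistics:
Number of entries =               1290
Number of values =                12208
Number of blob (>2K) values =     1""")

print(differences(master, master))  # {}
print(differences(master, peer))    # {'entries': (1293, 1290)}
```

An empty result for every peer against the preferred master corresponds to the matching statistics listed above. Matching counters are only a high-level indication and do not prove that the entries are identical.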

Additional Information

Conclusions:

Following the above steps will ensure that:

Example4 is completely resynchronized with the other three DSAs,
Example4 is back in the multi-write set, and
all updates are actively being chained by Example to all three multi-write DSAs.
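
For reference, the order of the commands used in this article can be captured as a dry-run plan. This is only a sketch: it builds the command lines and prints them without running anything, the function name `recovery_plan` and its parameters are hypothetical, and the dxloaddb options are passed through exactly as supplied because they depend on the site's prefix and data.

```python
import shlex


def recovery_plan(master, good_peer, failed_peer, load_options,
                  dump_file="data.ldif", sorted_file="data-sorted.ldif"):
    """Return the article's commands, in order, as argument lists (nothing is executed)."""
    return [
        ["dxserver", "init", master],            # reset the queue status on the master
        # Allow time here for outstanding updates to reach the good peer.
        ["dxserver", "stop", good_peer],
        ["dxdumpdb", "-O", good_peer, "-f", dump_file],
        ["dxserver", "start", good_peer],
        ["ldifsort", dump_file, sorted_file],
        ["dxloaddb", *load_options, sorted_file, failed_peer],
        ["dxserver", "start", failed_peer],
    ]


# Load options copied from the example in this article; adjust for the real prefix.
options = ["-p", '<c "AU"><o "Example">', "-a", "15", "-n", "1277"]
plan = recovery_plan("example", "example3", "example4", options)
for command in plan:
    print(shlex.join(command))
```

Printing the plan first makes it easier to review the sequence before any server is stopped. The waits between steps (after the init, and for the retry time after the failed peer starts) still have to be observed manually.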