Scenario
This page gives an example of how to recover a peer that has been down for so long that its queue on the master has overflowed.
Slight variations of this procedure also apply to a peer on a machine that needs to be rebuilt (the extra step is a re-install) and to bringing a new peer into a replication set (the extra steps are adding the new peer to the knowledge file and re-initializing the servers).
Topology
Background for scenario
Example4 has been offline for quite some time. Example (the preferred master) has been queuing updates for Example4, but the queue has filled completely, so Example has marked Example4 as OFFLINE and deleted its Example4 multi-write queue. Example4's database must therefore be recovered manually. The following steps recover Example4 and ensure that all updates made after the re-sync process are applied to it, so that its database is synchronized with the rest of the multi-write peers.
Indications that there is a problem
Output from Example's (preferred master) alarm log reads:
20060816.104325 MW: Buffer (EXAMPLE4) greater than 60% full
20060816.104325 MW: Buffer (EXAMPLE4) greater than 70% full
20060816.104325 MW: Buffer (EXAMPLE4) greater than 80% full
20060816.104326 MW: Buffer (EXAMPLE4) greater than 90% full
20060816.104326 MW: Buffer (EXAMPLE4) greater than 100% full
20060816.104326 MW: Operation disabled for DSA 'EXAMPLE4'
This indicates that Example4 is now out of sync with the rest of the multi-write set and needs to be recovered manually.
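The check described above can be automated by scanning the alarm log for the "Operation disabled" message. The following is a minimal sketch, assuming the log lines have exactly the format shown above; the function name `find_disabled_peers` and the sample variable are hypothetical and not part of the product.

```python
import re

# Matches alarm-log lines such as:
#   20060816.104326 MW: Operation disabled for DSA 'EXAMPLE4'
# The pattern is an assumption based only on the sample lines on this page.
DISABLED_RE = re.compile(
    r"^(?P<stamp>\d{8}\.\d{6}) MW: Operation disabled for DSA '(?P<dsa>[^']+)'"
)


def find_disabled_peers(log_text):
    """Return {dsa_name: timestamp} for peers whose multi-write was disabled."""
    disabled = {}
    for line in log_text.splitlines():
        match = DISABLED_RE.match(line.strip())
        if match:
            disabled[match.group("dsa")] = match.group("stamp")
    return disabled


SAMPLE_LOG = """\
20060816.104326 MW: Buffer (EXAMPLE4) greater than 90% full
20060816.104326 MW: Buffer (EXAMPLE4) greater than 100% full
20060816.104326 MW: Operation disabled for DSA 'EXAMPLE4'
"""

print(find_disabled_peers(SAMPLE_LOG))
```

Any DSA returned by this function is a candidate for the manual recovery described in the rest of this page.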
Recovery Process
It is assumed that the failed peer (Example4) is shut down.
dxserver init example
Initializing the preferred master before shutting down the syncing DSA (Example3) ensures that all future updates chained by Example are captured for Example4 when it is brought online.
Prior to init, the queues read:
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE4(): **QUEUE-PURGED-OUT-OF-ORDER**, total 0, waiting remote 0, confirmed local 0
Post init, the queues read:
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE4(): RECOVERING, total 0, waiting remote 0, confirmed local 0
dxserver stop example3
At this point, updates are being queued for Example3 as well as Example4, as can be seen from Example DSA's console:
dsa>get dsp;
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
EXAMPLE4(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
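When these queue listings are checked repeatedly during a recovery, it can help to parse them rather than read them by eye. The following is a minimal sketch, assuming each queue line has the format shown in the console output above; `parse_dsp` and `peers_not_ok` are hypothetical helper names, and the script only reads text that has already been captured from the console.

```python
import re

# Matches console lines such as:
#   EXAMPLE3(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
# The pattern is an assumption based only on the sample output on this page.
DSP_RE = re.compile(
    r"^(?P<dsa>\w+)\(\): (?P<status>[^,]+), total (?P<total>\d+), "
    r"waiting remote (?P<waiting>\d+), confirmed local (?P<confirmed>\d+)$"
)


def parse_dsp(console_text):
    """Return {dsa_name: {status, total, waiting, confirmed}} for each queue line."""
    queues = {}
    for line in console_text.splitlines():
        match = DSP_RE.match(line.strip())
        if match:
            queues[match.group("dsa")] = {
                # Drop the ** markers that the console puts around error states.
                "status": match.group("status").strip("*"),
                "total": int(match.group("total")),
                "waiting": int(match.group("waiting")),
                "confirmed": int(match.group("confirmed")),
            }
    return queues


def peers_not_ok(queues):
    """Return the sorted names of peers whose queue status is not OK."""
    return sorted(name for name, q in queues.items() if q["status"] != "OK")


SAMPLE_CONSOLE = """\
dsa>get dsp;
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
EXAMPLE4(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
"""

queues = parse_dsp(SAMPLE_CONSOLE)
print(peers_not_ok(queues))
```

An empty list from `peers_not_ok` corresponds to the state in which every peer reports OK.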
dxdumpdb -O example3 -f data.ldif
dxserver start example3
Example3 should quickly get back into sync, as can be seen from Example DSA's console:
dsa>get dsp;
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE4(): OK, total 0, waiting remote 0, confirmed local 0
MOCOR4(): **MW-FAILED**, total 1, waiting remote 0, confirmed local 1
ldifsort data.ldif data-sorted.ldif
dxloaddb -p <c "AU"><o "Example"> -a 15 -n 1277 data-sorted.ldif Example4
dsa>get dsp;
EXAMPLE2(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE3(): OK, total 0, waiting remote 0, confirmed local 0
EXAMPLE4(): OK, total 0, waiting remote 0, confirmed local 0
dxstatdb example
Statistics:
Number of attributes types = 17
Number of entries = 1293
Number of node entries = 101
Number of leaf entries = 1192
Number of alias entries = 0
Number of level 1 entries = 15
Number of level 2 entries = 90
Number of level 3 entries = 1188
Number of level 4+ entries = 0
Number of values = 12208
Number of blob (>2K) values = 1
dxstatdb example2
Statistics:
Number of attributes types = 17
Number of entries = 1293
Number of node entries = 101
Number of leaf entries = 1192
Number of alias entries = 0
Number of level 1 entries = 15
Number of level 2 entries = 90
Number of level 3 entries = 1188
Number of level 4+ entries = 0
Number of values = 12208
Number of blob (>2K) values = 1
dxstatdb example3
Statistics:
Number of attributes types = 17
Number of entries = 1293
Number of node entries = 101
Number of leaf entries = 1192
Number of alias entries = 0
Number of level 1 entries = 15
Number of level 2 entries = 90
Number of level 3 entries = 1188
Number of level 4+ entries = 0
Number of values = 12208
Number of blob (>2K) values = 1
dxstatdb example4
Statistics:
Number of attributes types = 17
Number of entries = 1293
Number of node entries = 101
Number of leaf entries = 1192
Number of alias entries = 0
Number of level 1 entries = 15
Number of level 2 entries = 90
Number of level 3 entries = 1188
Number of level 4+ entries = 0
Number of values = 12208
Number of blob (>2K) values = 1
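The four statistics listings above are identical, which is the expected result once all peers are synchronized. The comparison can be sketched in code as follows, assuming the output of each dxstatdb run has been captured as text in the "label = value" layout shown above; `parse_stats` and `differing_counters` are hypothetical helper names, and the second sample is made up purely to show what a mismatch looks like.

```python
def parse_stats(output):
    """Turn 'label = value' lines into a {label: int} dictionary."""
    stats = {}
    for line in output.splitlines():
        if "=" not in line:
            continue  # skips the 'Statistics:' header and blank lines
        label, _, value = line.partition("=")
        stats[label.strip()] = int(value.strip())
    return stats


def differing_counters(stats_by_peer):
    """Return {label: {peer: value}} for counters that differ between peers."""
    labels = set()
    for stats in stats_by_peer.values():
        labels.update(stats)
    diffs = {}
    for label in sorted(labels):
        values = {peer: stats.get(label) for peer, stats in stats_by_peer.items()}
        if len(set(values.values())) > 1:
            diffs[label] = values
    return diffs


# Excerpt of the figures shown on this page.
EXAMPLE_OUTPUT = """\
Statistics:
Number of entries = 1293
Number of values = 12208
Number of blob (>2K) values = 1
"""

# Made-up output for a peer that is missing three entries (illustration only).
STALE_OUTPUT = """\
Statistics:
Number of entries = 1290
Number of values = 12208
Number of blob (>2K) values = 1
"""

in_sync = {"example": parse_stats(EXAMPLE_OUTPUT), "example4": parse_stats(EXAMPLE_OUTPUT)}
out_of_sync = {"example": parse_stats(EXAMPLE_OUTPUT), "example4": parse_stats(STALE_OUTPUT)}

print(differing_counters(in_sync))
print(differing_counters(out_of_sync))
```

Matching counters do not prove that the entries themselves are identical, so this is a quick sanity check rather than a full verification.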
Conclusions:
Following the above steps will ensure that: