Article ID: 296507
Issue/Introduction
In Greenplum versions 4 and 5, gprecoverseg uses the pg_changetracking log to determine which changes need to be resynchronized to the mirror during an incremental recovery.
In one case, a restore filled up the segment directories and put the database into recovery mode, which caused the changetracking log to grow very large. In that case only the mirrors were affected; the primaries were intact and never failed, so this was not a double-fault scenario.
gprecoverseg will run for an hour before it times out, even when invoked as gprecoverseg -i with an input file that points to only one segment. If after an hour you see an error such as "failure: timeout Retrying no 1 failure: OtherTransitionInProgress failure: OtherTransitionInProgress", the recovery has timed out: it has run for an hour and made no progress. One way to confirm this is that gpstate -e shows no change.
The likely root cause can be confirmed on a segment host by checking the size of the pg_changetracking directory. If that directory is 5-7 GB in size, the oversized changetracking log is what is causing gprecoverseg to time out.
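One way to check the directory size (a minimal sketch; the paths are illustrative and should be adjusted to your own segment data directory layout):
du -sh /data/primary/gpseg*/pg_changetracking
du -sh /data/mirror/gpseg*/pg_changetracking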
Resolution
First, kill any recoverseg processes left over by stopping the affected mirror segment:
pg_ctl -D </directory/ofMirror/segNo> stop -m fast
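For example, to confirm nothing is still running and then stop one affected mirror (the mirror path and segment number below are illustrative):
ps -ef | grep -E 'gprecoverseg|pg_ctl' | grep -v grep
pg_ctl -D /data/mirror/gpseg21 stop -m fast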
Next, run gprecoverseg -o <filename> to create a file listing all the segments that need to be recovered. Open that file; if you use vim, you can run:
:sort
:%s!^!#!
This sorts the lines by host and comments out every line (an equivalent shell sketch follows the example below). After that, uncomment the first line (the filespaceOrder line) and between 1 and 10 of the segment lines, so the file looks like this:
filespaceOrder=<hdd_fs:ssd_fs>
sdw1:5012:/data/mirror/gpseg21
#sdw1:5013:/data/mirror/gpseg22
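If you prefer to prepare the file outside of vim, an equivalent shell sketch (the file name recover_config is illustrative, and GNU sort/sed are assumed):
sort -o recover_config recover_config
sed -i 's/^/#/' recover_config
You still need to uncomment the filespaceOrder line and the first batch of segment lines by hand.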
Then run:
gprecoverseg -i <filename> -F
The -F flag clears out the changetracking log and rewrites the segment in full. If the recoverseg completes quickly and without problems, feel free to remove the # in front of more lines and run the recoverseg in bigger batches. Also, do not forget to remove or comment out the lines for segments that have already been recovered.
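A typical batch cycle might look like this (the file name recover_config is again illustrative); gpstate -e, mentioned above, can be run periodically from another session to watch resynchronization progress:
gprecoverseg -i recover_config -F
gpstate -e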