$ gpstate -e
...
20190313:13:08:25:114088 gpstate:greenplum-aio-02:gpadmin-[INFO]:-Primaries in Change Tracking
20190313:13:08:25:114088 gpstate:greenplum-aio-02:gpadmin-[INFO]:-   Current Primary   Port    Change tracking size   Mirror   Port
20190313:13:08:25:114088 gpstate:greenplum-aio-02:gpadmin-[INFO]:-   sdw1              20000   32.1 kB                sdw1     21000

Running gprecoverseg to bring up the mirror will not report any issues.
$ gprecoverseg -a
...
20190313:13:08:53:114208 gprecoverseg:greenplum-aio-02:gpadmin-[INFO]:-Done updating primaries
20190313:13:08:55:114208 gprecoverseg:greenplum-aio-02:gpadmin-[INFO]:-******************************************************************
20190313:13:08:55:114208 gprecoverseg:greenplum-aio-02:gpadmin-[INFO]:-Updating segments for resynchronization is completed.
20190313:13:08:55:114208 gprecoverseg:greenplum-aio-02:gpadmin-[INFO]:-For segments updated successfully, resynchronization will continue in the background.
20190313:13:08:55:114208 gprecoverseg:greenplum-aio-02:gpadmin-[INFO]:-
20190313:13:08:55:114208 gprecoverseg:greenplum-aio-02:gpadmin-[INFO]:-Use gpstate -s to check the resynchronization progress.
20190313:13:08:55:114208 gprecoverseg:greenplum-aio-02:gpadmin-[INFO]:-******************************************************************

The mirror segment will be brought up to resync the data from the primary segment. However, it will be marked as "down" again shortly afterwards.
=== Wed Mar 13 13:14:21 CST 2019 ===
 dbid | content | role | preferred_role | mode | status | port  |     hostname     | address | replication_port
------+---------+------+----------------+------+--------+-------+------------------+---------+------------------
    4 |       0 | m    | m              | r    | d      | 21000 | greenplum-aio-02 | sdw1    |            23000
...
=== Wed Mar 13 13:14:22 CST 2019 ===
 dbid | content | role | preferred_role | mode | status | port  |     hostname     | address | replication_port
------+---------+------+----------------+------+--------+-------+------------------+---------+------------------
    4 |       0 | m    | m              | r    | u      | 21000 | greenplum-aio-02 | sdw1    |            23000
...
=== Wed Mar 13 13:15:11 CST 2019 ===
 dbid | content | role | preferred_role | mode | status | port  |     hostname     | address | replication_port
------+---------+------+----------------+------+--------+-------+------------------+---------+------------------
    4 |       0 | m    | m              | r    | d      | 21000 | greenplum-aio-02 | sdw1    |            23000

Running gpstate -e will show the mirror segment in a change tracking state again.
$ gpstate -e
...
20190313:13:17:33:118169 gpstate:greenplum-aio-02:gpadmin-[INFO]:-Primaries in Change Tracking
20190313:13:17:33:118169 gpstate:greenplum-aio-02:gpadmin-[INFO]:-   Current Primary   Port    Change tracking size   Mirror   Port
20190313:13:17:33:118169 gpstate:greenplum-aio-02:gpadmin-[INFO]:-   sdw1              20000   32.1 kB                sdw1     21000

1. After gprecoverseg brings up the mirror segment, the Master log shows that the primary reported FaultMirror (segmentstatus 11) to the Master. This indicates that the primary segment cannot communicate with the mirror segment (in this case, it cannot perform file replication (filerep) to the target). As a result, the Master marks the mirror's status as 'd' (down).

2019-03-13 13:15:10.288980 CST,,,p106458,th-2083625216,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: segment (dbid=2, content=0) reported fault FaultMirror segmentstatus 11 to the prober.",,,,,,,0,,,,
2019-03-13 13:15:10.294342 CST,,,p106458,th-2082683008,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=2) reported mirroring fault with mirror (dbid=4), mirror considered to be down.",,,,,,,0,,"ftsfilerep.c",371,
2019-03-13 13:15:10.294375 CST,,,p106458,th-2082683008,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=2, content=0) from ('u','p') to ('u','p')",,,,,,,0,,"fts.c",972,
2019-03-13 13:15:10.294381 CST,,,p106458,th-2082683008,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=4, content=0) from ('u','m') to ('d','m')",,,,,,,0,,"fts.c",972,

2. The primary segment log shows that resynchronization was working on specific data files and detected a checksum mismatch.

2019-03-13 13:08:53.212411 CST,,,p114574,th1114740608,,,,0,,cmd3,seg-1,,,,,"LOG","00000","resync scheduled base/16438/16389 index 22 rel storage mgr Buffer Pool(1) mirror sync state:Full Copy Needed(2) ao loss eof 0 ao new eof 0 changed page count 0 resync ckpt lsn 0/0 resync ckpt blkno 0 TID (3,187) serial num 985",,,,,,,0,,"cdbfilerepresyncmanager.c",1238,
...
2019-03-13 13:08:53.297826 CST,,,p114577,th1114740608,,,,0,,,seg-1,,,,,"WARNING","01000","page verification failed, calculated checksum 34306 but expected 23643",,,,,,,0,,"bufpage.c",142,
2019-03-13 13:08:53.297849 CST,,,p114577,th1114740608,,,,0,,,seg-1,,,,,"ERROR","XX001","invalid page in block 0 of relation base/16438/16389",,,,,,,0,,"bufmgr.c",408,

3. As a result, the primary resync worker exited and the primary reported a mirroring failure.

2019-03-13 13:08:53.298388 CST,,,p114569,th1114740608,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary resync worker #3 process' process pid '114577' exit status '0' ', mirroring role 'primary role' mirroring state 'resync' segment state 'up and running' filerep state 'not initialized' process name(pid) 'filerep main process(114569)' 'cdbfilerep.c' 'L2114' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1824,
2019-03-13 13:08:53.299765 CST,,,p114575,th1114740608,,,,0,,,seg-1,,,,,"LOG","00000","failure is detected in segment mirroring, failover requested",,,,,"mirroring role 'primary role' mirroring state 'resync' segment state 'in fault' process name(pid) 'primary resync worker #1 process(114575)' filerep state 'up and running' ",,0,,"cdbfilerepprimary.c",271,

4. This is why gprecoverseg is not able to bring up the mirror.

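
The FTS state changes quoted in step 1 can be pulled out of a Master log with a small filter. What follows is a minimal sketch, assuming the log messages use the same wording as the excerpts above; the function name fts_down_segments is hypothetical, and the log is read from the files given as arguments or from standard input.

```shell
# Print every segment that FTS moved to status 'd' (down), one per line,
# as "dbid=<n> content=<n> role=<p|m>".
# Assumption: the message text matches the excerpt above, e.g.
#   FTS: change state for segment (dbid=4, content=0) from ('u','m') to ('d','m')
fts_down_segments() {
    sed -n "s/.*FTS: change state for segment (dbid=\([0-9][0-9]*\), content=\([0-9-]*\)) from ('.','.') to ('d','\(.\)').*/dbid=\1 content=\2 role=\3/p" "$@"
}
```

Run against the Master log lines shown in step 1, this would report only the mirror (dbid=4), because the primary's state stays at ('u','p').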
gpdb-2019-03-12_000236.csv:2019-03-12 00:10:59.950156 GMT,,,p194254,th-1221691520,,,,0,,cmd19,seg-1,,,,,"ERROR","XX001","invalid page in block 857 of relation base/2230543/1992135",,,,,,,0,,"bufmgr.c",408,
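
An error line like the one above carries the database OID, the relfilenode, and the damaged block number. The sketch below collects them from one or more segment logs; the function name invalid_pages is hypothetical, and it assumes the relation path has the base/<database oid>/<relfilenode> form seen in this article.

```shell
# Print unique "<database oid> <relfilenode> <block>" triples for every
# "invalid page" error in the given log files (or on standard input).
# Assumption: paths look like base/<database oid>/<relfilenode>; relations
# stored under any other path prefix are not matched by this pattern.
invalid_pages() {
    sed -n 's|.*"invalid page in block \([0-9][0-9]*\) of relation base/\([0-9][0-9]*\)/\([0-9][0-9]*\)".*|\2 \3 \1|p' "$@" | sort -u
}
```

Because the output is de-duplicated, a block that fails on every gprecoverseg attempt is listed once.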

1. Get the name of the database that maps to OID 2230543 by running the following query on the Master:

# SELECT datname from pg_database where oid=2230543;

2. Connect to the primary segment server.

# ssh gpadmin@den-gpprod1-seg11
# PGOPTIONS='-c gp_session_role=utility' psql -p [port] [DBNAME]

3. Run the following query to find out which table maps to base/2230543/1992135.

# SELECT s.nspname||'.'||c.relname as table from pg_namespace s, pg_class c where s.oid=c.relnamespace and c.relfilenode=1992135;

4. Run the following query (in utility mode) to check whether the table can be read on the primary segment.

# copy (select * from [table_name]) to '/dev/null';

5. If the table is corrupted, we may either drop or truncate the table from the Master server and then run gprecoverseg again.
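
Steps 1 and 3 both take their numbers from the relation path in the error message. As a rough sketch, the helper below splits such a path and prints the two lookup queries with the OIDs filled in. The name lookup_queries is hypothetical, only the base/<database oid>/<relfilenode> form is handled, and the queries are printed rather than executed, since where each one runs (Master or utility-mode segment session) depends on the step.

```shell
# Print the step 1 and step 3 queries for a path such as base/2230543/1992135.
# Assumption: the path has exactly the form base/<database oid>/<relfilenode>.
lookup_queries() {
    relpath=$1
    case $relpath in
        base/*/*) ;;
        *) echo "unsupported relation path: $relpath" >&2; return 1 ;;
    esac
    rest=${relpath#base/}      # e.g. "2230543/1992135"
    db_oid=${rest%%/*}         # part before the first slash
    relfilenode=${rest#*/}     # part after the first slash
    # Reject anything that is not purely numeric (extra slashes, suffixes).
    case $db_oid$relfilenode in
        ''|*[!0-9]*) echo "unsupported relation path: $relpath" >&2; return 1 ;;
    esac
    # Step 1: run on the Master.
    printf 'SELECT datname from pg_database where oid=%s;\n' "$db_oid"
    # Step 3: run in utility mode on the primary segment, in that database.
    printf "SELECT s.nspname||'.'||c.relname as table from pg_namespace s, pg_class c where s.oid=c.relnamespace and c.relfilenode=%s;\n" "$relfilenode"
}
```

For example, lookup_queries base/2230543/1992135 prints the same two queries used in steps 1 and 3.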