Incremental gprecoverseg fails due to PANIC: could not fsync file

Article ID: 296767


Products

VMware Tanzu Greenplum

Issue/Introduction

You run an incremental recovery with gprecoverseg after a segment goes down. The recovery process fails without an obvious cause, and gpstate reports that the segment is still marked down.
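To confirm which segments are down, you can run gpstate from the master host. The flags below are standard gpstate options; the exact output depends on your cluster:

$ gpstate -e    # list segments with configuration or status issues
$ gpstate -m    # show the status of each mirror segment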

Checking the segment mirror log shows the following:
2021-09-29 09:30:03.964735 UTC,,,p23780,th1361979520,,,,0,,,seg4,,,,,"PANIC","58P01","could not fsync file ""base/16384/3936047.1"" (is_ao: 1): No such file or directory",,,,,,,0,,"md.c",1321,"Stack trace:
1    0xbf29ec postgres errstart (elog.c:557)
2    0xa83207 postgres mdsync (md.c:1318)
3    0xa52f38 postgres CheckPointBuffers (bufmgr.c:2008)
4    0x7475b7 postgres CreateRestartPoint (xlog.c:9184)
5    0xa03a30 postgres CheckpointerMain (checkpointer.c:527)
6    0x78a42e postgres AuxiliaryProcessMain (bootstrap.c:443)
7    0xa0f5b5 postgres <symbol not found> (postmaster.c:5837)
8    0xa1165c postgres <symbol not found> (postmaster.c:5502)
9    0x7fba4e8eb630 libpthread.so.0 <symbol not found> + 0x4e8eb630
10   0x7fba4dd64a13 libc.so.6 __select + 0x13
11   0x6b2ba8 postgres <symbol not found> (postmaster.c:1894)
12   0xa12c22 postgres PostmasterMain (postmaster.c:1523)
13   0x6b75a1 postgres main (main.c:205)
14   0x7fba4dc91555 libc.so.6 __libc_start_main + 0xf5
15   0x6c32bc postgres <symbol not found> + 0x6c32bc
"
If you look at the file system, you will see that the file exists on the mirror but not on the primary.
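One way to verify this is to list the file on both hosts. The hostnames and data directories below are placeholders for illustration; the relative path comes from the PANIC message above:

$ ssh mirror_host "ls -l /data/mirror/gpseg4/base/16384/3936047.1"     # present on the mirror
$ ssh primary_host "ls -l /data/primary/gpseg4/base/16384/3936047.1"   # missing on the primary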

Environment

Product Version: 6.18

Resolution

For an existing append-optimized (AO) table, if the table is truncated and data is inserted into it after the most recent checkpoint, and a failover to the mirror then occurs, incremental recovery can run into this issue.

During an incremental recovery, if pg_rewind needs to run, it deletes all files related to the old relfilenode of the AO table, since the truncate had already removed them from the file system. However, when the older pg_xlog is replayed on the mirror, it still contains references to the pre-truncate relfilenode, which results in the fsync error on a relfilenode that no longer exists.
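The key detail is that TRUNCATE assigns the table a new relfilenode and removes the files belonging to the old one. A minimal illustration of this behavior (the database and table names are hypothetical):

$ psql -d testdb -c "CREATE TABLE ao_demo (id int) WITH (appendonly=true);"
$ psql -d testdb -c "SELECT relfilenode FROM pg_class WHERE relname = 'ao_demo';"
$ psql -d testdb -c "TRUNCATE ao_demo;"
$ psql -d testdb -c "SELECT relfilenode FROM pg_class WHERE relname = 'ao_demo';"   # returns a new value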

This problem is fixed in GPDB 6.18.1.
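If you are unsure which version a cluster is running, you can check it from psql (the database name here is a placeholder):

$ psql -d postgres -c "SELECT version();"   # the output includes the Greenplum Database version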

If you are on a version below 6.18.1, you can use the following workaround.
*Note* - Before proceeding, confirm that the relfile mentioned in the error either has no data (0 byte size) or is missing from both the primary and the mirror. If a relfile exists with a size greater than 0 bytes, confirm that your scenario matches this KB article and take a backup of the primary and mirror relfiles before proceeding.

1.) Touch the file on the primary segment (SEGMENT_DATA_DIRECTORY/base/16384/3936047.1) and rerun incremental recovery. However, it is often unclear how many files are actually affected; if more than one or two files are affected, option 2 is the better choice. (Example commands for both options follow this list.)
2.) Run a full recovery (gprecoverseg -F).
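Example commands for both options. The segment host and data directory are placeholders; substitute the relfilenode path from your own error message:

# Option 1: create the missing file on the primary, then rerun incremental recovery
$ ssh primary_host "touch /data/primary/gpseg4/base/16384/3936047.1"
$ gprecoverseg

# Option 2: run a full recovery instead
$ gprecoverseg -F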