Incremental recovery is removing default gpbackup backup directory when pg_rewind is invoked in GPDB 6.x

Article ID: 296734

Products

VMware Tanzu Greenplum

Issue/Introduction

Incremental recovery sometimes requires pg_rewind to be run. This is especially true when the WAL timeline diverges between the primary and the mirror, which can happen when a primary fails over because of a segment PANIC.

By design, pg_rewind deletes any files (except for logs and a few control files) that exist in the data directory of the segment being recovered (the former primary) but not in the data directory of the acting primary (the former mirror).

By default, gpbackup places its backup data directory (backups) in the primary's segment data directory.

For example:

[gpadmin@sdw2-lab1 ~]$ ls -l /data/primary/gp_6.18.2_202112121834_kevin_seg2
total 104
drwxrwxr-x 5 gpadmin gpadmin    54 Jan  4 17:49 backups
drwx------ 9 gpadmin gpadmin    97 Dec 23 22:39 base
-rw------- 1 gpadmin gpadmin 32768 Jan  4 21:32 fts_probe_file.bak
drwx------ 2 gpadmin gpadmin  4096 Dec 29 20:06 global
-rw------- 1 gpadmin gpadmin    10 Dec 12 18:35 internal.auto.conf
drwx------ 2 gpadmin gpadmin    18 Dec 12 18:35 pg_clog
drwx------ 2 gpadmin gpadmin    18 Dec 12 18:35 pg_distributedlog
drwx------ 2 gpadmin gpadmin     6 Dec 12 18:35 pg_dynshmem
...
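
To find out which segment data directories currently hold a default backups directory, a check along the following lines may help. This is a minimal sketch, not part of gpbackup: the function name find_default_backups is hypothetical, and it assumes you run it on a segment host and pass that host's primary data directories as arguments.

```shell
# Sketch: print the path of every "backups" directory that sits directly
# inside one of the given segment data directories.
# find_default_backups is a hypothetical helper, not a Greenplum utility.
find_default_backups() {
    for datadir in "$@"; do
        # gpbackup's default location is <segment data directory>/backups
        if [ -d "$datadir/backups" ]; then
            printf '%s\n' "$datadir/backups"
        fi
    done
}
```

For example: find_default_backups /data/primary/*
Any path it prints is a backup location that pg_rewind could remove during incremental recovery.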


This backups directory does not get replicated to the mirror directory:

[gpadmin@sdw1-lab1 ~]$ ls -l /data/mirror/gp_6.18.2_202112121834_######_seg2
total 76
-rw------- 1 gpadmin gpadmin   206 Dec 12 18:35 backup_label.old
drwx------ 8 gpadmin gpadmin    80 Dec 23 22:39 base
drwx------ 2 gpadmin gpadmin  4096 Dec 29 20:03 global
-rw-rw-r-- 1 gpadmin gpadmin    10 Dec 12 18:35 internal.auto.conf
drwx------ 2 gpadmin gpadmin    18 Dec 12 18:35 pg_clog
drwx------ 2 gpadmin gpadmin    18 Dec 12 18:35 pg_distributedlog
drwx------ 2 gpadmin gpadmin     6 Dec 12 18:35 pg_dynshmem
-rw------- 1 gpadmin gpadmin  4708 Dec 12 18:35 pg_hba.conf
-rw------- 1 gpadmin gpadmin  1636 Dec 12 18:35 pg_ident.conf
drwx------ 2 gpadmin gpadmin  4096 Jan  4 00:00 pg_log


If a primary segment PANICs, or some other event causes the WAL timelines to diverge, the primary fails over to the mirror.

An incremental recovery afterwards requires pg_rewind to resynchronize the primary and the mirror. You can tell whether pg_rewind was invoked from the following lines in the gprecoverseg output:


No pg_rewind

20220104:21:42:11:024448 gprecoverseg:mdw-lab1:gpadmin-[INFO]:-Running pg_rewind on failed segments
sdw2-lab1 (dbid 4): no rewind required


pg_rewind required

20220104:21:54:12:027197 gprecoverseg:mdw-lab1:gpadmin-[INFO]:-Running pg_rewind on failed segments
sdw2-lab1 (dbid 4):  745723/1736705 kB (42%) copied
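
The check above can be scripted. The following is a minimal sketch, assuming the gprecoverseg output contains only the two line shapes shown above; the function name rewind_status is hypothetical and not part of any Greenplum utility. It reads the output on standard input and reports, per dbid, whether pg_rewind copied data.

```shell
# Sketch: classify the pg_rewind status lines in gprecoverseg output.
# Assumes lines of the form
#   "<host> (dbid N): no rewind required"
#   "<host> (dbid N):  X/Y kB (P%) copied"
# Lines in any other shape are ignored.
rewind_status() {
    awk '
        match($0, /\(dbid [0-9]+\):/) {
            # Strip the surrounding "(", ")" and ":" to keep "dbid N".
            id = substr($0, RSTART + 1, RLENGTH - 3)
            if ($0 ~ /no rewind required/)
                status[id] = "no pg_rewind"
            else if ($0 ~ /kB \([0-9]+%\) copied/)
                status[id] = "pg_rewind ran"
        }
        END { for (id in status) print id ": " status[id] }
    ' | sort
}
```

For example, if the gprecoverseg output was saved to a file (the file name here is illustrative): rewind_status < gprecoverseg_output.txt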


When pg_rewind runs, it deletes the backups directory from the original primary's data directory:

[gpadmin@sdw2-lab1 ~]$ ls -l /data/primary/gp_6.18.2_202112121834_kevin_seg2
total 112
-rw------- 1 gpadmin gpadmin   175 Jan  4 21:54 backup_label.old
drwx------ 8 gpadmin gpadmin    80 Jan  4 21:54 base
-rw------- 1 gpadmin gpadmin 32768 Jan  4 21:54 fts_probe_file.bak
drwx------ 2 gpadmin gpadmin  4096 Jan  4 21:54 global
-rw------- 1 gpadmin gpadmin    10 Dec 12 18:35 internal.auto.conf


This can lead to two potential problems:

1. Any previous gpbackup that used the default backup directory can no longer be restored, because that segment's backup files are missing.

2. Incremental recovery can fail if pg_rewind is unable to remove the backup directory, for example because of a permissions issue. gprecoverseg then instructs you to run a full recovery:

gprecoverseg log

20211216:11:29:41:702088 gprecoverseg:mdw-lab1:gpadmin-[WARNING]:-
20211216:11:29:41:702088 gprecoverseg:mdw-lab1:gpadmin-[WARNING]:-Incremental recovery failed for dbid 4. You must use gprecoverseg -F to recover the segment.


pg_rewind log

servers diverged at WAL position 848/9F60F628 on timeline 1
rewinding from last common checkpoint at 848/89EC5590 on timeline 1
reading source file list
reading target file list
reading WAL in target
need to copy 9770 MB (total source directory size is 2058875 MB)
could not remove directory "/data/primary/gp_6.18.2_202112121834_kevin_seg2/backups/20220104/20220104174918": Directory not empty
Failure, exiting
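
To spot this failure quickly, the blocking directory can be pulled out of the pg_rewind log. The following is a minimal sketch that assumes the error line has exactly the shape shown above; the function name rewind_blocked_dirs is hypothetical. It reads the log on standard input and prints each directory that pg_rewind could not remove.

```shell
# Sketch: extract the paths from pg_rewind error lines of the form
#   could not remove directory "<path>": <reason>
# rewind_blocked_dirs is a hypothetical helper, not a Greenplum utility.
rewind_blocked_dirs() {
    sed -n 's/^could not remove directory "\(.*\)": .*$/\1/p'
}
```

If the printed path lies under a segment's backups directory, the failure matches the problem described in this article.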


Note: This only affects incremental recovery. Full recovery will not have this problem.

Environment

Product Version: 6.18

Resolution

Fix

This is fixed in GPDB 6.19.2 and above.
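
To decide whether a given cluster still needs the workaround, the version threshold can be checked in a script. This is a minimal sketch: the function name has_rewind_fix is hypothetical, the version string must be supplied by you in major.minor.patch form, and treating major versions above 6 as fixed is an assumption based on the "and above" wording.

```shell
# Sketch: succeed (exit status 0) if the given version is 6.19.2 or later.
# has_rewind_fix is a hypothetical helper; pass a version such as 6.18.2.
# Assumption: major versions above 6 are treated as containing the fix.
has_rewind_fix() {
    printf '%s\n' "$1" | awk -F. '{
        major = $1 + 0; minor = $2 + 0; patch = $3 + 0
        ok = (major > 6) || (major == 6 && (minor > 19 || (minor == 19 && patch >= 2)))
        exit (ok ? 0 : 1)
    }'
}
```

For example: has_rewind_fix 6.18.2 || echo "use --backup-dir outside the segment data directory"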

Workaround for versions below 6.19.2

On versions below 6.19.2, we recommend not using the default backup location with gpbackup until you have upgraded to a version of Tanzu Greenplum that contains the fix.

Use gpbackup with the --backup-dir flag to point the backup to a location outside of the segment data directory. This will address both of the problems.

For example:
[gpadmin@mdw-lab1 ~]$ gpbackup --dbname gpadmin --backup-dir /home/gpadmin/backups

[gpadmin@sdw1-lab1 backups]$ ls -l /home/gpadmin/backups/
total 0
drwxrwxr-x 3 gpadmin gpadmin 21 Jan  4 22:20 gp_6.18.2_202112121834_kevin_seg0
drwxrwxr-x 3 gpadmin gpadmin 21 Jan  4 22:20 gp_6.18.2_202112121834_kevin_seg1
drwxrwxr-x 3 gpadmin gpadmin 21 Jan  4 22:20 gp_6.18.2_202112121834_kevin_seg2
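
Before adopting a --backup-dir value, it can be worth confirming that the chosen path really lies outside every segment data directory. The following is a minimal sketch of such a check; the function name is_inside_datadir is hypothetical, and the comparison works on path strings only, so it assumes absolute paths and does not resolve symbolic links.

```shell
# Sketch: succeed (exit status 0) if the backup directory $1 is the segment
# data directory $2 itself or lies beneath it. String comparison only.
# is_inside_datadir is a hypothetical helper, not a Greenplum utility.
is_inside_datadir() {
    # Normalize both paths to end in exactly one "/" so that
    # /data/primary/seg2 does not match /data/primary/seg22.
    case "${1%/}/" in
        "${2%/}/"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: is_inside_datadir /home/gpadmin/backups /data/primary/gp_6.18.2_202112121834_kevin_seg2 && echo "choose a different backup directory"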