Incremental recovery (gprecoverseg) fails with the error "online backup was canceled, recovery cannot continue"
Article ID: 296799

Products

VMware Tanzu Greenplum

Issue/Introduction

When trying to recover a down mirror instance, the mirror might not come back online.
The logs of the mirror instance report the error "online backup was canceled, recovery cannot continue" along with a PANIC in the startup process:
2023-02-12 06:55:17.533926 EST,,,p389191,th1845667968,,,,0,,,seg44,,,,,"PANIC","XX000","online backup was canceled, recovery cannot continue",,,,,"xlog redo checkpoint: redo 1A781/EEF8C5E0; tli 5; prev tli 5; fpw true; xid 0/310554423; oid 412055539; relfilenode 140739262; multi 1; offset 0; oldest xid 703 in DB 1; oldest multi 1 in DB 16384; oldest running xid 0; shutdown, checkpoint record data length = 92, DTX committed count 0, DTX data length 4, prepared transaction agg state count = 0",,0,,"xlog.c",10097,"Stack trace:
1    0xbf748c postgres errstart (elog.c:557)
2    0x74655d postgres xlog_redo (xlog.c:10096)
3    0x74b178 postgres StartupXLOG (xlog.c:7314)
4    0xa16513 postgres StartupProcessMain (startup.c:248)
5    0x78887f postgres AuxiliaryProcessMain (bootstrap.c:453)
6    0xa124cc postgres <symbol not found> (postmaster.c:5885)
7    0xa15c5e postgres PostmasterMain (postmaster.c:1519)
8    0x6b4e91 postgres main (main.c:205)
9    0x7f6d6a9d1555 libc.so.6 __libc_start_main + 0xf5
10   0x6c0c3c postgres <symbol not found> + 0x6c0c3c 

2023-02-13 04:33:27.684227 EST,,,p155941,th-251901824,,,,0,,,seg44,,,,,"LOG","00000","startup process (PID 156313) was terminated by signal 6: Aborted",,,,,,,0,,"postmaster.c",4018,
2023-02-13 04:31:50.890779 EST,"gpadmin","template1",p26068,th-251901824,"10.10.1.112","48964",2023-02-13 04:31:50 EST,0,,,seg44,,,,,"FATAL","57P03","the database system is starting up","last replayed record at 1A77D/E037D100",,,,,,0,,"postmaster.c",2576,
2023-02-13 04:31:50.906336 EST,,,p26070,th-251901824,"10.10.1.112","48966",2023-02-13 04:31:50 EST,0,,,seg44,,,,,"LOG","00000","connection received: host=10.10.1.112 port=48966",,,,,,,0,,"postmaster.c",4698,
2023-02-13 04:31:50.906520 EST,"gpadmin","postgres",p26070,th-251901824,"10.10.1.112","48966",2023-02-13 04:31:50 EST,0,,,seg44,,,,,"FATAL","57P03","the database system is starting up","last replayed record at 1A77D/E037D100",,,,,,0,,"postmaster.c",2576,
2023-02-13 04:33:22.510882 EST,,,p156313,th-251901824,,,,0,,,seg44,,,,,"PANIC","XX000","online backup was canceled, recovery cannot continue",,,,,"xlog redo checkpoint: redo 1A781/EEF8C5E0; tli 5; prev tli 5; fpw true; xid 0/310554423; oid 412055539; relfilenode 140739262; multi 1; offset 0; oldest xid 703 in DB 1; oldest multi 1 in DB 16384; oldest running xid 0; shutdown, checkpoint record data length = 92, DTX committed count 0, DTX data length 4, prepared transaction agg state count = 0",,0,,"xlog.c",10097,"Stack trace:2023-02-13 04:33:27.684274 EST,,,p155941,th-251901824,,,,0,,,seg44,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",3735,
The issue recurs each time an incremental recovery (gprecoverseg) is run, and it always fails at the same xlog location (see "xlog redo checkpoint: redo 1A781/EEF8C5E0" in the logs):
# grep 'online backup was canceled, recovery cannot continue' gpdb-2023-02-1*
gpdb-2023-02-12_000000.csv:2023-02-12 06:55:17.533926 EST,,,p389191,th1845667968,,,,0,,,seg44,,,,,"PANIC","XX000","online backup was canceled, recovery cannot continue",,,,,"xlog redo checkpoint: redo 1A781/EEF8C5E0; tli 5; prev tli 5; fpw true; xid 0/310554423; oid 412055539; relfilenode 140739262; multi 1; offset 0; oldest xid 703 in DB 1; oldest multi 1 in DB 16384; oldest running xid 0; shutdown, checkpoint record data length = 92, DTX committed count 0, DTX data length 4, prepared transaction agg state count = 0",,0,,"xlog.c",10097,"Stack trace:
gpdb-2023-02-13_000000.csv:2023-02-13 04:33:22.510882 EST,,,p156313,th-251901824,,,,0,,,seg44,,,,,"PANIC","XX000","online backup was canceled, recovery cannot continue",,,,,"xlog redo checkpoint: redo 1A781/EEF8C5E0; tli 5; prev tli 5; fpw true; xid 0/310554423; oid 412055539; relfilenode 140739262; multi 1; offset 0; oldest xid 703 in DB 1; oldest multi 1 in DB 16384; oldest running xid 0; shutdown, checkpoint record data length = 92, DTX committed count 0, DTX data length 4, prepared transaction agg state count = 0",,0,,"xlog.c",10097,"Stack trace:
Earlier in the logs, before recovery was attempted for the first time, another error was reported: "database system was interrupted while in recovery at log time 2023-02-08 02:30:04 EST"
2023-02-12 20:57:00.528638 EST,,,p156313,th-251901824,,,,0,,,seg44,,,,,"LOG","00000","database system was interrupted while in recovery at log time 2023-02-08 02:30:04 EST",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,0,,"xlog.c",6442,


Environment

Product Version: 6.20

Resolution

This is a known issue in Greenplum; it has been fixed in 6.23.0 and later.
See the Release Notes for more information:
Cluster Management
32190: Resolved an issue where gprecoverseg failed to detect that an instance of pg_basebackup was already running, which led to corruption of the data directory and a PANIC error.
The workaround for this issue is to perform a full recovery of the mirror instance:
# gprecoverseg -F
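Because the underlying defect is gprecoverseg not detecting an already-running pg_basebackup, it can also be worth confirming on the mirror host that no stale pg_basebackup process is still active before retrying recovery. The following is a sketch using standard commands, not an official KB step; the data directory path is illustrative only:

```shell
# On the mirror segment host, look for a leftover pg_basebackup process.
# The [p] bracket trick keeps grep from matching its own command line.
ps -ef | grep '[p]g_basebackup'

# If a stale pg_basebackup is found, terminate it (use its PID from the
# output above) before rerunning the full recovery from the coordinator:
#   kill <pid>
#   gprecoverseg -F
```

If the command prints nothing, no pg_basebackup is running and the full recovery can be retried directly.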