gpdr restore fails with "could not resume WAL replay: ERROR: could not access status of transaction 3"

Products

VMware Tanzu Greenplum

VMware Tanzu Data Services

Issue/Introduction

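The command gpdr restore --type continuous --restore-point 20241204-091001R fails with the error "could not resume WAL replay: ERROR: could not access status of transaction 3". Debug output from the failing run: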

[gpadmin@lx00763 Broadcom]$  gpdr restore --type continuous --restore-point 20241204-091001R --debug
20241204:17:27:19 gpdr:gpadmin:lx00763:3967298-[DEBUG]:-Running command: gpdr restore --type continuous --restore-point 20241204-091001R --debug
20241204:17:27:19 gpdr:gpadmin:lx00763:3967298-[INFO]:-Restoring database cluster
20241204:17:27:19 gpdr:gpadmin:lx00763:3967298-[DEBUG]:-Checking for pgbackrest conf files
20241204:17:27:20 gpdr:gpadmin:lx00763:3967298-[DEBUG]:-source /usr/local/greenplum-db-7.3.1/greenplum_path.sh &&  pgbackrest --log-level-console warn --stanza gpdb-seg-1 --config /usr/local/gpdr/configs/pgbackrest-seg-1.conf repo-ls gpdr/restore-points --recurse --filter "(/20241204-091001R)$"
20241204:17:27:20 gpdr:gpadmin:lx00763:3967298-[DEBUG]:-source /usr/local/greenplum-db-7.3.1/greenplum_path.sh &&  pgbackrest --log-level-console warn --stanza gpdb-seg-1 --config /usr/local/gpdr/configs/pgbackrest-seg-1.conf repo-ls gpdr/restore-points --recurse --filter "(20241204-091001R)$"
20241204:17:27:20 gpdr:gpadmin:lx00763:3967298-[DEBUG]:-source /usr/local/greenplum-db-7.3.1/greenplum_path.sh &&  pgbackrest --log-level-console warn --stanza gpdb-seg-1 --config /usr/local/gpdr/configs/pgbackrest-seg-1.conf repo-ls gpdr/restore-points --sort asc --recurse
20241204:17:27:20 gpdr:gpadmin:lx00763:3967298-[DEBUG]:-setting gp_pause_on_restore_point_replay to '20241204-091001R' on all segments
20241204:17:27:20 gpdr:gpadmin:lx00763:3967298-[DEBUG]:-Reload postgresql.conf using pg_ctl
20241204:17:27:21 gpdr:gpadmin:lx00763:3967298-[ERROR]:-error occurred while restoring database cluster: could not resume WAL replay: ERROR: could not access status of transaction 3  (seg37 slice1 10.198.27.102:6001 pid=3920985) (SQLSTATE 58P01)
Please refer to /home/gpadmin/gpAdminLogs/gpdr_20241204.log file for details.
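
When the same failure has to be located across several gpdr log files, the segment details at the end of the [ERROR] line can be extracted with a short script. The following is a minimal sketch, not part of gpdr: the function name parse_segment_error is hypothetical, and the pattern is inferred only from the single error line shown above, so it may need adjusting for other gpdr versions.

```python
import re

# Matches the trailing "(segN sliceM host:port pid=P) (SQLSTATE X)" part of
# the gpdr [ERROR] line. ASSUMPTION: this layout is inferred from the one
# sample line in this article, not from a documented log format.
SEGMENT_ERROR = re.compile(
    r"\(seg(?P<seg>\d+) slice(?P<slice>\d+) "
    r"(?P<host>[\d.]+):(?P<port>\d+) pid=(?P<pid>\d+)\)\s*"
    r"\(SQLSTATE (?P<sqlstate>\w+)\)"
)


def parse_segment_error(line):
    """Return the segment details found in a gpdr error line, or None."""
    match = SEGMENT_ERROR.search(line)
    if match is None:
        return None
    details = match.groupdict()
    # Convert the numeric fields so they can be compared and sorted.
    for key in ("seg", "slice", "port", "pid"):
        details[key] = int(details[key])
    return details


if __name__ == "__main__":
    sample = (
        "20241204:17:27:21 gpdr:gpadmin:lx00763:3967298-[ERROR]:-error occurred "
        "while restoring database cluster: could not resume WAL replay: ERROR: "
        "could not access status of transaction 3  "
        "(seg37 slice1 10.198.27.102:6001 pid=3920985) (SQLSTATE 58P01)"
    )
    print(parse_segment_error(sample))
```

For the line above, the script reports segment 37 on host 10.198.27.102, port 6001, with SQLSTATE 58P01, which identifies the segment log to inspect.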

Back trace from the logs

2024-12-04 20:08:04.122580 CET,"gpadmin","postgres",p4039978,th-1573766976,"10.10.10.10","44824",2024-12-04 20:08:04 CET,0,con9943,cmd3,seg0,,,,sx1,"LOG","00000","statement: SET application_name TO 'gpdr'"
0,,"postgres.c",1729,
2024-12-04 20:08:04.136739 CET,"gpadmin","postgres",p4039978,th-1573766976,"10.10.10.10","44824",2024-12-04 20:08:04 CET,0,con9943,cmd5,seg0,,,,sx1,"LOG","00000","statement: SET gp_hot_standby_snapshot_mod
nconsistent'",,,,,,,0,,"postgres.c",1729,
2024-12-04 20:08:04.145005 CET,"gpadmin","postgres",p4039978,th-1573766976,"10.10.10.10","44824",2024-12-04 20:08:04 CET,0,con9943,cmd8,seg0,slice1,,,sx1,"LOG","00000","statement: SELECT pg_wal_replay_resu
FROM gp_dist_random('gp_id')
UNION ALL
SELECT pg_wal_replay_resume();",,,,,,"SELECT pg_wal_replay_resume()
FROM gp_dist_random('gp_id')
UNION ALL
SELECT pg_wal_replay_resume();",0,,"postgres.c",1288,
2024-12-04 20:08:04.220129 CET,"gpadmin","postgres",p4039978,th-1573766976,"10.10.10.10","44824",2024-12-04 20:08:04 CET,0,con9943,cmd8,seg0,slice1,,,sx1,"ERROR","58P01","could not access status of transaction 3"
,"Could not open file ""pg_distributedlog/0000"": No such file or directory.",,,,,"SELECT pg_wal_replay_resume()
FROM gp_dist_random('gp_id')
UNION ALL
SELECT pg_wal_replay_resume();",0,,"slru.c",939,"Stack trace:
1    0xd01646 postgres errstart (elog.c:494)
2    0x6c7675 postgres <symbol not found> (slru.c:939)
3    0x806fc2 postgres SimpleLruReadPage (slru.c:460)
4    0x82bf10 postgres DistributedLog_AdvanceOldestXmin (distributedlog.c:251)
5    0xb6c10d postgres GetSnapshotData (procarray.c:2651)
6    0xd5f5c8 postgres GetTransactionSnapshot (snapmgr.c:439)
7    0xb97fed postgres PortalStart (pquery.c:638)
8    0xb9164d postgres <symbol not found> (discriminator 10)
9    0xb959fb postgres PostgresMain (postgres.c:5583)
10   0xafe118 postgres <symbol not found> (postmaster.c:4605)
11   0xaff01e postgres PostmasterMain (discriminator 5)
12   0x76eba3 postgres main (main.c:173)
13   0x7f35a2829590 libc.so.6 <symbol not found> + 0xa2829590
14   0x7f35a2829640 libc.so.6 __libc_start_main + 0x80
15   0x77aa25 postgres _start + 0x25

Environment

Greenplum_6.29.0

Greenplum_7.4.0

Cause

GPDR psql sessions that set gp_hot_standby_snapshot_mode to 'inconsistent' try to read the distributed log using an invalid xid, and then fail because no distributed log file exists for that xid. This is a known issue.
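
The reason the error names the file pg_distributedlog/0000 can be illustrated with a rough sketch of how an SLRU-style log maps a transaction id to a segment file. This is only an illustration under stated assumptions: the 32 pages per segment and the four-hex-digit file name follow upstream PostgreSQL SLRU conventions and are assumed to apply unchanged to Greenplum, the entries-per-page value is a placeholder rather than the real Greenplum figure, and the function name distributedlog_file is hypothetical.

```python
# Upstream PostgreSQL SLRU convention: 32 pages per segment file.
# ASSUMPTION: Greenplum keeps this value for pg_distributedlog.
SLRU_PAGES_PER_SEGMENT = 32

# ASSUMPTION: placeholder only. The real number of distributed log entries
# per page depends on the block size and entry size of the Greenplum build.
ASSUMED_ENTRIES_PER_PAGE = 4096


def distributedlog_file(xid, entries_per_page=ASSUMED_ENTRIES_PER_PAGE):
    """Return the segment file that would hold the entry for this xid."""
    page = xid // entries_per_page
    segment = page // SLRU_PAGES_PER_SEGMENT
    # SLRU segment files are named with four uppercase hex digits.
    return "pg_distributedlog/%04X" % segment


if __name__ == "__main__":
    print(distributedlog_file(3))
```

Whatever the real entries-per-page value is, a very low xid such as 3 falls on the first page of the first segment, so the lookup targets file 0000. If that file does not exist on the segment, the read fails with "No such file or directory", as seen in the backtrace.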

Resolution

The fix is currently in production and will be released in early 2025. Look for incident number 33682 in future GPDB release notes.