
gprecoverseg utility couldn't synchronize down segments: ERROR XX000 "could not receive data from WAL stream: ERROR: requested WAL segment 00000003000000D60000000F has already been removed"


Article ID: 296643


Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Incremental recovery using gprecoverseg is run to synchronize all segments on one host. The command fails to synchronize some or all of the segments.
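For reference, the sequence below is a minimal sketch of how such an incremental recovery is typically started from the Greenplum master host; the exact flags and output vary by version and environment.

# Check which segments are marked down and need recovery
$ gpstate -e

# Attempt an incremental recovery of the down segments (the step that fails here)
$ gprecoverseg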

Segment log file:
2021-03-05 11:20:29.271175 CET,,,p47885,th-196609920,,,,0,,,seg18,,,,,"LOG","00000","started streaming WAL from primary at D6/3C000000 on timeline 3",,,,,,,0,,"walreceiver.c",384,
2021-03-05 11:20:29.565898 CET,,,p47885,th-196609920,,,,0,,,seg18,,,,,"ERROR","XX000","could not receive data from WAL stream: ERROR: requested WAL segment 00000003000000D60000000F has already been removed
",,,,,,,0,,"libpqwalreceiver.c",555,"Stack trace:
1 0xbef7fc postgres errstart (elog.c:557)
2 0xa45ba1 postgres <symbol not found> (libpqwalreceiver.c:559)
3 0xa3a17c postgres WalReceiverMain (walreceiver.c:435)
4 0x787a0a postgres AuxiliaryProcessMain (bootstrap.c:438)
5 0xa0c42c postgres <symbol not found> (postmaster.c:5837)
6 0xa0e46f postgres <symbol not found> (postmaster.c:2138)
7 0x7f6cf1a85630 libpthread.so.0 <symbol not found> + 0xf1a85630
8 0x7f6cf0efe983 libc.so.6 __select + 0x13
9 0x6b02f8 postgres <symbol not found> (postmaster.c:1894)
10 0xa0fc72 postgres PostmasterMain (postmaster.c:1523)
11 0x6b4cf1 postgres main (main.c:205)
12 0x7f6cf0e2b555 libc.so.6 __libc_start_main + 0xf5
13 0x6c098c postgres <symbol not found> + 0x6c098c


Environment

Product Version: 6.12

Resolution

Workaround

A full recovery (gprecoverseg -F) is required to recover the segments.
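A minimal sketch of the workaround is shown below, assuming the commands are run as gpadmin on the master host; verify the options against gprecoverseg --help for your version.

# Perform a full recovery, copying the segment data files from the acting primary
$ gprecoverseg -F

# Confirm that all segments are up and synchronized
$ gpstate -e

# Once synchronization completes, optionally rebalance segments to their preferred roles
$ gprecoverseg -r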


Fix

This is fixed in Greenplum 6.16.1. Please upgrade to pick up the fix.


Root Cause

gprecoverseg launches pg_rewind, which in turn runs postgres --single. The single-user-mode postgres creates a shutdown checkpoint when it exits, and that checkpoint recycles xlog files according to wal_keep_segments. pg_rewind then continues and looks for the divergence record, which is older than the checkpoint that was just created. To locate it, pg_rewind needs the previous checkpoint, but the xlog file containing that previous checkpoint has already been recycled.
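For context only, the wal_keep_segments setting referenced above can be inspected with gpconfig; this is shown to illustrate the parameter involved and is not a fix for the defect itself.

# Show the current wal_keep_segments value across the cluster
$ gpconfig -s wal_keep_segments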

This can happen when a segment host is rebooted while the database is running. The segments that were acting as primaries on that host will fail when an incremental recovery is attempted, while the segments that were acting as mirrors will most likely recover successfully with an incremental recovery.