Mirror Segment Crash: requested WAL segment has already been removed

Article ID: 296608


Products

VMware Tanzu Greenplum, VMware Tanzu Greenplum / Gemfire, VMware Tanzu Data Suite

Issue/Introduction

Under the conditions below, a segment's mirror may go down, and an error stating that "requested WAL segment XXXXX has already been removed" will be seen in the mirror segment's log:

  • The customer has configured the GUC "max_slot_wal_keep_size"
  • Segment mirror(s) have gone down or are repeatedly going down. The mirror logs will show a stack trace similar to the example below:
2021-04-20 10:16:34.174431 EDT,,,p26701,th843634816,,,,0,,,seg34,,,,,"LOG","00000","started streaming WAL from primary at 17C0/F4000000 on timeline 1",,,,,,,0,,"walreceiver.c",384,
2021-04-20 10:16:34.507388 EDT,,,p26701,th843634816,,,,0,,,seg34,,,,,"ERROR","XX000","could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000017C00000003D has already been removed
",,,,,,,0,,"libpqwalreceiver.c",555,"Stack trace:
1    0xbf1fec postgres errstart (elog.c:557)
2    0xa48151 postgres <symbol not found> (libpqwalreceiver.c:559)
3    0xa3c72c postgres WalReceiverMain (walreceiver.c:435)
4    0x7898fa postgres AuxiliaryProcessMain (bootstrap.c:438)
5    0xa0e9dc postgres <symbol not found> (postmaster.c:5837)
6    0xa10a1f postgres <symbol not found> (postmaster.c:2138)
7    0x7f0c2fa96630 libpthread.so.0 <symbol not found> + 0x2fa96630
8    0x7f0c2ef0f983 libc.so.6 __select + 0x13
9    0x6b29b8 postgres <symbol not found> (postmaster.c:1894)
10   0xa12222 postgres PostmasterMain (postmaster.c:1523)
11   0x6b73b1 postgres main (main.c:205)
12   0x7f0c2ee3c555 libc.so.6 __libc_start_main + 0xf5
13   0x6c30cc postgres <symbol not found> + 0x6c30cc

Steps to confirm:

1. Check the cluster's setting for "max_slot_wal_keep_size"; it is unlimited (-1) by default:

gpconfig -s max_slot_wal_keep_size
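
If the GUC has been changed from the default, "gpconfig -s" reports the value configured on the master and segments. The output below is a sketch only; the hostname and the 10240 (MB) value are illustrative, and -1 indicates no limit:

[gpadmin@mdw ~]$ gpconfig -s max_slot_wal_keep_size
Values on all segments are consistent
GUC          : max_slot_wal_keep_size
Master  value: 10240
Segment value: 10240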

2. Confirm whether the WAL directory on the primary paired with the crashed mirror has exceeded this WAL keep size:

du -sh <segment_data_directory>/pg_xlog   # For Greenplum 6.x
du -sh <segment_data_directory>/pg_wal    # For Greenplum 7.x

Example:

[gpadmin@sdw3 ~]$ du -sh /data/primary/gpseg13/pg_xlog
100.3G  /data/primary/gpseg13/pg_xlog
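
If the data directory of the primary paired with a down mirror is not known, it can be looked up in the gp_segment_configuration catalog. This is a sketch; the content ID 13 matches the example above and is illustrative:

psql -d postgres -c "SELECT dbid, content, role, status, hostname, datadir FROM gp_segment_configuration WHERE content = 13;"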

 

Environment

Cause

Possible reasons for this accumulation of WAL files are:

  • If segments are encountering this issue randomly:
    • Network congestion / slow transfer of WAL files from primary to mirror
    • Disk contention on primary or mirror
  • If a certain segment or set of segments are encountering this issue repeatedly:
    • Data skew for one or more tables
    • Possible hardware failure
  • If 'max_slot_wal_keep_size' is not set, but this stack trace is seen on the mirror segment:
    • Possible hardware failure on the primary or mirror. Confirm whether the WAL file is present and accessible on the file system
    • Possible file deletion from the pg_xlog directory on the primary or mirror
  • If the mirror has been down for a long period, the backlog of WAL files on the primary can become very large. Mirrors should be recovered as soon as possible to avoid this buildup (the status checks below can help identify down mirrors).
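
To identify down mirrors, the standard status checks below can be used. This is a sketch; the exact output varies between Greenplum versions:

gpstate -e   # report segments with potential issues
gpstate -m   # report mirror status and sync state
psql -d postgres -c "SELECT dbid, content, hostname, datadir FROM gp_segment_configuration WHERE status = 'd';"   # list down segments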

 

Resolution

The mirror can be recovered with 'gprecoverseg -F' when this issue is encountered.
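
A typical recovery sequence is sketched below; 'gprecoverseg -F' performs a full copy from the acting primary to the mirror, and the final rebalance step is only needed if segments are left running in their non-preferred roles:

gprecoverseg -F   # full recovery of the down mirror(s)
gpstate -m        # monitor until the mirrors report synchronized
gprecoverseg -r   # optional: rebalance segments to their preferred roles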

However, the issue may occur again if the root cause is not found and resolved.

Additional Information

The "max_slot_wal_keep_size" GUC was introduced in GPDB 6.10 to assist in scenarios where the transaction logs (WAL) are not transferring to the mirrors quickly enough and are accumulating on the primary segments. This can lead to a situation where pg_xlog is taking up more space than expected and can fill disks in some scenarios.  

To protect the cluster from this excess space usage, "max_slot_wal_keep_size" gives customers the option to place a limit on the amount of WAL data retained. When this limit is reached, the primary will delete WAL files as necessary to stay within it. If a deleted WAL segment had not yet been replicated, the mirror will fail with the error described above (see the max_slot_wal_keep_size documentation for further details on the GUC).
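
As a sketch, the GUC can be set cluster-wide with gpconfig and applied with a configuration reload. The 10240 value (the GUC is specified in megabytes) is illustrative only; an appropriate limit depends on the disk space available for pg_xlog/pg_wal on the segment hosts:

gpconfig -c max_slot_wal_keep_size -v 10240
gpstop -u   # reload the configuration without restarting the cluster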