Under the conditions described below, a segment's mirror may go down. When this happens, the mirror segment's log shows the error "requested WAL segment XXXXX has already been removed":
2021-04-20 10:16:34.174431 EDT,,,p26701,th843634816,,,,0,,,seg34,,,,,"LOG","00000","started streaming WAL from primary at 17C0/F4000000 on timeline 1",,,,,,,0,,"walreceiver.c",384,
2021-04-20 10:16:34.507388 EDT,,,p26701,th843634816,,,,0,,,seg34,,,,,"ERROR","XX000","could not receive data from WAL stream: ERROR: requested WAL segment 00000001000017C00000003D has already been removed ",,,,,,,0,,"libpqwalreceiver.c",555,"Stack trace:
1 0xbf1fec postgres errstart (elog.c:557)
2 0xa48151 postgres <symbol not found> (libpqwalreceiver.c:559)
3 0xa3c72c postgres WalReceiverMain (walreceiver.c:435)
4 0x7898fa postgres AuxiliaryProcessMain (bootstrap.c:438)
5 0xa0e9dc postgres <symbol not found> (postmaster.c:5837)
6 0xa10a1f postgres <symbol not found> (postmaster.c:2138)
7 0x7f0c2fa96630 libpthread.so.0 <symbol not found> + 0x2fa96630
8 0x7f0c2ef0f983 libc.so.6 __select + 0x13
9 0x6b29b8 postgres <symbol not found> (postmaster.c:1894)
10 0xa12222 postgres PostmasterMain (postmaster.c:1523)
11 0x6b73b1 postgres main (main.c:205)
12 0x7f0c2ee3c555 libc.so.6 __libc_start_main + 0xf5
13 0x6c30cc postgres <symbol not found> + 0x6c30cc
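The segment name in the error can be related back to the streaming start position in the preceding LOG line. The following is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes 64 MB WAL segments (believed to be the Greenplum default; adjust WAL_SEG_MB if the cluster was initialized differently).

```shell
# ASSUMPTION: WAL segment size in MB (64 is believed to be the Greenplum default).
WAL_SEG_MB=64

# Read log lines on stdin and print the name of any WAL segment
# reported as already removed.
removed_segment_from_log() {
    sed -n 's/.*requested WAL segment \([0-9A-Fa-f]\{24\}\) has already been removed.*/\1/p'
}

# Decode a 24-hex-digit WAL file name into its timeline and starting LSN.
# Layout: timeline (8 digits), log number (8 digits), segment number (8 digits).
decode_wal_segment() {
    name=$1
    case $name in
        *[!0-9A-Fa-f]*|"")
            echo "not a WAL segment name: $name" >&2
            return 1 ;;
    esac
    if [ "${#name}" -ne 24 ]; then
        echo "not a WAL segment name: $name" >&2
        return 1
    fi
    tli=$(printf '%s' "$name" | cut -c1-8)
    log=$(printf '%s' "$name" | cut -c9-16)
    seg=$(printf '%s' "$name" | cut -c17-24)
    # Byte offset of the segment's first byte within its log number.
    offset=$(( 0x$seg * $WAL_SEG_MB * 1024 * 1024 ))
    printf 'timeline=%d start_lsn=%X/%08X\n' "0x$tli" "0x$log" "$offset"
}

decode_wal_segment 00000001000017C00000003D
```

For the segment in the log above, this prints timeline=1 start_lsn=17C0/F4000000, which matches the position at which the mirror tried to start streaming. In other words, the primary had already removed the very first segment the mirror asked for.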
Steps to confirm:
1. Check the cluster's setting for "max_slot_wal_keep_size". The default is -1 (unlimited):
gpconfig -s max_slot_wal_keep_size
2. Check whether the WAL directory on the primary of the failed mirror has reached this WAL keep size:
du -sh <segment_data_directory>/pg_xlog # For Greenplum 6.x
or
du -sh <segment_data_directory>/pg_wal # For Greenplum 7.x
Example:
[gpadmin@sdw3 ~]$ du -sh /data/primary/gpseg13/pg_xlog
100.3G /data/primary/gpseg13/pg_xlog
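The comparison in step 2 can be sketched as a small helper. This is an illustration under stated assumptions: the function names are hypothetical, both values are taken to be in megabytes (the unit "max_slot_wal_keep_size" is understood to use), and the limit is passed in by hand rather than parsed from gpconfig output.

```shell
# Succeed (exit status 0) when used_mb exceeds limit_mb.
# A negative limit (-1) means unlimited, so it can never be exceeded.
wal_over_limit() {
    used_mb=$1
    limit_mb=$2
    [ "$limit_mb" -ge 0 ] && [ "$used_mb" -gt "$limit_mb" ]
}

# Print a one-line summary for a given usage and limit (both in MB).
report_wal_usage() {
    if wal_over_limit "$1" "$2"; then
        echo "WAL directory uses $1 MB, above the $2 MB limit"
    elif [ "$2" -lt 0 ]; then
        echo "WAL directory uses $1 MB; no limit is set (-1)"
    else
        echo "WAL directory uses $1 MB, within the $2 MB limit"
    fi
}

# Sample values: roughly 100.3 GB of WAL with the default (unlimited) setting.
report_wal_usage 102707 -1
```

On a segment host, the first argument could be taken from du, for example report_wal_usage "$(du -sm <segment_data_directory>/pg_xlog | cut -f1)" 10240, where 10240 is a placeholder for the value reported by gpconfig. Because du counts everything in the directory, treat the result as an approximation.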
Possible reasons for this accumulation of WAL files are:
When this error occurs, the mirror can be recovered with a full recovery ('gprecoverseg -F').
However, the issue may recur if the root cause of the WAL accumulation is not found and resolved.
The "max_slot_wal_keep_size" GUC was introduced in GPDB 6.10 for scenarios where the transaction logs (WAL) are not transferred to the mirrors quickly enough and accumulate on the primary segments. In that situation pg_xlog (pg_wal in Greenplum 7.x) can take up more space than expected and, in some cases, fill the disk.
To protect the cluster from this excess space usage, "max_slot_wal_keep_size" lets administrators cap the amount of WAL data retained for the mirror. Once the cap is reached, the primary deletes WAL files as necessary to stay within it. The mirror then fails, because the deleted WAL had not yet been replicated to it (see the max_slot_wal_keep_size documentation for further details on the GUC).