The filesystem holding the segment data directories is close to 100% full.
A large number of WAL files is building up in the pg_xlog (GPDB 6.x) or pg_wal (GPDB 7.x) directory on one or more segments.
Under normal operation, WAL files are replicated from the primary to its mirror and then deleted once they are no longer required for recovery after a shutdown or failure.
If the mirror goes down or there is a failover, the WAL files cannot be copied to the mirror, and a large buildup of WAL files can accumulate on the primary. This is typically the case when the mirror segment is not recovered for an extended period of time.
The buildup of WAL files can cause GPDB to become unresponsive or go down if the disk becomes full.
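To gauge the extent of the buildup, you can check WAL directory sizes and file counts across the segment hosts. This is a sketch only: the hostfile name and segment data directory paths below are illustrative and must be adjusted for your environment.

```shell
# Hypothetical hostfile and data directory layout; adjust for your cluster.
# Report total WAL directory size per primary segment
# (pg_wal on GPDB 7.x; substitute pg_xlog on GPDB 6.x).
gpssh -f hostfile_segments "du -sh /data/primary/gpseg*/pg_wal"

# Count the WAL files in each segment's WAL directory.
gpssh -f hostfile_segments "ls /data/primary/gpseg*/pg_wal | wc -l"
```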
The archive_command parameter is set but is not completing successfully. If Greenplum Disaster Recovery (GPDR) is installed and configured, it sets archive_command to run pgbackrest to archive the WAL files.
If the WAL files cannot be archived because the archive repository is unavailable, the number of WAL files kept on the segments can grow over time.
Check the coordinator log and the segment log files for error messages like "The failed archive command was ..."
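A quick way to search for these messages is shown below. The log directory and file-name pattern are illustrative (log file locations and names vary by GPDB version), and the hostfile and segment paths are placeholders for your environment.

```shell
# Search the coordinator log for archiver failures (paths are illustrative).
grep "The failed archive command" "$COORDINATOR_DATA_DIRECTORY"/log/*.csv

# Search the segment logs across hosts (hostfile and paths are hypothetical).
gpssh -f hostfile_segments 'grep -l "The failed archive command" /data/primary/gpseg*/log/*.csv'
```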
This resolution should be used when the issue is a mirror that is down and cannot be recovered for an extended period of time.
The number of WAL files retained by a segment can be controlled with the max_slot_wal_keep_size parameter.
To avoid filling the filesystem with WAL files, set this GUC to a reasonable value (specified in megabytes):
# For example set the value to 400MB for all slots
gpconfig -c max_slot_wal_keep_size -v 400
# Reload the config and make the setting take effect.
gpstop -u
If max_slot_wal_keep_size is set to a non-default value for acting primaries, full and incremental recovery of their mirrors may not be possible. Depending on the workload on the primary running concurrently with a full recovery, the recovery may fail with a missing WAL error. Therefore, you must ensure that max_slot_wal_keep_size is set to the default of -1 or a high enough value before running full recovery. Similarly, depending on how behind the downed mirror is, an incremental recovery of it may fail with a missing WAL error. In this case, full recovery would be the only recourse.
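Restoring the default before recovery might look like the following sketch; which gprecoverseg options are appropriate depends on how far behind the mirror is.

```shell
# Restore the default (-1 = unlimited WAL retention for replication slots)
# so recovery is not blocked by missing WAL.
gpconfig -c max_slot_wal_keep_size -v -1

# Reload the configuration so the setting takes effect.
gpstop -u

# Try incremental recovery first.
gprecoverseg

# If incremental recovery fails with a missing WAL error,
# full recovery is the only recourse:
# gprecoverseg -F
```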
This resolution should be used if the issue is with the archive_command.
Check the log files for archive error messages; they may contain specific information on why archiving is failing.
Ensure the archive repository is available on the host reporting the errors.
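If GPDR uses pgBackRest for archiving, its built-in check command can confirm that the repository is reachable and that WAL archiving works end to end. The stanza name below is a placeholder; use the stanza configured for your deployment.

```shell
# Verify repository connectivity and WAL archiving
# (stanza name is a placeholder for your configured stanza).
pgbackrest --stanza=my_stanza check
```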