GPDR restore error

Products

VMware Tanzu Greenplum, Greenplum, Pivotal Data Suite Non Production Edition, VMware Tanzu Data Suite

Issue/Introduction

A gpdr restore fails with the following errors.

The first error can be found in the file ~/gpAdminLogs/gpdr_<date>.log on the coordinator/master host:

Error occurred while running command "restore" on the cluster: Could not restore backup

The second error can be found in the file /usr/local/gpdr/logs/gpdb-segXX-restore.log on the segment host(s):

ERROR: [038]: unable to restore while PostgreSQL is running
       HINT: presence of 'postmaster.pid' in '/data/master/gpsegXX' indicates PostgreSQL is running.
       HINT: remove 'postmaster.pid' only if PostgreSQL is not running.

gpstate reports that the DR cluster is not running:

[CRITICAL]:-gpstate failed. (Reason='could not connect to server: Connection refused
        Is the server running on host "localhost" (::1) and accepting
        TCP/IP connections on port 5432?
could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?
') exiting...

Cause

If two "gpdr restore" commands run at the same time, the second one to start is likely to fail, and in doing so it cancels the first one while that restore is only partially complete.

Do not run two gpdr restores concurrently.
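Until a built-in safeguard exists, one way to reduce the risk is to start restores through a small wrapper that refuses to run while another invocation is in progress. The following is a minimal sketch, not part of gpdr: the function name run_exclusive and the lock path /tmp/gpdr-restore.lock are assumptions chosen for illustration, and the lock only protects restores started through this wrapper on the same host.

```shell
# Hypothetical lock location; any path writable by gpadmin would do.
LOCK_DIR="${LOCK_DIR:-/tmp/gpdr-restore.lock}"

# Run the given command only if no other wrapped command holds the lock.
run_exclusive() {
    # mkdir is atomic: it succeeds for exactly one caller at a time.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "lock $LOCK_DIR exists; another gpdr restore may be running" >&2
        return 1
    fi
    # Run the command and remember its exit status.
    "$@" && rc=0 || rc=$?
    # Release the lock whether the command succeeded or failed.
    rmdir "$LOCK_DIR"
    return "$rc"
}

# Usage (add your usual restore options):
#   run_exclusive gpdr restore
```

If the wrapped command is killed before the lock is released, the lock directory remains and must be removed by hand once it is confirmed that no restore is running.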

Resolution

Workaround

First confirm that the pg_control file exists in the repository:

1. Run the following command. This example is for seg2; adjust the segment number in the stanza name, the config file name and the repository path to check a different segment.
   source /usr/local/greenplum-db/greenplum_path.sh && pgbackrest --log-level-console warn --stanza gpdb-seg2 --config /usr/local/gpdr/configs/pgbackrest-seg2.conf repo-ls backup/gpdb-seg2/<DATE>/pg_data/global
2. The output should include pg_control.gz. It is enough to run this check for any one of the segments.
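Because the segment number appears three times in the command, it is easy to change it in only one or two places. The sketch below builds the command line from a segment number and a backup directory name so that it can be reviewed before it is run. The function name pg_control_check_cmd is hypothetical; the paths and pgbackrest options are the ones shown in the command above.

```shell
# Print the repo-ls command for one segment; nothing is executed here.
# $1 = segment number as used in the stanza name (2 for gpdb-seg2)
# $2 = backup directory name, i.e. the <DATE> part of the repository path
pg_control_check_cmd() {
    seg=$1
    label=$2
    printf 'pgbackrest --log-level-console warn --stanza gpdb-seg%s --config /usr/local/gpdr/configs/pgbackrest-seg%s.conf repo-ls backup/gpdb-seg%s/%s/pg_data/global\n' \
        "$seg" "$seg" "$seg" "$label"
}

# Usage: print the command, check it, then run it after sourcing
# /usr/local/greenplum-db/greenplum_path.sh:
#   pg_control_check_cmd 2 '<DATE>'
```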

To restore the DR cluster:

1. Make sure all Postgres processes are stopped on the coordinator and on all segment hosts.
2. Make sure the socket files in the /tmp directory on each host have been deleted.
3. Manually remove all postmaster.pid files in the coordinator and segment data directories.
4. Retry the gpdr restore.
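Steps 1 to 3 can be checked on each host before anything is deleted. The following is a minimal sketch under stated assumptions: the function names postgres_running and list_leftovers are hypothetical, the socket pattern .s.PGSQL.* is the standard PostgreSQL socket naming, and the data directories must be supplied by the operator. The script only reports what it finds; it removes nothing.

```shell
# Succeeds if any process named exactly "postgres" is alive on this host.
postgres_running() {
    pgrep -x postgres >/dev/null 2>&1
}

# List leftover socket files and postmaster.pid files on this host.
# $1 = socket directory (normally /tmp); remaining args = data directories.
list_leftovers() {
    sock_dir=$1
    shift
    # PostgreSQL names its Unix sockets .s.PGSQL.<port>, plus a .lock file.
    for f in "$sock_dir"/.s.PGSQL.*; do
        [ -e "$f" ] && echo "$f"
    done
    for dir in "$@"; do
        [ -f "$dir/postmaster.pid" ] && echo "$dir/postmaster.pid"
    done
    return 0
}

# Usage (replace the data directory with the real ones on this host):
#   if postgres_running; then
#       echo "postgres is still running; stop it before cleaning up" >&2
#   else
#       list_leftovers /tmp /data/master/gpsegXX
#   fi
```

Remove the listed files only after postgres_running reports that no Postgres process is left, in line with the hint in the pgbackrest error message.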

Fix

R&D is developing measures to prevent concurrent gpdr restores.