Point-in-time-recovery (PITR) with Patroni

Products

VMware Tanzu Greenplum

Issue/Introduction

Q: Is it possible to perform PITR recovery with Patroni using pgBackrest?
A: As of pgBackrest 2.23 and Patroni 1.6.4, it is only possible to perform a PITR if you bootstrap the cluster. Normally in a single instance setup, you can stop the database and run this command:

pgbackrest --stanza=main --delta  --type=time "--target=<timestamp>" --target-action=promote restore

However, in Patroni this is not possible. If you attempt to run it this way you will get recovery errors on startup. The only way to do it is to use custom bootstrap method which basically recreates the cluster. We will demonstrate this below. We will assume that patroni yaml file is setup with these parameters in the bootstrap section:

bootstrap:
  method: pgbackrest
  pgbackrest:
      command: /home/postgres/custom_bootstrap.sh
      keep_existing_recovery_conf: False
      no_params: False
      recovery_conf:
          recovery_target: immediate
          recovery_target_action: pause
          restore_command: pgbackrest --stanza=main archive-get %f %p

More details on the pgbackrest setup we are using can be found in this article: How to: Use pgbackrest to bootstrap or add replica to HA Patroni

Environment

Product Version: 11.7

Resolution

1. Create Table and insert some data:

postgres=# create table kevin(id int);
CREATE TABLE
postgres=# insert into kevin values(generate_series(1,1000));
INSERT 0 1000

2. Create a full backup using pgbackrest.

3. Note the time:

postgres=# select current_timestamp;
       current_timestamp       
-------------------------------
 2020-07-02 19:20:54.320877+01
(1 row)

4. "Accidentally" drop the table:

postgres=# drop table kevin;
DROP TABLE
postgres=# checkpoint;
CHECKPOINT
postgres=# select * from kevin;
ERROR:  relation "kevin" does not exist
LINE 1: select * from kevin;

5. Stop Patroni nodes and remove the cluster

[postgres@postgres_node_2 ~]$ patronictl -c postgres2.yml remove patroni_backrest
+------------------+-------------+--------------------+------+---------+----+-----------+
|     Cluster      |    Member   |        Host        | Role |  State  | TL | Lag in MB |
+------------------+-------------+--------------------+------+---------+----+-----------+
| patroni_backrest | postgresql1 | 10.193.102.52:5532 |      | stopped |    |   unknown |
+------------------+-------------+--------------------+------+---------+----+-----------+
Please confirm the cluster name to remove: patroni_backrest
You are about to remove all information in DCS for patroni_backrest, please type: "Yes I am aware": Yes I am aware

6. Change custom_bootstrap.sh and put the time we observed earlier. This is the restore point when the table still existed:

#!/bin/sh

mkdir -p /home/postgres/data
pgbackrest --stanza=main  --log-level-console=info --delta --type=time "--target=2020-07-02 19:20:54.320877+01" --target-action=promote restore

7. Move or delete the old data directory
8. Start Patroni. The log file should look like this:

2020-07-02 19:30:41.147 P01   INFO: restore file /home/postgres/data/base/1/13723 (0B, 100%)
2020-07-02 19:30:41.149 P01   INFO: restore file /home/postgres/data/base/1/13718 (0B, 100%)
2020-07-02 19:30:41.149 P00   INFO: write /home/postgres/data/recovery.conf
2020-07-02 19:30:41.156 P00   INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2020-07-02 19:30:41,263 INFO: Lock owner: None; I am postgresql1
2020-07-02 19:30:41,263 INFO: not healthy enough for leader race
2020-07-02 19:30:41,269 INFO: bootstrap in progress
2020-07-02 19:30:41.460 P00   INFO: restore command end: completed successfully (9943ms)
2020-07-02 19:30:41.924 BST [23510] LOG:  listening on IPv4 address "0.0.0.0", port 5532
2020-07-02 19:30:41.924 BST [23510] LOG:  listening on IPv6 address "::", port 5532
2020-07-02 19:30:41.938 BST [23510] LOG:  listening on Unix socket "./.s.PGSQL.5532"
2020-07-02 19:30:41.966 BST [23510] LOG:  redirecting log output to logging collector process
2020-07-02 19:30:41.966 BST [23510] HINT:  Future log output will appear in directory "pg_log".
2020-07-02 19:30:41,977 INFO: postmaster pid=23510
localhost:5532 - rejecting connections
localhost:5532 - rejecting connections
localhost:5532 - accepting connections
2020-07-02 19:30:43,021 INFO: establishing a new patroni connection to the postgres cluster
2020-07-02 19:30:43,034 INFO: waiting for end of recovery after bootstrap

9. Run pg_wal_replay_resume();

psql -h localhost -p 5532 -c 'select pg_wal_replay_resume();'

10. Once you get the leader lock, the table should be back:

[postgres@postgres_node_2 ~]$ psql -h localhost -p 5532
psql (11.7 (VMware Postgres 11.7.3))
Type "help" for help.

postgres=# select * from kevin;
  id  
------
    1
    2
    3
    4
    5
    6
    7

11. Add the replicas per your normal process.