How to recover a Patroni PostgreSQL instance when it fails to start

Article ID: 295251

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Let's assume this scenario: the member [postgresql2] has failed to start, and starting it manually with pg_ctl does not work either:
-bash-4.2$ patronictl -c patroni_etcd_conf.d/postgres_member1.yaml list
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
|     Cluster     |    Member   |        Host       |  Role  |    State     | TL | Lag in MB |
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
| patroni_cluster | postgresql1 | 172.28.8.251:7432 | Leader |   running    | 31 |           |
| patroni_cluster | postgresql2 | 172.28.8.251:7433 |        | start failed |    |   unknown |
| patroni_cluster | postgresql3 | 172.28.8.251:7434 |        |   running    | 31 |         0 |
+-----------------+-------------+-------------------+--------+--------------+----+-----------+

-bash-4.2$ pg_ctl start -D ./data2/
waiting for server to start....2020-10-06 21:58:33.353 CST [20916] LOG:  listening on IPv4 address "172.28.8.251", port 7433
2020-10-06 21:58:33.355 CST [20916] LOG:  listening on Unix socket "./.s.PGSQL.7433"
2020-10-06 21:58:33.365 CST [20916] LOG:  redirecting log output to logging collector process
2020-10-06 21:58:33.365 CST [20916] HINT:  Future log output will appear in directory "my_pg_log".
 stopped waiting
pg_ctl: could not start server
Examine the log output.
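
If Patroni runs as a systemd service, its log can usually be inspected with journalctl, and the PostgreSQL log ends up in the directory named in the HINT above. A minimal sketch, assuming a systemd unit named patroni and the data directory used in this example (adjust the unit name, paths, and log file name to your environment):
-bash-4.2$ journalctl -u patroni -n 100 --no-pager
-bash-4.2$ ls -lt ./data2/my_pg_log/ | head -5
-bash-4.2$ tail -n 50 ./data2/my_pg_log/<latest_log_file>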

From the Patroni log, we can see that Patroni tried pg_rewind, but it did not help because of a timeline divergence.

[postgresql2] is still on timeline 28, while the other members are already on timeline 31:
2020-10-06 21:41:24,850 INFO: running pg_rewind from postgresql1
2020-10-06 21:41:24,869 INFO: running pg_rewind from dbname=postgres user=rewind_user host=172.28.8.251 port=7432
servers diverged at WAL location 0/701A3E8 on timeline 28
no rewind required
2020-10-06 21:41:24,887 WARNING: Postgresql is not running.
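
To confirm the divergence independently of Patroni, pg_controldata can be run against the failed member's data directory and compared with the leader. A minimal sketch, assuming the data directory path from this example; for this member the "Latest checkpoint's TimeLineID" line should still report 28:
-bash-4.2$ pg_controldata ./data2 | grep -i timeline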

From pg_log, we can see that [postgresql2] cannot follow timeline 31, because timeline 31 does not contain its minimum recovery point on timeline 28:
2020-10-06 21:42:13.716 CST [17711] LOG:  database system was shut down in recovery at 2020-10-06 21:41:24 CST
2020-10-06 21:42:13.717 CST [17711] LOG:  entering standby mode
2020-10-06 21:42:13.717 CST [17711] FATAL:  requested timeline 31 does not contain minimum recovery point 0/701A420 on timeline 28
2020-10-06 21:42:13.718 CST [17709] LOG:  startup process (PID 17711) exited with exit code 1
2020-10-06 21:42:13.718 CST [17709] LOG:  aborting startup due to startup process failure
2020-10-06 21:42:13.719 CST [17709] LOG:  database system is shut down
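
For comparison, the leader's current timeline can be read with pg_control_checkpoint(), and the timeline history files it has accumulated are visible under pg_wal. A sketch, assuming PostgreSQL 9.6 or newer and the leader's connection details from this example; the leader's data directory path below is only a placeholder:
-bash-4.2$ psql -h 172.28.8.251 -p 7432 -U postgres -c "SELECT timeline_id FROM pg_control_checkpoint();"
-bash-4.2$ ls <leader_data_dir>/pg_wal/*.history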


Resolution

When this kind of issue occurs in a Patroni cluster, we can recover the failed instance using patronictl reinit in most cases.
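
The general invocation is sketched below; note that reinit removes the member's data directory and rebuilds it from the leader, so double-check that the correct member is targeted. The configuration file, cluster name, and member name are placeholders:
-bash-4.2$ patronictl -c <patroni_config.yaml> reinit <cluster_name> <member_name>
If your Patroni version supports it, --force can be added to skip the interactive confirmation prompt.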

In this case, we reinitialize the member [postgresql2]:
-bash-4.2$ patronictl -c postgres_member1.yaml reinit patroni_cluster postgresql2                    
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
|     Cluster     |    Member   |        Host       |  Role  |    State     | TL | Lag in MB |
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
| patroni_cluster | postgresql1 | 172.28.8.251:7432 | Leader |   running    | 31 |           |
| patroni_cluster | postgresql2 | 172.28.8.251:7433 |        | start failed |    |   unknown |
| patroni_cluster | postgresql3 | 172.28.8.251:7434 |        |   running    | 31 |         0 |
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
Are you sure you want to reinitialize members postgresql2? [y/N]: y
Success: reinitialize for member postgresql2
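
While the copy is running, progress can be followed by re-running patronictl list; the member typically shows a state such as "creating replica" until the basebackup finishes. A sketch, assuming the watch utility is available:
-bash-4.2$ watch -n 2 patronictl -c postgres_member1.yaml list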

From the Patroni log, we can see the reinitialization of [postgresql2] in progress:
2020-10-06 22:16:32,237 INFO: Removing data directory: /var/lib/pgsql/data2
2020-10-06 22:16:33,378 INFO: Lock owner: postgresql1; I am postgresql2
2020-10-06 22:16:33,380 INFO: reinitialize in progress
WARNING:  skipping special file "./.s.PGSQL.7432"
2020-10-06 22:16:40,152 INFO: replica has been created using basebackup
2020-10-06 22:16:40,153 INFO: bootstrapped from leader 'postgresql1'
2020-10-06 22:16:40.525 CST [24249] LOG:  listening on IPv4 address "172.28.8.251", port 7433
2020-10-06 22:16:40.526 CST [24249] LOG:  listening on Unix socket "./.s.PGSQL.7433"
2020-10-06 22:16:40.532 CST [24249] LOG:  redirecting log output to logging collector process
2020-10-06 22:16:40.532 CST [24249] HINT:  Future log output will appear in directory "my_pg_log".
2020-10-06 22:16:40,541 INFO: postmaster pid=24249
172.28.8.251:7433 - rejecting connections
172.28.8.251:7433 - rejecting connections
172.28.8.251:7433 - accepting connections

Finally, we can see that [postgresql2] is back to normal:
-bash-4.2$ patronictl -c postgres_member1.yaml list
+-----------------+-------------+-------------------+--------+---------+----+-----------+
|     Cluster     |    Member   |        Host       |  Role  |  State  | TL | Lag in MB |
+-----------------+-------------+-------------------+--------+---------+----+-----------+
| patroni_cluster | postgresql1 | 172.28.8.251:7432 | Leader | running | 31 |           |
| patroni_cluster | postgresql2 | 172.28.8.251:7433 |        | running | 31 |         0 |
| patroni_cluster | postgresql3 | 172.28.8.251:7434 |        | running | 31 |         0 |
+-----------------+-------------+-------------------+--------+---------+----+-----------+
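
As a final check, replication from the leader can be verified through pg_stat_replication. A sketch, assuming the leader's connection details from this example and PostgreSQL 10 or newer for the column names; both replicas should appear with state "streaming":
-bash-4.2$ psql -h 172.28.8.251 -p 7432 -U postgres -c "SELECT application_name, client_addr, state, sync_state, replay_lsn FROM pg_stat_replication;"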