Let's assume this scenario: the member [postgresql2] failed to start, and starting it manually with pg_ctl does not work either:
-bash-4.2$ patronictl -c patroni_etcd_conf.d/postgres_member1.yaml list
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
|     Cluster     |   Member    |       Host        |  Role  |    State     | TL | Lag in MB |
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
| patroni_cluster | postgresql1 | 172.28.8.251:7432 | Leader |   running    | 31 |           |
| patroni_cluster | postgresql2 | 172.28.8.251:7433 |        | start failed |    |   unknown |
| patroni_cluster | postgresql3 | 172.28.8.251:7434 |        |   running    | 31 |         0 |
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
-bash-4.2$ pg_ctl start -D ./data2/
waiting for server to start....2020-10-06 21:58:33.353 CST [20916] LOG: listening on IPv4 address "172.28.8.251", port 7433
2020-10-06 21:58:33.355 CST [20916] LOG: listening on Unix socket "./.s.PGSQL.7433"
2020-10-06 21:58:33.365 CST [20916] LOG: redirecting log output to logging collector process
2020-10-06 21:58:33.365 CST [20916] HINT: Future log output will appear in directory "my_pg_log".
 stopped waiting
pg_ctl: could not start server
Examine the log output.
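pg_ctl suggests examining the log output. Assuming the "my_pg_log" directory from the HINT above is a relative log_directory setting (relative paths are resolved against the data directory), the most recent server log can be checked with something like:

-bash-4.2$ tail -n 20 "$(ls -t ./data2/my_pg_log/* | head -n 1)"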
From the Patroni log, we can see that Patroni ran pg_rewind, but it did not resolve the problem: pg_rewind decided that no rewind was required even though the timelines had diverged.
[postgresql2] is still on timeline 28, while the other members are already on timeline 31.
2020-10-06 21:41:24,850 INFO: running pg_rewind from postgresql1
2020-10-06 21:41:24,869 INFO: running pg_rewind from dbname=postgres user=rewind_user host=172.28.8.251 port=7432
servers diverged at WAL location 0/701A3E8 on timeline 28
no rewind required
2020-10-06 21:41:24,887 WARNING: Postgresql is not running.
From pg_log, we can see that [postgresql2] cannot enter timeline 31, because its minimum recovery point still lies on the diverged timeline 28:
2020-10-06 21:42:13.716 CST [17711] LOG: database system was shut down in recovery at 2020-10-06 21:41:24 CST
2020-10-06 21:42:13.717 CST [17711] LOG: entering standby mode
2020-10-06 21:42:13.717 CST [17711] FATAL: requested timeline 31 does not contain minimum recovery point 0/701A420 on timeline 28
2020-10-06 21:42:13.718 CST [17709] LOG: startup process (PID 17711) exited with exit code 1
2020-10-06 21:42:13.718 CST [17709] LOG: aborting startup due to startup process failure
2020-10-06 21:42:13.719 CST [17709] LOG: database system is shut down
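Before reinitializing anything, it is worth confirming the timeline mismatch directly. A minimal check, assuming the same data directory and configuration file as above (and a Patroni release recent enough to ship patronictl history), is to compare the member's control file with the cluster's timeline history:

-bash-4.2$ pg_controldata ./data2/ | grep TimeLineID
-bash-4.2$ patronictl -c patroni_etcd_conf.d/postgres_member1.yaml history patroni_cluster

The first command prints the timeline recorded in [postgresql2]'s control file, and the second lists the timeline switches the cluster has gone through, so the gap between timeline 28 and 31 becomes obvious.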
When this kind of issue occurs in a Patroni cluster, we can recover the failed member with patronictl reinit in most cases. Reinitialization removes the member's data directory and rebuilds it as a fresh replica from the current leader.
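The general form, assuming the same configuration file as above, looks like this; on recent Patroni versions, --force skips the interactive confirmation and --wait blocks until the reinitialization has finished:

-bash-4.2$ patronictl -c postgres_member1.yaml reinit <cluster_name> <member_name> [--force] [--wait]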
In this case, we reinit the member [postgresql2]:
-bash-4.2$ patronictl -c postgres_member1.yaml reinit patroni_cluster postgresql2
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
|     Cluster     |   Member    |       Host        |  Role  |    State     | TL | Lag in MB |
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
| patroni_cluster | postgresql1 | 172.28.8.251:7432 | Leader |   running    | 31 |           |
| patroni_cluster | postgresql2 | 172.28.8.251:7433 |        | start failed |    |   unknown |
| patroni_cluster | postgresql3 | 172.28.8.251:7434 |        |   running    | 31 |         0 |
+-----------------+-------------+-------------------+--------+--------------+----+-----------+
Are you sure you want to reinitialize members postgresql2? [y/N]: y
Success: reinitialize for member postgresql2
From the Patroni log, we can see the reinitialization process of [postgresql2]:
2020-10-06 22:16:32,237 INFO: Removing data directory: /var/lib/pgsql/data2
2020-10-06 22:16:33,378 INFO: Lock owner: postgresql1; I am postgresql2
2020-10-06 22:16:33,380 INFO: reinitialize in progress
WARNING: skipping special file "./.s.PGSQL.7432"
2020-10-06 22:16:40,152 INFO: replica has been created using basebackup
2020-10-06 22:16:40,153 INFO: bootstrapped from leader 'postgresql1'
2020-10-06 22:16:40.525 CST [24249] LOG: listening on IPv4 address "172.28.8.251", port 7433
2020-10-06 22:16:40.526 CST [24249] LOG: listening on Unix socket "./.s.PGSQL.7433"
2020-10-06 22:16:40.532 CST [24249] LOG: redirecting log output to logging collector process
2020-10-06 22:16:40.532 CST [24249] HINT: Future log output will appear in directory "my_pg_log".
2020-10-06 22:16:40,541 INFO: postmaster pid=24249
172.28.8.251:7433 - rejecting connections
172.28.8.251:7433 - rejecting connections
172.28.8.251:7433 - accepting connections
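While the rebuild is running, the member shows up in patronictl list with an in-progress state rather than running. Depending on the patronictl version, the output can be kept refreshing with the watch option, for example:

-bash-4.2$ patronictl -c postgres_member1.yaml list -W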
Finally, we can see that [postgresql2] is back to normal:
-bash-4.2$ patronictl -c postgres_member1.yaml list
+-----------------+-------------+-------------------+--------+---------+----+-----------+
|     Cluster     |   Member    |       Host        |  Role  |  State  | TL | Lag in MB |
+-----------------+-------------+-------------------+--------+---------+----+-----------+
| patroni_cluster | postgresql1 | 172.28.8.251:7432 | Leader | running | 31 |           |
| patroni_cluster | postgresql2 | 172.28.8.251:7433 |        | running | 31 |         0 |
| patroni_cluster | postgresql3 | 172.28.8.251:7434 |        | running | 31 |         0 |
+-----------------+-------------+-------------------+--------+---------+----+-----------+
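As a final sanity check beyond patronictl, we can confirm on the leader that both replicas are streaming again. A minimal query, run against the leader on port 7432 (adjust the connection options to your setup), could be:

-bash-4.2$ psql -h 172.28.8.251 -p 7432 -d postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"

Since Patroni uses the member name as application_name by default, postgresql2 and postgresql3 should both appear with state 'streaming'.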