pg_auto_failover - primary node does not allow writes when all secondaries are down

Products

VMware Tanzu Greenplum

Issue/Introduction

pg_auto_failover uses synchronous replication by default as a part of it's High-Availability architecture. When the secondary node goes down, there is different behavior when only a primary node is left depending on the number of secondaries in the formation.

Environment

Product Version: 13.2

Resolution

We will take a look at a 2-node formation with the following setup:

[postgres@autofailover-m ~]$ pg_autoctl show state
  Name |  Node |           Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+---------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | autofailover-1:5432 | 0/A000230 |       yes |             primary |             primary
node_2 |     2 | autofailover-2:5432 | 0/A000230 |       yes |           secondary |           secondary

And a 3-node formation with the following setup:

[postgres@autofailover1 ~]$ pg_autoctl show state
  Name |  Node |          Host:Port |        LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------+------------+-----------+---------------------+--------------------
node_1 |     1 | autofailover2:5432 | 0/13000148 |       yes |             primary |             primary
node_3 |     3 | autofailover4:5432 | 0/13000148 |       yes |           secondary |           secondary
node_5 |     5 | autofailover3:5432 | 0/13000148 |       yes |           secondary |           secondary

When only the primary remains in a 2-node setup, the primary will allow writes as stated in the documentation.

Secondary node is monitored unhealthy

When the secondary node is unhealthy, the monitor assigns to it the state CATCHINGUP and assigns the state WAIT_PRIMARY to the primary node. When implementing the transition from PRIMARY to WAIT_PRIMARY, the keeper disables synchronous replication.

When the keeper reports an acceptable WAL difference in the two nodes again, the replication is upgraded back to being synchronous. While a secondary node is not in the SECONDARY state, secondary promotion is disabled.

However, in a 3+node setup, a different behavior is observed where synchronous replication is not automatically disabled by the monitor and writes to the primary will give the following:

[postgres@autofailover2 ~]$ psql
psql (13.3 (VMware Postgres 13.3.0))
Type "help" for help.

postgres=# create table test1 (id int);
^CCancel request sent
WARNING:  canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
CREATE TABLE

This can give the appearance that production is down when in fact this is expected behavior.

2 - Node Formation

If we bring down the secondary, only the primary will be left and the monitor will give the following states:

[postgres@autofailover-m ~]$ pg_autoctl show state
  Name |  Node |           Host:Port |       LSN | Reachable |       Current State |      Assigned State
-------+-------+---------------------+-----------+-----------+---------------------+--------------------
node_1 |     1 | autofailover-1:5432 | 0/A015178 |       yes |        wait_primary |        wait_primary
node_2 |     2 | autofailover-2:5432 | 0/A000230 |        no |           secondary |          catchingup

And the Postgres primary will allow writes:

[postgres@autofailover-1 ~]$ psql
psql (12.7 (VMware Postgres 12.7.0))
Type "help" for help.

postgres=# create table test (id int);
CREATE TABLE

3 - Node Formation

If we bring down all the secondaries, only the primary will be left and the monitor will give the following states:

[postgres@autofailover1 ~]$ pg_autoctl show state
  Name |  Node |          Host:Port |        LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------+------------+-----------+---------------------+--------------------
node_1 |     1 | autofailover2:5432 | 0/13000148 |       yes |             primary |             primary
node_3 |     3 | autofailover4:5432 | 0/13000148 |        no |           secondary |          catchingup
node_5 |     5 | autofailover3:5432 | 0/13000148 |        no |           secondary |          catchingup

Notice that the primary is never updated to "wait_primary". This means the primary is waiting for an available secondary to continue replication before it allows writes to the primary. Any attempt to write to the primary will be met with a hanging psql prompt. The pg_auto_failover docs describe the difference in multi-standby setups:

https://pg-auto-failover.readthedocs.io/en/latest/architecture-multi-standby.html

The default value for number_sync_standbys GUC in pg_auto_failover is (number_sync_standbys +1). So in this setup, the value is 1 by default which accounts for synchronous replication not being disabled automatically by the monitor when only the primary is left.

How to Allow Writes to the Primary

You can set number_sync_standbys = 0 to disable this behavior if none of the secondaries are going to be available:

[postgres@autofailover1 ~]$ pg_autoctl set formation number-sync-standbys 0
22:37:20 23160 INFO  Waiting for the settings to have been applied to the monitor and primary node
22:37:20 23160 INFO  New state is reported by node 1 "node_1" (autofailover2:5432): "apply_settings"
22:37:20 23160 INFO  Setting goal state of node 1 "node_1" (autofailover2:5432) to wait_primary because none of the 2 standby candidate nodes are healthy at the moment.
22:37:20 23160 INFO  New state is reported by node 1 "node_1" (autofailover2:5432): "wait_primary"
22:37:20 23160 INFO  primary node has now set synchronous_standby_names = ''
0

[postgres@autofailover1 ~]$ pg_autoctl show state
  Name |  Node |          Host:Port |        LSN | Reachable |       Current State |      Assigned State
-------+-------+--------------------+------------+-----------+---------------------+--------------------
node_1 |     1 | autofailover2:5432 | 0/13010BF8 |       yes |        wait_primary |        wait_primary
node_3 |     3 | autofailover4:5432 | 0/13000148 |        no |           secondary |          catchingup
node_5 |     5 | autofailover3:5432 | 0/13000148 |        no |           secondary |          catchingup

After changing the config, you'll see the primary state updates to "wait_primary" and writes to the database will resume.

Note: You will need to bring up a secondary to re-enable the default behavior