"FATAL pg_autoctl does not know how to reach state "XX" from "YY"...Failed to transition to state" error in Postgres

Article ID: 296960

Updated On: 04-20-2024

Products

VMware Tanzu Greenplum

Issue/Introduction

A pg_auto_failover cluster or node is in an unexpected state and cannot transition from its current state to the state assigned by the monitor.

For example:

Name   | Node  | Host:Port                             | LSN         | Reachable | Current State       | Assigned State
-------+-------+---------------------------------------+-------------+-----------+---------------------+--------------------
node_3 | 5     | qssccdpdbl04v.qasgxdo.qasgx.com:55432 | 29/C1310080 | yes       | secondary           | secondary
node_4 | 8     | qpsccdpdbl05v.qasgxdo.qasgx.com:55432 | 29/C1000000 | yes       | prepare_promotion   | catchingup
node_1 | 9     | qpsccdpdbl04v.qasgxdo.qasgx.com:55432 | 29/C142EBD0 | yes       | primary             | primary
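
In the output above, node_4 is the stuck node: its Current State (prepare_promotion) differs from its Assigned State (catchingup). The following is a minimal sketch for filtering such rows out of a larger cluster. The function name find_stuck_nodes is illustrative, and it assumes the seven-column, pipe-delimited layout shown above; other pg_autoctl versions may print different columns.

```shell
#!/bin/sh
# find_stuck_nodes: read "pg_autoctl show state" output on stdin and print
# "name current_state assigned_state" for each node whose states differ.
# Assumes the 7-column pipe-delimited layout shown in this article.
find_stuck_nodes() {
    awk -F'|' '
        NF < 7 { next }                      # separator and blank lines
        {
            # trim padding around every column
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($6 == "Current State") next  # header row
            if ($6 != $7) print $1, $6, $7
        }'
}

# Usage:
#   pg_autoctl show state | find_stuck_nodes
```

For the example table this prints: node_4 prepare_promotion catchingup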

The pg_autoctl log on the affected node reports the following:

Jul 21 20:13:43 qpsccdpdbl05v pg_autoctl[23035]: 20:13:43 23040 INFO Monitor assigned new state "catchingup"
Jul 21 20:13:43 qpsccdpdbl05v pg_autoctl[23035]: 20:13:43 23040 FATAL pg_autoctl does not know how to reach state "catchingup" from "prepare_promotion"
Jul 21 20:13:43 qpsccdpdbl05v pg_autoctl[23035]: 20:13:43 23040 ERROR Failed to transition to state "catchingup", retrying...
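
To confirm that a node is hitting this error, the log can be searched for the FATAL line. The sketch below extracts the failed transition from log text on stdin. The function name failed_transitions is illustrative, and the pattern assumes the message wording shown above.

```shell
#!/bin/sh
# failed_transitions: read pg_autoctl log lines on stdin and print
# "current_state -> assigned_state" for each unreachable transition.
# The pattern assumes the exact FATAL message wording shown in this article.
failed_transitions() {
    sed -n 's/.*FATAL pg_autoctl does not know how to reach state "\([^"]*\)" from "\([^"]*\)".*/\2 -> \1/p'
}

# Usage (assumes the service logs to the journal under the unit name
# pgautofailover; adjust to wherever the logs are kept):
#   journalctl -u pgautofailover | failed_transitions
```

For the log excerpt above this prints: prepare_promotion -> catchingup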

Environment

Product Version: 10.15

Resolution

Workaround 1: Enable and disable maintenance on the affected node

Log in to the affected node as the Postgres user.

1. Check the state of the node or cluster with this command:

pg_autoctl show state

2. Enable maintenance with this command:

pg_autoctl enable maintenance

3. If it successfully transitions to maintenance state, then disable maintenance:

pg_autoctl disable maintenance

4. Check the state with this command:

pg_autoctl show state
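
The four steps above can be sketched as one shell function. This is an illustration and not a supported tool: the function name cycle_maintenance and the PG_AUTOCTL override variable are hypothetical, the commands run without extra options exactly as in the steps (so PGDATA is expected to be set in the environment), and the script relies on the exit status of pg_autoctl enable maintenance to detect a failed transition, which should be verified against the installed version.

```shell
#!/bin/sh
# Hypothetical wrapper around Workaround 1.
# PG_AUTOCTL may be overridden, for example with a full path to the binary.
PG_AUTOCTL="${PG_AUTOCTL:-pg_autoctl}"

cycle_maintenance() {
    # Step 1: record the state before changing anything
    "$PG_AUTOCTL" show state || return 1

    # Steps 2 and 3: disable maintenance only if enabling it succeeded
    if "$PG_AUTOCTL" enable maintenance; then
        "$PG_AUTOCTL" disable maintenance || return 1
    else
        echo "enable maintenance failed; node left unchanged" >&2
        return 1
    fi

    # Step 4: show the resulting state
    "$PG_AUTOCTL" show state
}

# Usage, as the Postgres user on the affected node:
#   cycle_maintenance
```

If the function returns a non-zero status, compare Current State and Assigned State in the last output before moving on to Workaround 2.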

Workaround 2: Drop and recreate the node

It is possible to drop and recreate the node without a full reinitialization and pg_basebackup.

The recreated node should be able to resume from the last known good replication point. Log in to the affected node as the Postgres user and follow these steps:

1. Check the state of the node or cluster with this command:

pg_autoctl show state

2. Drop the node. Do NOT use the "--destroy" option, so that the current data is kept.

pg_autoctl drop node

3. Get the current config of the node:

pg_autoctl config get

4. Create the node with the required options. The parameter values can be found in the output of the previous command.

pg_autoctl create postgres --auth authMode --pgdata dataDirectory ...

5. It may be necessary to log in as root and start the pgautofailover service:

systemctl start pgautofailover
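
Step 4 reuses values from the configuration printed in step 3. The following is a minimal sketch for pulling a single value out of that output. It assumes that pg_autoctl config get prints INI-style text ([section] headers followed by key = value lines), which should be checked against the installed version. The helper name ini_get is hypothetical, and the section and key names in the usage line are assumptions as well.

```shell
#!/bin/sh
# ini_get SECTION KEY: read INI-style text on stdin and print the value
# of KEY inside [SECTION]. Prints nothing if the key is not found.
ini_get() {
    awk -v section="$1" -v key="$2" '
        /^\[.*\]$/ { current = substr($0, 2, length($0) - 2); next }
        current == section {
            eq = index($0, "=")          # split on the first "=" only
            if (eq == 0) next
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; exit }
        }'
}

# Usage (section and key names are assumptions):
#   pg_autoctl config get > node.cfg
#   ini_get postgresql pgdata < node.cfg
```

Saving the output to a file keeps the settings at hand while assembling the pg_autoctl create postgres command line.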
