Unable to start Patroni after refreshing UAT postgres data structure (complete refresh) from PROD postgres data structure (complete backup) in VMware Postgres

search cancel

Unable to start Patroni after refreshing UAT postgres data structure (complete refresh) from PROD postgres data structure (complete backup) in VMware Postgres

book

Article ID: 296374

calendar_today

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

After refreshing your UAT Postgres instance (complete PGDATA structures) from PROD Postgres instance, Patroni doesn't want to start.

The following error message is found in the Patroni Log:

2020-05-14 12:02:57,289 CRITICAL: system ID mismatch, node pg_test_bb12345_pt3_536_h333_7347_0001 belongs to a different cluster: 5237488000333777727 !=  6353552553353555259

This means that the current refreshed Postgres instance Database System Identifier (6353552553353555259) doesn't match the ETCD value for the Database System identifier from the old Postgres instance structures (5237488000333777727).

Environment

Product Version: 11.2

Resolution

IMPORTANT NOTE: this example is used starting with fixing the Leader node within the patroni cluster first.

To view both values, you can issue the following commands:

Run the command below to get the Postgres instance Database System identifier. This can be done while Postgres is up or down.

pg_controldata -D <data directory structure> | grep "Database system identifier"

Note: To find <data directory structure>, you can issue the following command within Psql: 'show data_directory'. Or use the "Database system identifier" if PGDATA environment variable is defined.

Run the command below to get the ETCD value for the Database System identifier.

Note: ETCD will pickup and store the Database System Identifier from the Postgres instance the first time Patroni is started.

etcdctl get /service/<scope from yaml file>/initialize 

Note: pick scope attribute value and not name from yaml

Run the command below to get the yaml Leader/Master file scope attribute (/data/patroni.yml):

#Unique cluster name

scope: pg_cluster_bb12345

name: pg_test_bb12345_pt3_536_h333_7347_0001

...

...

Example

After Postgres was backed up with structures and the PGDATA directory was completely refreshed on UAT:

New postgres instance PGDATA structures:

pg_controldata -D /usr/local/postgresql/data

…

…

Database system identifier:      6353552553353555259

Database cluster state:        shut down

…

…

Old postgres instance PGDATA structures:

pg_controldata -D /usr/local/postgresql/data_bkup

…

…

Database system identifier:      5237488000333777727

Database cluster state:        shut down

…

…

The ETCD value:

etcdctl get /service/pg_cluster_bb12345/initialize

5237488000333777727

Note: The scope attribute value is used

As we can see, the example shows a discrepancy with new Postgres Instance Database System Identifier and the stored value in ETCD. The ETCD value has the same value as the OLD Postgres Instance Database System identifier.

Getting Leader up and running

To fix this issue, you will need to clear out the Database System Identifier from the ETCD Data Store and then start Patroni. This will trigger within ETCD to reset the Database System Identifier attribute with the Database System Identifier from new Postgres Instance structure.

1. Clear ETCD stored value.

Note: Replace <yaml file directory> in the command with the full path of where the Patroni yaml file exists. In other words, patronictl -c <yaml file directory> remove <yaml scope attribute value>.

patronictl -c /data/patroni.yml remove pg_cluster_bb12345

+--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+

|   Cluster          |    Member                              |      Host     |  Role         |  State  | TL.     | Lag in MB |

+-------------------------------------------------------------+---------------+---------------+---------+---------+-----------+

+--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+

Please confirm the cluster name to remove: pg_cluster_bb12345

You are about to remove all information in DCS for pg_cluster_bb12345, please type: “Yes I am aware”: Yes I am aware

Verify that data store value has been cleared out:

etcdctl get /service/pg_cluster_bb12345/initialize

Error:  100: Key not found (/service/pg_cluster_bb12345) [23052]

2. Start Patroni and verify that the Leader is up and the Database System Identifier value from ETCD is identical tot the Postgres Database System Identifier.

systemctl start patroni

3. Review the Patroni cluster host list to make sure the leader is up:

patronictl -c <yaml file> list

+--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+

|   Cluster          |    Member                              |      Host     |  Role         |  State  | TL.     | Lag in MB |

+-------------------------------------------------------------+---------------+---------------+---------+---------+-----------+

| pg_cluster_bb12345 | pg_test_bb12345_pt3_536_h333_7347_0001 | 10.18.233.34  | Leader        | running |  2      |         0 |

+--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+

etcdctl get /service/pg_cluster_bb12345/initialize

6353552553353555259

Getting replicas up and running

1. ETCD is its own cluster with replication capabilities. When we reset the Database System Identifier in ETCD instance on Leader/Master, it should have replicated over to ETCD instances on the replicas but it is a good idea to verify. From the below command, it seems like ETCD replication has worked.

Leader host> etcdctl get /service/prototype-eu/initialize

6353552553353555259

Replica1 host> etcdctl get /service/prototype-eu/initialize

6353552553353555259

Replica1 host> etcdctl get /service/prototype-eu/initialize

6353552553353555259

2. Follow this step on every replica. We need to remove (or rename) the current old data structure on the replica host. Once removed or renamed, as Patroni is started, the cluster will realize that this is a replica and that the Postgres data structure doesn't exist.

It will then perform a backup of Leader Postgres data structure and bring it over to the replica. At this point in time, the current Postgres Database System Identifier will be identical to the ETCD value. therefore, Patroni should start up.

systemctl start patroni

3. Again, review your Patroni cluster host list to make sure the leader and both replica hosts are up.

patronictl -c <yaml file> list

+--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+

|   Cluster          |    Member                              |      Host     |  Role         |  State  | TL.     | Lag in MB |

+-------------------------------------------------------------+---------------+---------------+---------+---------+-----------+

| pg_cluster_bb12345 | pg_test_bb12345_pt3_536_h333_7347_0001 | 10.18.233.35  |               | running |  2      |         0 |

| pg_cluster_bb12345 | pg_test_bb12345_pt3_536_h333_7347_0001 | 10.18.233.34  | Leader        | running |  2      |         0 |

| pg_cluster_bb12345 | pg_test_bb12345_pt3_536_h333_7347_0001 | 10.18.233.36  |               | running |  2      |         0 |

+--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+

Everything is up and working now.

Feedback

thumb_up Yes

thumb_down No