After refreshing your UAT Postgres instance (complete PGDATA structures) from PROD Postgres instance, Patroni doesn't want to start.
The following error message is found in the Patroni Log:
2020-05-14 12:02:57,289 CRITICAL: system ID mismatch, node pg_test_bb12345_pt3_536_h333_7347_0001 belongs to a different cluster: 5237488000333777727 != 6353552553353555259
This means that the current refreshed Postgres instance Database System Identifier (6353552553353555259) doesn't match the ETCD value for the Database System identifier from the old Postgres instance structures (5237488000333777727).
IMPORTANT NOTE: this example is used starting with fixing the Leader node within the patroni cluster first.
To view both values, you can issue the following commands:
pg_controldata -D <data directory structure> | grep "Database system identifier"
Note: To find <data directory structure>, you can issue the following command within Psql: 'show data_directory'. Or use the "Database system identifier" if PGDATA environment variable is defined.
etcdctl get /service/<scope from yaml file>/initialize Note: pick scope attribute value and not name from yaml
#Unique cluster name scope: pg_cluster_bb12345 name: pg_test_bb12345_pt3_536_h333_7347_0001 ... ...
After Postgres was backed up with structures and the PGDATA directory was completely refreshed on UAT:
pg_controldata -D /usr/local/postgresql/data … … Database system identifier: 6353552553353555259 Database cluster state: shut down … …
pg_controldata -D /usr/local/postgresql/data_bkup … … Database system identifier: 5237488000333777727 Database cluster state: shut down … …
etcdctl get /service/pg_cluster_bb12345/initialize 5237488000333777727 Note: The scope attribute value is used
As we can see, the example shows a discrepancy with new Postgres Instance Database System Identifier and the stored value in ETCD. The ETCD value has the same value as the OLD Postgres Instance Database System identifier.
To fix this issue, you will need to clear out the Database System Identifier from the ETCD Data Store and then start Patroni. This will trigger within ETCD to reset the Database System Identifier attribute with the Database System Identifier from new Postgres Instance structure.
1. Clear ETCD stored value.
Note: Replace <yaml file directory> in the command with the full path of where the Patroni yaml file exists. In other words, patronictl -c <yaml file directory> remove <yaml scope attribute value>.
patronictl -c /data/patroni.yml remove pg_cluster_bb12345 +--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+ | Cluster | Member | Host | Role | State | TL. | Lag in MB | +-------------------------------------------------------------+---------------+---------------+---------+---------+-----------+ +--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+
Please confirm the cluster name to remove: pg_cluster_bb12345
You are about to remove all information in DCS for pg_cluster_bb12345, please type: “Yes I am aware”: Yes I am aware
Verify that data store value has been cleared out:
etcdctl get /service/pg_cluster_bb12345/initialize Error: 100: Key not found (/service/pg_cluster_bb12345) [23052]
2. Start Patroni and verify that the Leader is up and the Database System Identifier value from ETCD is identical tot the Postgres Database System Identifier.
systemctl start patroni
3. Review the Patroni cluster host list to make sure the leader is up:
patronictl -c <yaml file> list +--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+ | Cluster | Member | Host | Role | State | TL. | Lag in MB | +-------------------------------------------------------------+---------------+---------------+---------+---------+-----------+ | pg_cluster_bb12345 | pg_test_bb12345_pt3_536_h333_7347_0001 | 10.18.233.34 | Leader | running | 2 | 0 | +--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+ etcdctl get /service/pg_cluster_bb12345/initialize 6353552553353555259
1. ETCD is its own cluster with replication capabilities. When we reset the Database System Identifier in ETCD instance on Leader/Master, it should have replicated over to ETCD instances on the replicas but it is a good idea to verify. From the below command, it seems like ETCD replication has worked.
Leader host> etcdctl get /service/prototype-eu/initialize 6353552553353555259 Replica1 host> etcdctl get /service/prototype-eu/initialize 6353552553353555259 Replica1 host> etcdctl get /service/prototype-eu/initialize 6353552553353555259
2. Follow this step on every replica. We need to remove (or rename) the current old data structure on the replica host. Once removed or renamed, as Patroni is started, the cluster will realize that this is a replica and that the Postgres data structure doesn't exist.
It will then perform a backup of Leader Postgres data structure and bring it over to the replica. At this point in time, the current Postgres Database System Identifier will be identical to the ETCD value. therefore, Patroni should start up.
systemctl start patroni
3. Again, review your Patroni cluster host list to make sure the leader and both replica hosts are up.
patronictl -c <yaml file> list +--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+ | Cluster | Member | Host | Role | State | TL. | Lag in MB | +-------------------------------------------------------------+---------------+---------------+---------+---------+-----------+ | pg_cluster_bb12345 | pg_test_bb12345_pt3_536_h333_7347_0001 | 10.18.233.35 | | running | 2 | 0 | | pg_cluster_bb12345 | pg_test_bb12345_pt3_536_h333_7347_0001 | 10.18.233.34 | Leader | running | 2 | 0 | | pg_cluster_bb12345 | pg_test_bb12345_pt3_536_h333_7347_0001 | 10.18.233.36 | | running | 2 | 0 | +--------------------+----------------------------------------+---------------+---------------+---------+---------+-----------+
Everything is up and working now.