How to open Postgres disaster recovery database in Read Write mode

Products

VMware Tanzu PostgreSQL

Issue/Introduction

This guide will walk you through the process of testing failover using repmgr with PostgreSQL High Availability (HA) configurations. The procedure involves pausing replication, promoting a standby instance to primary, performing data loading tests, and then either reverting or recreating the standby instance depending on the outcome of the tests.

Prerequisites

You have a primary PostgreSQL instance and a standby PostgreSQL instance already set up with repmgr configured for HA.
You have admin access to PostgreSQL (psql) on both the primary and standby instances.
The PostgreSQL environment is configured with wal_log_hints enabled or data checksums enabled on the primary to support pg_rewind.

1. Set the Primary Instance to Read-Only Mode

The primary instance will be set to read-only mode during this process. This ensures that any changes made to the primary after the promotion of the standby instance won’t conflict when the standby is reverted back later.

Steps:

1.1. On the primary instance, log into psql as an admin user:

This command makes the primary instance read-only. Any new transactions on the primary will be blocked.

1.2. After making this change, the configuration is reloaded automatically.

2. Promote the Standby Instance

To simulate a failover, the standby instance must be promoted to become the primary. This will allow you to perform your tests without shutting down the current primary.

Steps:

2.1. On the standby instance host, execute the following command to promote the standby:

Important Caution: This promotion will result in both the original primary and the promoted standby becoming active primary instances. This is expected since we are not shutting down the original primary instance during testing.

Note: We cannot use the repmgr standby promote command here because it requires the original primary to be down. Instead, pg_ctl promote directly promotes the standby instance.

3. Perform Data Loading Tests on the Promoted Standby

With the standby promoted to primary, you can now proceed with your data loading tests.

Steps:

3.1. On the promoted standby instance, perform the necessary data loading and testing steps (e.g., inserting data, running queries, etc.).

4. Revert the Promoted Standby (if necessary)

After completing the tests, you may want to revert the promoted standby back to a standby role. This involves using pg_rewind to sync the data with the primary.

4a. Revert the Standby Using `pg_rewind` (Recommended)

To revert the promoted standby, you'll use pg_rewind. However, this requires that either wal_log_hints or data checksums be enabled on the primary instance. This is because pg_rewind relies on these features to work correctly.

Steps:

4a.1. On the standby instance host, stop the promoted standby instance:

4a.2. Rejoin the standby to the primary and perform the rewind operation:

This command will sync the standby with the primary instance.

5. Revert the Primary Instance to Write-Enabled Mode

Once the promoted standby is reverted, you can return the primary instance to its normal writable state.

Steps:

5.1. On the primary instance, log into psql and execute the following:

This will make the primary instance writable again.

6. If `pg_rewind` Fails, Recreate the Standby

In case pg_rewind fails due to missing WAL segments or other issues (such as missing data checksums or wal_log_hints being off), you’ll need to recreate the standby instance completely.

Steps:

6.1. On the primary instance, log into psql and run the following to ensure the primary is write-enabled:

6.2. On the standby instance host, stop the standby:

6.3. Clone the standby from the primary:

6.4. Start the newly cloned standby:

6.5. Register the new standby with repmgr:

7. Verify Cluster Status

Once all the above steps are completed, verify the status of the cluster to ensure everything is running normally.

Steps:

7.1. On any node (primary or standby), run:

This will show you the current state of the cluster and confirm that everything is functioning as expected.

Environment

VMware Postgres 15.6.0

Resolution

Conclusion

By following these steps, you can simulate a failover in your PostgreSQL cluster, test your failover mechanisms, and revert back to the original configuration if necessary. Whether you use pg_rewind to restore the original standby or recreate the standby from scratch, this procedure ensures that your PostgreSQL high availability setup is resilient and functional.

Always make sure to take proper backups before performing any failover or reconfiguration steps in a production environment.