pg_auto_failover official documentation does not provide steps for performing a major PostgreSQL version upgrade. This guide documents the procedure as validated in our internal lab environment.pg_hba.conf, and authentication methods.pg_auto_failover cluster setup), the recommended alternative is to provision a fresh cluster running PostgreSQL 17 managed by pg_auto_failover and use pg_dump/pg_dumpall and pg_restore to migrate the data. This avoids touching the existing production cluster until the new one is fully validated.| Item | Details |
|---|---|
| Cluster topology | 1 monitor + 1 primary + 1 secondary |
| Upgrade method | pg_upgrade --link on monitor and primary; secondary rebuilt via pg_basebackup |
| Target version | PostgreSQL 17 (pg_auto_failover retained as HA manager) |
pg_auto_failover service on secondary and primarypg_auto_failover service on monitorpg_upgrade primary (PG 15 → PG 17)pg_upgrade monitor (PG 15 → PG 17)pg_auto_failover service on monitorpg_auto_failover service on primary — verify transitions to primarypg_basebackupNote: Why primary before monitor? The monitor only stores cluster state metadata — it does not hold application data. Upgrading the primary first allows full data verification before touching the monitor. Both orderings are valid, but upgrading the primary first means any data issue is caught before the monitor is changed, making rollback simpler.
Step 1 — Put Secondary in Maintenance Mode
Verify on monitor:
# On monitor — as postgres user
/opt/vmware/postgres/15/bin/pg_autoctl enable maintenance \\
--pgdata /var/lib/pgsql/data \\
--name node_2
/opt/vmware/postgres/15/bin/pg_autoctl show state --pgdata /var/lib/pgsql/dataStep 2 — Stop pg_auto_failover Service on Secondary and Primary
If pg_auto_failover runs as a systemd service (check with systemctl status pgautofailover.service):
# On secondary — as root
systemctl stop pgautofailover.service
# On primary — as root
systemctl stop pgautofailover.service
If there is no systemd service and pg_autoctl was started manually:
# On secondary — as postgres user
/opt/vmware/postgres/15/bin/pg_autoctl stop --pgdata /var/lib/pgsql/data
# On primary — as postgres user
/opt/vmware/postgres/15/bin/pg_autoctl stop --pgdata /var/lib/pgsql/data
Verify all processes have stopped before proceeding:
pgrep -a pg_autoctl # should return nothing
pgrep -a postgres # should return nothing
Step 3 — Stop pg_auto_failover Service on Monitor
# On monitor — as root
systemctl stop pgautofailover.service
Step 4 — Upgrade Primary with pg_upgrade
# On primary — as postgres user
cd ~
# Initialize new PG 17 data directory (no checksums — must match PG 15)
/opt/vmware/postgres/17/bin/initdb \\
-D /var/lib/pgsql/17/data \\
--encoding=UTF8 --locale=en_US.UTF-8
# Dry run — verify compatibility first
/opt/vmware/postgres/17/bin/pg_upgrade \\
--old-datadir=/var/lib/pgsql/data \\
--new-datadir=/var/lib/pgsql/17/data \\
--old-bindir=/opt/vmware/postgres/15/bin \\
--new-bindir=/opt/vmware/postgres/17/bin \\
--link --check
Fix any errors reported before proceeding.
# Run actual upgrade
/opt/vmware/postgres/17/bin/pg_upgrade \\
--old-datadir=/var/lib/pgsql/data \\
--new-datadir=/var/lib/pgsql/17/data \\
--old-bindir=/opt/vmware/postgres/15/bin \\
--new-bindir=/opt/vmware/postgres/17/bin \\
--link
Step 5 — Update Primary: Binaries and SSL Files
SSL Files: pg_upgrade does not carry SSL certificates to the new data directory. Back up server.crt, server.key, root.crt (if present), and any custom cert files from the old data directory before swapping directories. Restore them to the new $PGDATA with correct permissions before starting PostgreSQL. Verify that the permissions on the restored files match those from the original data directory.
Step 6 — Verify Data on Primary
# Swap directories
mv /var/lib/pgsql/data /var/lib/pgsql/data_pg15_bak
mv /var/lib/pgsql/17/data /var/lib/pgsql/data
# Update pg_autoctl config to point to PG 17 binaries
/opt/vmware/postgres/17/bin/pg_autoctl config set \\
postgresql.pg_ctl /opt/vmware/postgres/17/bin/pg_ctl \\
--pgdata /var/lib/pgsql/data
# Start PG 17 manually to verify data
/opt/vmware/postgres/17/bin/pg_ctl -D /var/lib/pgsql/data start
# Verify version and data
/opt/vmware/postgres/17/bin/psql -c "SELECT version();"
/opt/vmware/postgres/17/bin/psql -c "\\dt"
# Run post-upgrade tasks generated by pg_upgrade
/opt/vmware/postgres/17/bin/psql -f update_extensions.sql -U postgres
/opt/vmware/postgres/17/bin/vacuumdb --all --analyze-in-stages -U postgres
# Stop before handing back to pg_autoctl
/opt/vmware/postgres/17/bin/pg_ctl -D /var/lib/pgsql/data stop
Step 7 — Upgrade Monitor with pg_upgrade
Critical: The --new-options flag is required to prevent the pg_auto_failover healthcheck background worker from starting inside the temporary pg_upgrade server. Without it, the worker holds an open connection to template1 and blocks DROP DATABASE, causing pg_upgrade to hang indefinitely.
# On monitor — as postgres user
cd ~
# Initialize new PG 17 data directory (no checksums — must match PG 15)
/opt/vmware/postgres/17/bin/initdb \\
-D /var/lib/pgsql/17/data \\
--encoding=UTF8 --locale=en_US.UTF-8
# Dry run — verify compatibility first
/opt/vmware/postgres/17/bin/pg_upgrade \\
--old-datadir=/var/lib/pgsql/data \\
--new-datadir=/var/lib/pgsql/17/data \\
--old-bindir=/opt/vmware/postgres/15/bin \\
--new-bindir=/opt/vmware/postgres/17/bin \\
--new-options "-c shared_preload_libraries=pgautofailover -c max_worker_processes=0" \\
--link --check
Fix any errors reported before proceeding.
# Run actual upgrade
/opt/vmware/postgres/17/bin/pg_upgrade \\
--old-datadir=/var/lib/pgsql/data \\
--new-datadir=/var/lib/pgsql/17/data \\
--old-bindir=/opt/vmware/postgres/15/bin \\
--new-bindir=/opt/vmware/postgres/17/bin \\
--new-options "-c shared_preload_libraries=pgautofailover -c max_worker_processes=0" \\
--link
# Swap directories
mv /var/lib/pgsql/data /var/lib/pgsql/data_pg15_bak
mv /var/lib/pgsql/17/data /var/lib/pgsql/data
Step 8 — Update Monitor: Binaries, Service File, SSL Files, pg_hba
SSL Files — pg_upgrade does NOT copy SSL certificates to the new data directory. Before swapping the data directories, back up the following files from the old data directory:
server.crt (Server certificate)server.key (Private key)root.crt (CA certificate, if present)postgresql.conf via ssl_cert_file, ssl_key_file, ssl_ca_file.After swapping directories, restore these files to the new $PGDATA. Verify that the permissions on the restored files match those from the original data directory before starting PostgreSQL.
# As postgres user — update pg_autoctl config to point to PG 17
/opt/vmware/postgres/17/bin/pg_autoctl config set \\
postgresql.pg_ctl /opt/vmware/postgres/17/bin/pg_ctl \\
--pgdata /var/lib/pgsql/data
# As root — update service file to use PG 17 binaries
sed -i 's|/opt/vmware/postgres/15/bin|/opt/vmware/postgres/17/bin|g' \\
/etc/systemd/system/pgautofailover.service
systemctl daemon-reload
# As postgres user — verify pg_hba entries for primary and secondary after swap
# pg_upgrade carries over pg_hba.conf, but verify the pg_autoctl-managed hostssl entries are present. Add them if missing.
cat >> /var/lib/pgsql/data/pg_hba.conf << 'EOF'
hostssl pg_auto_failover autoctl_node <primary-ip>/32 trust
hostssl pg_auto_failover autoctl_node <secondary-ip>/32 trust
hostssl all all <primary-ip>/32 trust
hostssl all all <secondary-ip>/32 trust
EOF
Step 9 — Start pg_auto_failover Service on Monitor
# Start monitor (as root)
systemctl start pgautofailover.service
systemctl status pgautofailover.service
Step 10 — Fix System Identifier Mismatch
This is a known issue. pg_upgrade generates a new system identifier for the upgraded cluster. The pg_autoctl state file and the monitor’s node table both retain the old PG 15 system identifier and must be updated manually before pg_autoctl will function correctly.
Reference: hapostgres/pg_auto_failover #506
10a — Get the new PG 17 system identifier from the primary
# On primary — as postgres user
NEW_SYSID=$(/opt/vmware/postgres/17/bin/pg_controldata /var/lib/pgsql/data \\
| grep "system identifier" | awk '{print $NF}')
echo "New system identifier: $NEW_SYSID"
10b — Update the monitor’s stored identifier
# On monitor — as postgres user
/opt/vmware/postgres/17/bin/psql -d pg_auto_failover -c \\
"UPDATE pgautofailover.node SET sysidentifier = $NEW_SYSID WHERE nodehost = 'first-primary';"
# Confirm
/opt/vmware/postgres/17/bin/psql -d pg_auto_failover -c \\
"SELECT nodename, nodehost, sysidentifier FROM pgautofailover.node;"
10c — Recreate the pg_autoctl state file on primary
# On primary — re-register to regenerate the state file
/opt/vmware/postgres/17/bin/pg_autoctl create postgres \\
--pgdata /var/lib/pgsql/data \\
--pgport 5432 \\
--pgctl /opt/vmware/postgres/17/bin/pg_ctl \\
--hostname first-primary \\
--auth trust \\
--ssl-self-signed \\
--monitor "postgres://autoctl_node@<monitor-ip>:5432/pg_auto_failover?sslmode=require"
Step 11 — Start pg_auto_failover Service on Primary
If pg_auto_failover runs as a systemd service:
# On primary — as root
systemctl start pgautofailover.service
systemctl status pgautofailover.service
If there is no systemd service, start the process in the foreground to observe output:
# On primary — as postgres user
/opt/vmware/postgres/17/bin/pg_autoctl run --pgdata /var/lib/pgsql/data
Expected: node state transitions from wait_primary → primary.
Confirm on the monitor:
# On monitor — as postgres user
/opt/vmware/postgres/17/bin/pg_autoctl show state --pgdata /var/lib/pgsql/data
Step 12 — Rebuild Secondary from Upgraded Primary
The secondary cannot be upgraded in-place. It must be rebuilt as a fresh physical replica of the now-PG 17 primary. pg_autoctl handles this automatically via pg_basebackup during create postgres.
# On secondary — as postgres user
# Back up old PG 15 data (do not delete until cluster is verified)
mv /var/lib/pgsql/data /var/lib/pgsql/data_pg15_bak
# Update pg_autoctl config binary path (if config file still exists)
/opt/vmware/postgres/17/bin/pg_autoctl config set \\
postgresql.pg_ctl /opt/vmware/postgres/17/bin/pg_ctl \\
--pgdata /var/lib/pgsql/data
# Re-register as secondary — pg_autoctl will run pg_basebackup internally
/opt/vmware/postgres/17/bin/pg_autoctl create postgres \\
--pgdata /var/lib/pgsql/data \\
--pgport 5432 \\
--pgctl /opt/vmware/postgres/17/bin/pg_ctl \\
--hostname first-secondary \\
--auth trust \\
--ssl-self-signed \\
--monitor "postgres://autoctl_node@<monitor-ip>:5432/pg_auto_failover?sslmode=require" \\
--run
Step 13 — Verify Final Cluster State
Expected output:
# On monitor — as postgres user
/opt/vmware/postgres/17/bin/pg_autoctl show state --pgdata /var/lib/pgsql/data
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State
-------+------+------------------------+----------+------------+----------------+----------------
node_2 | 2 | first-secondary:5432 | ... | read-only | secondary | secondary
Post-Upgrade Cleanup
Once the cluster has been running stably for a few days:
# On primary and monitor (after verifying upgrade is stable)
# Run the script generated by pg_upgrade to remove old PG 15 data files
~/delete_old_cluster.sh
# Or manually:
rm -rf /var/lib/pgsql/data_pg15_bak
The guide is for a 1 primary + 1 secondary + 1 monitor architecture, for a multiple secondary node cluster, the upgrade process is the same — the secondary rebuild step is repeated for each of the secondary nodes, with no difference in procedure between them.
pg_auto_failover does not provide an official step-by-step major upgrade guide. The following GitHub issues are relevant and should be referenced:
pg_auto_failover maintainers.pg_upgrade changes the PostgreSQL system identifier. This breaks the pg_autoctl state file and monitor node registration. Manual patching is required (see Step 9 below).pg_upgrade on the monitor node — When upgrading the monitor, pg_upgrade starts a temporary internal PostgreSQL server to perform the schema migration. Because the old monitor cluster has pgautofailover registered in shared_preload_libraries, the healthcheck background worker starts automatically inside this temporary server and immediately opens a connection to template1. This blocks pg_upgrade’s DROP DATABASE template1 step, causing the upgrade to hang indefinitely with no visible error. Workaround: Pass -new-options "-c shared_preload_libraries=pgautofailover -c max_worker_processes=0" to pg_upgrade. This prevents all background workers from starting in the temporary upgrade server. This is required only for the monitor node upgrade. Reference: hapostgres/pg_auto_failover #506 — comment by @WoSan7000pg_upgrade (affects primary node) — pg_upgrade performs a fresh initdb for the new cluster, which assigns a new system_identifier to the upgraded data directory. The pg_auto_failover agent (pg_autoctl) stores the old system identifier in its local state file and the monitor stores it in the pgautofailover.node table. After upgrading, the state file and the monitor record must be updated manually with the new identifier. Reference: hapostgres/pg_auto_failover #506 — comment by @WoSan7000