gpactivatestandby is unable to bring up the cluster and errors with "FATAL:  the database system is starting up"

Article ID: 392067


Products

VMware Tanzu Data Suite, VMware Tanzu Greenplum

Issue/Introduction

In the following scenario, where the database is shut down and cannot be started on the normal coordinator, a forced activation of the standby can leave the standby coordinator running in utility mode:

  • The database and cluster are operating in normal mode and the standby coordinator is in sync: 
[gpadmin@cdw ~]$ gpstate -e
20250521:10:51:40:439052 gpstate:cdw:gpadmin-[INFO]:-Starting gpstate with args: -e
20250521:10:51:40:439052 gpstate:cdw:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 7.3.3 build commit:ce20fc237ed7520a2476c96ed7d9edddea136932'
20250521:10:51:40:439052 gpstate:cdw:gpadmin-[INFO]:-coordinator Greenplum Version: 'PostgreSQL 12.12 (Greenplum Database 7.3.3 build commit:ce20fc237ed7520a2476c96ed7d9edddea136932) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22), 64-bit compiled on Dec 18 2024 05:34:04 Bhuvnesh C.'
20250521:10:51:40:439052 gpstate:cdw:gpadmin-[INFO]:-Obtaining Segment details from coordinator...
20250521:10:51:40:439052 gpstate:cdw:gpadmin-[INFO]:-Gathering data from segments...
20250521:10:51:41:439052 gpstate:cdw:gpadmin-[INFO]:-----------------------------------------------------
20250521:10:51:41:439052 gpstate:cdw:gpadmin-[INFO]:-Segment Mirroring Status Report
20250521:10:51:41:439052 gpstate:cdw:gpadmin-[INFO]:-----------------------------------------------------
20250521:10:51:41:439052 gpstate:cdw:gpadmin-[INFO]:-All segments are running normally
[gpadmin@cdw ~]$ gpstate -f
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-Starting gpstate with args: -f
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 7.3.3 build commit:ce20fc237ed7520a2476c96ed7d9edddea136932'
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-coordinator Greenplum Version: 'PostgreSQL 12.12 (Greenplum Database 7.3.3 build commit:ce20fc237ed7520a2476c96ed7d9edddea136932) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22), 64-bit compiled on Dec 18 2024 05:34:04 Bhuvnesh C.'
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-Obtaining Segment details from coordinator...
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-Standby coordinator details
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-----------------------
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-   Standby address          = scdw
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-   Standby data directory   = /data/coordinator/gpseg-1
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-   Standby port             = 5432
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-   Standby PID              = 3387966
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:-   Standby status           = Standby host passive
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:--------------------------------------------------------------
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:--pg_stat_replication
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:--------------------------------------------------------------
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:--WAL Sender State: streaming
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:--Sync state: sync
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:--Sent Location: 0/A0000060
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:--Flush Location: 0/A0000060
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:--Replay Location: 0/A0000060
20250521:10:51:45:439094 gpstate:cdw:gpadmin-[INFO]:--------------------------------------------------------------
  • Stop/shut down the database using the following command: 
[gpadmin@cdw ~]$ gpstop -af
  • Log in to the standby coordinator host and activate the standby coordinator. The "-f" (force) option is required because the database is not up and running:
[gpadmin@scdw ~]$ gpactivatestandby -f
:
:
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[DEBUG]:-Check if Coordinator is already running...
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[WARNING]:-****************************************************************************
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[WARNING]:-Coordinator-only start requested. If a standby is configured, this command
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[WARNING]:-may lead to a split-brain condition and possible unrecoverable data loss.
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[WARNING]:-Maintenance mode should only be used with direction from Greenplum Support.
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[WARNING]:-****************************************************************************
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[DEBUG]:-Running Command: $GPHOME/sbin/gpconfig_helper.py --file /data/coordinator/gpseg-1/postgresql.conf --get-parameter gp_segment_configuration_file
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[INFO]:-Starting Coordinator instance in admin mode
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[INFO]:-CoordinatorStart pg_ctl cmd is env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /data/coordinator/gpseg-1 -l /data/coordinator/gpseg-1/log/startup.log -w -t 600 -o " -c gp_role=utility " start
20250521:11:11:43:3394218 gpstart:scdw:gpadmin-[DEBUG]:-Running Command: env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /data/coordinator/gpseg-1 -l /data/coordinator/gpseg-1/log/startup.log -w -t 600 -o " -c gp_role=utility " start
20250521:11:11:44:3394218 gpstart:scdw:gpadmin-[INFO]:-Obtaining Greenplum Coordinator catalog information
20250521:11:11:44:3394218 gpstart:scdw:gpadmin-[INFO]:-Obtaining Segment details from coordinator...
20250521:11:11:44:3394218 gpstart:scdw:gpadmin-[DEBUG]:-Connecting to db template1 on host localhost
20250521:11:11:44:3394218 gpstart:scdw:gpadmin-[ERROR]:-gpstart failed.  exiting...
Traceback (most recent call last):
  File "/usr/local/greenplum-db-7.3.3/lib/python/gppylib/mainUtils.py", line 361, in simple_main_locked
    exitCode = commandObject.run()
  File "/usr/local/greenplum-db-7.3.3/bin/gpstart", line 120, in run
    self._startCoordinator()
  File "/usr/local/greenplum-db-7.3.3/bin/gpstart", line 435, in _startCoordinator
    self.gparray = GpArray.initFromCatalog(self.dburl, utility=True)
  File "/usr/local/greenplum-db-7.3.3/lib/python/gppylib/gparray.py", line 990, in initFromCatalog
    with closing(dbconn.connect(dbURL, utility)) as conn:
  File "/usr/local/greenplum-db-7.3.3/lib/python/gppylib/db/dbconn.py", line 238, in connect
    conn = psycopg2.connect(**conninfo)
  File "/usr/lib64/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  the database system is starting up
DETAIL:  last replayed record at 0/A0000208

'
  stderr=''

The standby coordinator is now running in utility mode:

[gpadmin@scdw ~]$ ps -aef | egrep -- -D
gpadmin  3394230       1  0 11:11 ?        00:00:00 /usr/local/greenplum-db-7.3.3/bin/postgres -D /data/coordinator/gpseg-1 -c gp_role=utility

Environment

Greenplum 7.x

Cause

An upstream (PostgreSQL) change made the command "pg_ctl start -w" rely on the postmaster status recorded in the "postmaster.pid" file to determine whether the postmaster has started.

The status waited for can be either "ready" or "standby" (in upstream). Either status causes "pg_ctl start -w" to return.

This status is not cleared after server shutdown. So if a server was once a standby but later becomes a primary, its existing "postmaster.pid" file may still contain the "standby" status, and "pg_ctl start -w" does not wait for recovery to finish.

This makes "gpstart" query the server while it is still recovering, hence the error "FATAL:  the database system is starting up".

In contrast, Greenplum 6.x retains the old logic, where "pg_ctl -w" repeatedly tests whether a connection can be made (`test_postmaster_connection`) before returning, so the issue does not occur there.
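The postmaster status that "pg_ctl start -w" inspects is the last line of "postmaster.pid". The sketch below illustrates the mechanism using a fabricated pid file; the field values are illustrative only, not taken from a real cluster:

```shell
# Build a file that mimics a postmaster.pid left behind by a standby.
# A real file holds: PID, data directory, start time, port, socket
# directory, listen address, shared-memory key, and, on the last line,
# the postmaster status (on a real server the status values are padded
# with trailing spaces).
pid_file=$(mktemp)
printf '%s\n' \
  '3394230' \
  '/data/coordinator/gpseg-1' \
  '1747821103' \
  '5432' \
  '/tmp' \
  'localhost' \
  '  5432001         1' \
  'standby' > "$pid_file"

# "pg_ctl start -w" reads this last line; because it already says
# "standby", it returns immediately instead of waiting for recovery.
tail -n 1 "$pid_file"
rm -f "$pid_file"
```

On a live system, the same check against the real coordinator data directory (for example, `tail -n 1 /data/coordinator/gpseg-1/postmaster.pid`) shows which status "pg_ctl" would act on.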

Resolution

Workaround:

Stop the coordinator on the standby coordinator host after the failed gpactivatestandby command:

gpstop -am

Start the database in normal mode:

gpstart -a

Fix

The code fix will be available in Greenplum 7.5 and above.