This issue might occur when expanding a Greenplum 5.x cluster that has a standby master configured.
During the "adding segment" stage, gpexpand restarts the database in master-only mode. When it then restarts the database in normal mode, the standby master might fail to start, and gpstart logs the following:
20220630:01:09:20:088962 gpstart:gp-aio-01:gpadmin-[DEBUG]:-cmd had rc=0 completed=True halted=False
stdout='[38514, 38515]
'
stderr=''
20220630:01:09:21:088962 gpstart:gp-aio-01:gpadmin-[WARNING]:-Could not start standby master
20220630:01:09:21:088962 gpstart:gp-aio-01:gpadmin-[WARNING]:-Standby Master could not be started
20220630:01:09:21:088962 gpstart:gp-aio-01:gpadmin-[DEBUG]:-WorkerPool haltWork()
20220630:01:09:21:088962 gpstart:gp-aio-01:gpadmin-[DEBUG]:-[worker0] haltWork
When the standby master is functional, the postmaster forks three child processes. Because they start in sequence, their PIDs are normally consecutive, as the example below shows:
$ ps -ef | grep 5432 | grep -v grep
gpadmin 8286 1 2 11:07 ? 00:00:00 /opt/greenplum_5.28.13/bin/postgres -D /data/master/master_5.28.13/gpdb_5.28.13_-1 -p 5432 --gp_dbid=6 --gp_num_contents_in_cluster=2 --silent-mode=true -i -M master --gp_contentid=-1 -x 0 -y -E
gpadmin 8329 8286 0 11:07 ? 00:00:00 postgres: 5432, master logger process
gpadmin 8330 8286 0 11:07 ? 00:00:00 postgres: 5432, startup process recovering 000000010000000E00000008
gpadmin 8331 8286 0 11:07 ? 00:00:00 postgres: 5432, wal receiver process streaming E/200004D8
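The same check can be scripted. The following is a minimal sketch, not part of any Greenplum tooling: the function name missing_standby_procs is hypothetical, and it assumes the process titles look like the ones in the example above ("postgres: <port>, <process name>"). It reads ps -ef output on standard input and prints the name of each expected child process that is not present.

```shell
# Sketch: report which expected standby master child processes are absent.
# Reads `ps -ef` output on stdin; the standby port is the first argument.
# Assumes process titles of the form "postgres: 5432, wal receiver process ...".
missing_standby_procs() {
    port="$1"
    ps_output=$(cat)
    for name in "master logger process" "startup process" "wal receiver process"; do
        # Print the process name only when no matching line exists.
        if ! printf '%s\n' "$ps_output" | grep -q "postgres: *${port}, ${name}"; then
            printf '%s\n' "$name"
        fi
    done
}
```

On the standby master host it could be run as "ps -ef | missing_standby_procs 5432". Empty output means all three processes were found.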
As the gpstart output above shows, the script detected only two PIDs, which means one of the processes is missing. Checking the ps -ef output on the standby master confirms that the missing process is the wal receiver process.
The pg_log of the standby master shows why it could not start. In this case, the WAL segment it requested has already been removed:
2022-06-30 01:07:56.206495 AEST,,,p38516,th-1139710080,,,,0,,,seg-1,,,,,"LOG","00000","streaming replication successfully connected to primary, starting replication at 2/F8000000",,,,,,,0,,"gp_libpqwalreceiver.c",162,
2022-06-30 01:07:56.378948 AEST,,,p38516,th-1139710080,,,,0,,,seg-1,,,,,"ERROR","XX000","could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000020000003E has already been removed (gp_libpqwalreceiver.c:394)",,,,,,,0,,"gp_libpqwalreceiver.c",394,"Stack trace:
1 0x96598b postgres errstart (elog.c:521)
2 0x806ee8 postgres walrcv_receive (gp_libpqwalreceiver.c:363)
3 0x80aa09 postgres WalReceiverMain (walreceiver.c:378)
4 0x58a6d1 postgres AuxiliaryProcessMain (bootstrap.c:501)
5 0x7d839b postgres <symbol not found> (postmaster.c:7395)
6 0x7d9792 postgres <symbol not found> (postmaster.c:7488)
7 0x7efeb8287630 libpthread.so.0 <symbol not found> + 0xb8287630
8 0x7efeb76f3983 libc.so.6 __select + 0x13
9 0x7e0547 postgres <symbol not found> (postmaster.c:2349)
10 0x7e362a postgres PostmasterMain (postmaster.c:1533)
11 0x4cdbc7 postgres main (main.c:206)
12 0x7efeb7620555 libc.so.6 __libc_start_main + 0xf5
13 0x4ce17c postgres <symbol not found> + 0x4ce17c
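To find this error quickly, the standby master log files can be searched for the message shown above. The following is a minimal sketch: the function name removed_wal_segments is hypothetical, and the location of the log files passed to it is an assumption that depends on the installation. It prints each distinct WAL segment name reported as already removed.

```shell
# Sketch: list WAL segments reported as "has already been removed".
# Arguments are standby master log files (their location is an assumption).
removed_wal_segments() {
    grep -h 'requested WAL segment [0-9A-F]* has already been removed' "$@" |
        sed 's/.*requested WAL segment \([0-9A-F]*\) has already been removed.*/\1/' |
        sort -u
}
```

For example, assuming the logs are CSV files under pg_log in the standby master data directory, it could be called as "removed_wal_segments /path/to/standby/pg_log/*.csv". For the log excerpt above it would print 00000001000000020000003E.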