The gpexpand on GPv5.x failed due to failure to start the standby master instance


Article ID: 296534


Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

This issue might occur when expanding a Greenplum version 5.x cluster with a standby master configured.
During the "adding segment" stage, gpexpand restarts the database into master-only mode, and when it restarts back into normal mode, the standby master might not be able to start, with the error below:
20220630:01:09:20:088962 gpstart:gp-aio-01:gpadmin-[DEBUG]:-cmd had rc=0 completed=True halted=False
  stdout='[38514, 38515]
'
  stderr=''
20220630:01:09:21:088962 gpstart:gp-aio-01:gpadmin-[WARNING]:-Could not start standby master
20220630:01:09:21:088962 gpstart:gp-aio-01:gpadmin-[WARNING]:-Standby Master could not be started
20220630:01:09:21:088962 gpstart:gp-aio-01:gpadmin-[DEBUG]:-WorkerPool haltWork()
20220630:01:09:21:088962 gpstart:gp-aio-01:gpadmin-[DEBUG]:-[worker0] haltWork
When the standby master is functional, there should be 3 child processes forked by the postmaster. Since they start in sequence, their PIDs should be consecutive, as the example below shows:
$ ps -ef | grep 5432 | grep -v grep
gpadmin    8286      1  2 11:07 ?        00:00:00 /opt/greenplum_5.28.13/bin/postgres -D /data/master/master_5.28.13/gpdb_5.28.13_-1 -p 5432 --gp_dbid=6 --gp_num_contents_in_cluster=2 --silent-mode=true -i -M master --gp_contentid=-1 -x 0 -y -E
gpadmin    8329   8286  0 11:07 ?        00:00:00 postgres:  5432, master logger process
gpadmin    8330   8286  0 11:07 ?        00:00:00 postgres:  5432, startup process   recovering 000000010000000E00000008
gpadmin    8331   8286  0 11:07 ?        00:00:00 postgres:  5432, wal receiver process   streaming E/200004D8
As the error above shows, the script only detected 2 PIDs, which means one of the processes is missing. By checking the ps -ef output on the standby master, we can confirm that the missing process is the WAL receiver process.
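For example, the postmaster child processes on the standby master host can be checked remotely; the host name below is a placeholder:
$ ssh <standby_host> 'ps -ef | grep 5432 | grep -v grep'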

The pg_log of the standby master should tell us why it is unable to start. For this issue, it is because the requested WAL segment has already been removed:
2022-06-30 01:07:56.206495 AEST,,,p38516,th-1139710080,,,,0,,,seg-1,,,,,"LOG","00000","streaming replication successfully connected to primary, starting replication at 2/F8000000",,,,,,,0,,"gp_libpqwalreceiver.c",162,
2022-06-30 01:07:56.378948 AEST,,,p38516,th-1139710080,,,,0,,,seg-1,,,,,"ERROR","XX000","could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000020000003E has already been removed (gp_libpqwalreceiver.c:394)",,,,,,,0,,"gp_libpqwalreceiver.c",394,"Stack trace:
1    0x96598b postgres errstart (elog.c:521)
2    0x806ee8 postgres walrcv_receive (gp_libpqwalreceiver.c:363)
3    0x80aa09 postgres WalReceiverMain (walreceiver.c:378)
4    0x58a6d1 postgres AuxiliaryProcessMain (bootstrap.c:501)
5    0x7d839b postgres <symbol not found> (postmaster.c:7395)
6    0x7d9792 postgres <symbol not found> (postmaster.c:7488)
7    0x7efeb8287630 libpthread.so.0 <symbol not found> + 0xb8287630
8    0x7efeb76f3983 libc.so.6 __select + 0x13
9    0x7e0547 postgres <symbol not found> (postmaster.c:2349)
10   0x7e362a postgres PostmasterMain (postmaster.c:1533)
11   0x4cdbc7 postgres main (main.c:206)
12   0x7efeb7620555 libc.so.6 __libc_start_main + 0xf5
13   0x4ce17c postgres <symbol not found> + 0x4ce17c
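To locate this error, the removal message can be searched for in the standby master's pg_log directory; the host name and data directory path below are placeholders:
$ ssh <standby_host> "grep 'has already been removed' <standby_data_directory>/pg_log/*.csv"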


Environment

Product Version: 5.28

Resolution

When master to standby master replication is set up and the lag between the two nodes grows beyond wal_keep_segments (default: 5) WAL segments, at the next checkpoint the master will recycle the WAL files that have not yet been synced to the standby master.
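The standby WAL sync status and the current setting can be checked with the standard utilities before expanding, for example:
$ gpstate -f                     # display details and WAL sync status of the standby master
$ gpconfig -s wal_keep_segments  # show the current value on the master and segments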

At the beginning of gpexpand, the script puts Greenplum into master-only mode; in this mode, the standby does not start. All changes made on the master are saved into the WAL logs (in the pg_xlog folder).
One of the steps in gpexpand is to run VACUUM FULL on the master's system tables, the goal of which is to clean up the bloat on the catalog tables and reduce the size of the master data directory:
20220616:13:21:59:151161 gpexpand:hostname:gpadmin-[INFO]:-VACUUM FULL on the catalog tables
If the VACUUM FULL generates more than wal_keep_segments WAL segments, then at the next restart a checkpoint will remove the oldest WAL files, which have not yet been synced to the standby node, and cause this issue.
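As a rough check, the number of WAL segment files currently kept in the master's pg_xlog folder can be counted and compared against wal_keep_segments; this assumes $MASTER_DATA_DIRECTORY points to the master data directory:
$ ls -1 $MASTER_DATA_DIRECTORY/pg_xlog | grep -E '^[0-9A-F]{24}$' | wc -l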

We can work around this issue with any one of the options below:

Option#1:
- remove the standby master before doing gpexpand
- once the expansion is finished, add the standby master back 
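The commands for this option would typically look like the following; the standby host name is a placeholder:
$ gpinitstandby -r                     # remove the current standby master before gpexpand
$ gpinitstandby -s <standby_hostname>  # add the standby master back after the expansion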

!!! WARNING !!!:
If we choose this option, we lose redundancy during the expansion stage because no standby master is available. This creates a single point of failure if any accident happens on the master host, for example, a hardware issue that corrupts the master filesystem. In the worst case, if the master data is lost, the only option to restore the cluster is to re-create a new cluster and restore the data from a backup.

Option#2:
- increase wal_keep_segments temporarily before doing gpexpand, and set it back after the expansion
# gpconfig -c wal_keep_segments -v 1000 --skipvalidation
# gpstop -u
# gpconfig -s wal_keep_segments
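After the expansion completes, the parameter can be set back to its previous value (the default is 5, as noted above) and the configuration reloaded, for example:
# gpconfig -c wal_keep_segments -v 5 --skipvalidation
# gpstop -u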

Option#3:
- Do vacuum full manually in normal mode before expanding the cluster
- run gpexpand with the -V option to disable the VACUUM FULL
$ gpexpand --help | grep -A1 "^\-V"
-V | --novacuum
 Do not vacuum catalog tables before creating schema copy.
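A sketch of this option; the catalog table name and the expansion input file below are examples only:
$ psql -d template1 -c "VACUUM FULL pg_catalog.pg_attribute;"   # repeat for other catalog tables/databases as needed
$ gpexpand -V -i expansion_input_file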

Note: this issue only exists in Greenplum versions before 6.x. In Greenplum v6.x, the logic of gpexpand has changed: it no longer restarts the database into master-only mode or runs VACUUM FULL, so it will not trigger this issue.