gpexpand Fails to Create Template Directory on a Mirrorless Cluster

Products

VMware Tanzu Data Suite

Issue/Introduction

Issue Description

When running Greenplum Database on a mirrorless cluster (for example, on PowerFlex), the Greenplum HA Service Installation Guide recommends setting the following WAL-related GUCs:

max_replication_slots=0
wal_keep_segments=0
wal_level=minimal
max_wal_senders=0

However, with these settings in place, the first phase of a gpexpand cluster expansion (creating the segment template) fails with the following error:

$ gpexpand -i expand_conf.txt
...
20251216:14:55:20:032583 gpexpand:gpdb12:gpadmin-[INFO]:-Creating segment template
20251216:14:55:21:032583 gpexpand:gpdb12:gpadmin-[ERROR]:-gpexpand failed: ExecutionError: 'non-zero rc: 1' occurred.  Details: 'pg_basebackup -c fast -D /data/master/master_6.31.1/gpexpand_12162025_32583 -h gpdb12 -p 5432 --xlog -E ./db_dumps -E ./gpperfmon/data -E ./gpperfmon/logs -E ./promote -E ./db_analyze --target-gp-dbid 73785 --progress --verbose'  cmd had rc=1 completed=True halted=False
  stdout=''
  stderr='pg_basebackup: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 0)
'
Exiting...
20251216:14:55:21:032583 gpexpand:gpdb12:gpadmin-[ERROR]:-Please run 'gpexpand -r' to rollback to the original state.
20251216:14:55:21:032583 gpexpand:gpdb12:gpadmin-[INFO]:-Shutting down gpexpand...

Cause

Root Cause

The error is raised by pg_basebackup, which gpexpand runs during its first phase to copy the master data directory into a template directory for the new segments.
gpexpand invokes pg_basebackup with the --xlog option so that the WAL files required by the backup are included in the copy.
pg_basebackup connects to the server over the replication protocol. As documented in the PostgreSQL manual for pg_basebackup, max_wal_senders must therefore be set high enough to leave at least one WAL sender connection available for the backup.
When max_wal_senders is set to 0, the server rejects the connection, pg_basebackup fails, and the first phase of gpexpand aborts.
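
The condition described above can be detected before gpexpand is started. The following is a minimal sketch, not part of gpexpand: the function name check_wal_senders is hypothetical, and it assumes that psql can connect to the postgres database on the master as the current user.

```shell
# Hypothetical pre-flight check (not part of gpexpand): report whether the
# master can accept the replication connection that pg_basebackup needs.
check_wal_senders() {
    local value
    # SHOW returns the value in effect on the master, which is the instance
    # pg_basebackup connects to while creating the segment template.
    value=$(psql -At -d postgres -c "SHOW max_wal_senders") || return 2
    # Treat an empty or non-numeric answer as "could not determine".
    case "$value" in
        ''|*[!0-9]*) return 2 ;;
    esac
    if [ "$value" -eq 0 ]; then
        echo "max_wal_senders=0: gpexpand template creation is expected to fail" >&2
        return 1
    fi
    echo "max_wal_senders=$value: OK"
}
```

One possible use is to gate the expansion on the check, for example: check_wal_senders && gpexpand -i expand_conf.txt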

Resolution

Workaround

To allow gpexpand to complete successfully, perform the following steps.

Step 1: Roll Back the Failed gpexpand Operation

$ gpexpand -r

Step 2: Reset the WAL-Related GUCs

Reset all of the GUCs recommended in the HA service guide to their default values.
Reset all four parameters together; otherwise, the database may fail to restart (for example, wal_level cannot be minimal while max_wal_senders is greater than 0).

gpconfig -r max_replication_slots  --skipvalidation
gpconfig -r wal_keep_segments      --skipvalidation
gpconfig -r wal_level              --skipvalidation
gpconfig -r max_wal_senders        --skipvalidation
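
Because the four parameters have to change as one unit, the reset can be wrapped in a small helper that stops at the first failure. This is only a sketch: the function name reset_wal_gucs is hypothetical, and it runs the same gpconfig commands shown above in the same order.

```shell
# Hypothetical helper: reset the four WAL-related GUCs as one unit and stop
# at the first failure, so the cluster is not restarted with a partial change.
reset_wal_gucs() {
    local guc
    for guc in max_replication_slots wal_keep_segments wal_level max_wal_senders; do
        gpconfig -r "$guc" --skipvalidation || {
            echo "failed to reset $guc; resolve this before restarting" >&2
            return 1
        }
    done
}
```

If the helper returns a non-zero status, resolve the reported error and rerun it before restarting the database.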

Step 3: Restart the Database

gpstop -M fast
gpstart

Step 4: Retry gpexpand

$ gpexpand -i expand_conf.txt

Step 5: Restore the Recommended GUCs

After confirming that the first phase of gpexpand completes successfully, set the GUCs back to the recommended values and restart the database to apply them.

gpconfig -c max_replication_slots -v 0       --skipvalidation
gpconfig -c wal_keep_segments     -v 0       --skipvalidation
gpconfig -c wal_level             -v minimal --skipvalidation
gpconfig -c max_wal_senders       -v 0       --skipvalidation

gpstop -M fast
gpstart
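
After the restart, the restored values can be confirmed on the master. The sketch below is an illustration under stated assumptions: the function name verify_wal_gucs is hypothetical, it assumes psql can connect to the postgres database on the master, and it checks the master only (gpconfig -s can be used to view segment values).

```shell
# Hypothetical helper: compare the master's live settings with the values
# recommended for a mirrorless cluster and report every mismatch.
verify_wal_gucs() {
    local pair name expected actual failed=0
    for pair in max_replication_slots=0 wal_keep_segments=0 wal_level=minimal max_wal_senders=0; do
        name=${pair%%=*}       # text before the first '='
        expected=${pair#*=}    # text after the first '='
        actual=$(psql -At -d postgres -c "SHOW $name") || return 2
        if [ "$actual" != "$expected" ]; then
            echo "$name is '$actual', expected '$expected'" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

A return status of 0 means all four settings match; a status of 1 means at least one mismatch was printed to standard error.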

Step 6: Complete the Remaining Expansion Steps

Continue with the remaining expansion and redistribution steps as documented in the Greenplum Database Expansion and Redistribution Guide.

Additional Information

The product team plans to address this behavior in future releases, including the following improvements:

Item 1: gpexpand logic enhancement

In future releases, gpexpand will validate the relevant GUC settings before doing any work. If one of them is set to 0 (indicating a mirrorless cluster), gpexpand will fail at the very beginning with a clear error message, guidance, and recommended actions.

Item 2: Documentation update

The documentation team will update the gpexpand and HA service guides to explicitly describe the required workaround when running gpexpand on mirrorless clusters.