Full recovery fails when hostname and address are on different networks in Greenplum

Article ID: 296320


Products

VMware Tanzu Greenplum

Issue/Introduction

The method for running a full recovery (gprecoverseg -F) changed between Pivotal Greenplum 5.x and 6.x. In Greenplum 6.x, a down segment is fully restored with pg_basebackup rather than with file replication, which speeds up full recovery.

However, if the hostname and address fields for a segment differ in gp_segment_configuration, full recovery uses the hostname rather than the address to run pg_basebackup. This configuration is common when query traffic is routed over a faster VIP network while other traffic travels over an external network. You can observe this behavior in the following log messages:
  • gp_segment_configuration
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | datadir
------+---------+------+----------------+------+--------+-------+----------+---------+--------------------------
 1142 |     372 | m    | m              | n    | d      | 41000 | sdw1     | sdw1-2  | /data01/mirror/gpseg372
  194 |     192 | m    | p              | n    | d      | 40000 | sdw1     | sdw1-1  | /data01/primary/gpseg192
(2 rows)
  • gprecoverseg log
Continue with segment recovery procedure Yy|Nn (default=N):
> y
20191226:16:10:43:159051 gprecoverseg:sdw1:gpadmin-[INFO]:-2 segment(s) to recover
20191226:16:10:43:159051 gprecoverseg:sdw1:gpadmin-[INFO]:-Ensuring 2 failed segment(s) are stopped
20191226:16:10:43:159051 gprecoverseg:sdw1:gpadmin-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20191226:16:10:44:159051 gprecoverseg:sdw1:gpadmin-[INFO]:-Validating remote directories
20191226:16:10:44:159051 gprecoverseg:sdw1:gpadmin-[INFO]:-Configuring new segments
sdw1 (dbid 194):
sdw1 (dbid 1142):
20191226:16:10:45:159051 gprecoverseg:sdw1:gpadmin-[CRITICAL]:-Error occurred: Error Executing Command:
Command was: 'ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 sdw1 ". /usr/local/greenplum-db/./greenplum_path.sh; $GPHOME/bin/lib/gpconfigurenewsegment -c \"/data01/primary/gpseg192:40000:false:false:194:192:sdw2:41000:/home/gpadmin/gpAdminLogs/pg_basebackup.20191226_161044.dbid194.out,/data01/mirror/gpseg372:41000:false:false:1142:372:sdw3:40000:/home/gpadmin/gpAdminLogs/pg_basebackup.20191226_161044.dbid1142.out\" -l /home/gpadmin/gpAdminLogs -n -B 16 --force-overwrite"'
rc=1, stdout='20191226:16:10:45:159994 gpconfigurenewsegment:sdw1:gpadmin-[INFO]:-Starting gpconfigurenewsegment with args: -c /data01/primary/gpseg192:40000:false:false:194:192:sdw2:41000:/home/gpadmin/gpAdminLogs/pg_basebackup.20191226_161044.dbid194.out,/data01/mirror/gpseg372:41000:false:false:1142:372:sdw3:40000:/home/gpadmin/gpAdminLogs/pg_basebackup.20191226_161044.dbid1142.out -l /home/gpadmin/gpAdminLogs -n -B 16 --force-overwrite
...
ExecutionError: 'Error Executing Command: ' occured. Details: '/usr/local/greenplum-db/./bin/lib/gpconfigurenewsegment -c /data01/primary/gpseg192:40000:false:false:194:192:sdw2:41000:/home/gpadmin/gpAdminLogs/pg_basebackup.20191226_161044.dbid194.out,/data01/mirror/gpseg372:41000:false:false:1142:372:sdw3:40000:/home/gpadmin/gpAdminLogs/pg_basebackup.20191226_161044.dbid1142.out -l /home/gpadmin/gpAdminLogs -n -B 16 --force-overwrite' cmd had rc=1 completed=True halted=False
stdout=''
stderr='ExecutionError: 'non-zero rc: 1' occured. Details: 'pg_basebackup -c fast -D /data01/primary/gpseg192 -h sdw2 -p 41000 --slot internal_wal_replication_slot --xlog-method stream --force-overwrite --write-recovery-conf --target-gp-dbid 194 -E ./db_dumps -E ./gpperfmon/data -E ./gpperfmon/logs -E ./promote --progress --verbose > /home/gpadmin/gpAdminLogs/pg_basebackup.20191226_161044.dbid194.out 2>&1' cmd had rc=1 completed=True halted=False
stdout=''
stderr='''
  • pg_basebackup log
pg_basebackup: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "10.130.211.246", user "gpadmin", SSL off

Note: As you can see, the required pg_hba.conf entry is not present because gprecoverseg is connecting via the hostname (the external IP) rather than the VIP stored in the address field.
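As a quick check, a query along the following lines (a sketch; the column names come from the gp_segment_configuration output above) lists the segments whose hostname and address differ and are therefore affected:

-- Sketch: list segments that gprecoverseg would contact via hostname
-- instead of the VIP stored in the address field
SELECT dbid, content, role, status, port, hostname, address, datadir
FROM gp_segment_configuration
WHERE hostname <> address
ORDER BY content, role;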


Environment

Product Version: 6.1

Resolution

This is a known issue and will be improved in a later Greenplum 6.x release. For now, to work around this error, do the following:

1. Add the following entry to pg_hba.conf on the master:
host replication gpadmin 10.130.211.0/24 trust

2. The database field must be set to replication; any other database name will not work.
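For example, a minimal sketch of applying this on the master (assuming the standard gpadmin environment where MASTER_DATA_DIRECTORY points at the master data directory, and using the example subnet from this article):

# Append the replication entry to the master's pg_hba.conf
echo "host replication gpadmin 10.130.211.0/24 trust" >> $MASTER_DATA_DIRECTORY/pg_hba.conf

# Reload pg_hba.conf on the running cluster without a restart
gpstop -u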

A full explanation of the database field values can be found in the PostgreSQL documentation: https://www.postgresql.org/docs/9.4/auth-pg-hba-conf.html


Additional Information 

From the PostgreSQL documentation on the database field in pg_hba.conf:

Specifies which database name(s) this record matches. The value all specifies that it matches all databases. The value sameuser specifies that the record matches if the requested database has the same name as the requested user. The value samerole specifies that the requested user must be a member of the role with the same name as the requested database. (samegroup is an obsolete but still accepted spelling of samerole.) Superusers are not considered to be members of a role for the purposes of samerole unless they are explicitly members of the role, directly or indirectly, and not just by virtue of being a superuser. The value replication specifies that the record matches if a replication connection is requested (note that replication connections do not specify any particular database). Otherwise, this is the name of a specific PostgreSQL database. Multiple database names can be supplied by separating them with commas. A separate file containing database names can be specified by preceding the file name with @.
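To illustrate with the subnet used in this article (a sketch, not part of the original configuration): only the first entry below matches the replication connection that pg_basebackup makes; the second matches only ordinary connections to a database named postgres and would not allow the recovery to proceed.

# Matches replication connections (what pg_basebackup uses)
host  replication  gpadmin  10.130.211.0/24  trust

# Matches only normal connections to the "postgres" database; does NOT match replication
host  postgres     gpadmin  10.130.211.0/24  trust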