gpstart Shows Message, "[WARNING]:-Could not start standby master"

Products

VMware Tanzu Greenplum

Issue/Introduction

Symptoms:

gpstart shows a WARNING message stating that the standby master could not be started. Any utility that restarts the database as part of its run, such as gpexpand or gppersistentrebuild, will exit.

The error message is shown below:

20160507:15:06:43:011062 gpstart:mdw:gpadmin-[WARNING]:-Could not start standby master
20160507:15:06:43:011062 gpstart:mdw:gpadmin-[WARNING]:-Standby Master could not be started

Environment

Cause

There could be several reasons for this failure. Two common issues and their resolutions are outlined below.

To identify the cause, check the error messages in the log files under the pg_log directory of the standby master.

Cause 1: Error: could not connect to the primary server

2016-05-07 14:38:08.459996 EDT,,,p39826,th1541760800,,,,0,,,seg-1,,,,,"ERROR","XX000","could not connect to the primary server: FATAL:  no pg_hba.conf entry for replication connection from host ""172.28.8.251"", user ""gpadmin"", SSL off (gp_libpqwalreceiver.c:81)",,,,,,,0,,"gp_libpqwalreceiver.c",81,"Stack trace:
1    0xb0591e postgres errstart + 0x4de

The master-mirroring WAL replication process requires the standby master to connect to the primary master. The standby makes this connection using the database name replication. No database named "replication" needs to exist: when the primary master receives a connection with this database name, it ignores the name and treats the connection as a standby replication connection. The pg_hba.conf file on the primary master must therefore contain an entry like the one below, where 172.28.8.251 is the standby IP address:

host	replication	gpadmin	172.28.8.251/32	trust

Because this entry was missing on the primary master, the standby could not start.
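The check for this entry can be sketched as a small shell function that looks for an exact host/replication line for the standby address and appends one if none is found. The function name ensure_replication_entry is hypothetical, and the match is deliberately narrow: it only recognizes an exact /32 entry for the given role, so an existing broader rule (a wider address range, for example) would not be detected. Back up pg_hba.conf before modifying it.

```shell
#!/usr/bin/env bash
# Append a replication entry for the standby to a pg_hba.conf file, but
# only if no exact "host replication <role> <ip>/32" line exists yet.
# Arguments: path to pg_hba.conf, standby IP address, role (default gpadmin).
ensure_replication_entry() {
    local hba_file="$1" standby_ip="$2" role="${3:-gpadmin}"
    # awk splits fields on any run of whitespace, so tabs and spaces both
    # work. Commented-out lines are skipped because their first field is
    # not exactly "host".
    if awk -v ip="${standby_ip}/32" -v role="$role" \
        '$1 == "host" && $2 == "replication" && $3 == role && $4 == ip { found = 1 }
         END { exit !found }' "$hba_file"; then
        echo "entry already present"
    else
        printf 'host\treplication\t%s\t%s/32\ttrust\n' "$role" "$standby_ip" >> "$hba_file"
        echo "entry added"
    fi
}

# Example (hypothetical path to the primary master's pg_hba.conf):
# ensure_replication_entry /data/master/gpseg-1/pg_hba.conf 172.28.8.251
```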

Cause 2: ERROR: could not receive data from WAL stream

2016-05-07 15:59:01.824158 EDT,,,p36393,th-890198240,,,,0,,,seg-1,,,,,"ERROR","XX000","could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000001A600000003 has already been removed (gp_libpqwalreceiver.c:399)",,,,,,,0,,"gp_libpqwalreceiver.c",399,"Stack trace:
1    0xb0591e postgres errstart (elog.c:502)
or
2017-08-10 16:55:41.201193 GMT,"gpadmin",,p47942,th-1481238752,"192.168.11.6","23573",2017-08-10 16:55:41 GMT,0,con189,,seg-1,,,,,"ERROR","58P01","requested WAL segment 000000010000000000000026 has already been removed",,,,,,,0,,"walsender.c",881,"Stack trace:
1 0xb0a80e postgres errstart + 0x4de

If the connection between the master and the standby is lost (as in the first error), the master continues to process operations. When the standby comes up again, it tries to catch up with the master. This error appears when a WAL (xlog) segment that the standby requests has already been removed from the master. The only way to fix this is to reinitialize the standby.

Resolution

Resolution 1

Add the missing entry to the pg_hba.conf file on the primary master, where 172.28.8.251 is the standby IP address, and then start the database again:

host	replication	gpadmin	172.28.8.251/32	trust

Resolution 2

Remove the standby master and add it again. This is an online operation and does not require downtime.

To remove, use the following:

gpinitstandby -r

To add, use the following, where smdw is the standby master to be configured:

gpinitstandby -s smdw
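The two steps above can be sketched as a single script. The helper names run and reinit_standby and the DRY_RUN variable are assumptions introduced for illustration. The final gpstate -f call, which displays details of the standby master, is an added verification step that is not part of the original procedure. Note that gpinitstandby may prompt for confirmation when run interactively.

```shell
#!/usr/bin/env bash
# Remove and re-add the standby master, then display its status.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reinit_standby() {
    local standby_host="$1"
    run gpinitstandby -r || return 1                  # remove the existing standby
    run gpinitstandby -s "$standby_host" || return 1  # add the standby again
    run gpstate -f                                    # show standby master details
}

# Example: preview the commands without running them.
# DRY_RUN=1 reinit_standby smdw
```

Stopping at the first failed command avoids trying to add a standby when the removal did not succeed.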