gpstart shows a WARNING message stating that the standby master cannot be started. Any utility that needs a database restart while it runs, such as gpexpand or gppersistentrebuild, will exit.
The error message is shown below:
20160507:15:06:43:011062 gpstart:mdw:gpadmin-[WARNING]:-Could not start standby master
20160507:15:06:43:011062 gpstart:mdw:gpadmin-[WARNING]:-Standby Master could not be started
There could be several reasons for this failure. Two common issues and their resolutions are outlined below.
To find the cause, check the error message in the log files under the pg_log directory of the standby master.
Cause 1: ERROR: could not connect to the primary server
2016-05-07 14:38:08.459996 EDT,,,p39826,th1541760800,,,,0,,,seg-1,,,,,"ERROR","XX000","could not connect to the primary server: FATAL: no pg_hba.conf entry for replication connection from host ""172.28.8.251"", user ""gpadmin"", SSL off (gp_libpqwalreceiver.c:81)",,,,,,,0,,"gp_libpqwalreceiver.c",81,"Stack trace: 1 0xb0591e postgres errstart + 0x4de
The master-mirroring WAL replication process needs to establish a connection with the primary master. This connection is made using "replication" as the database name. A database named "replication" does not need to exist: when the standby master connects with this database name, the primary master ignores the name and treats the connection as a standby replication connection. The pg_hba.conf file on the primary master must contain an entry like the one shown below:
"host replication gpadmin 172.28.8.251/32 trust" <<<<<<< where 172.28.8.251 is the standby IP address
Because this entry was missing on the primary master, the standby master could not start.
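A minimal sketch of how the entry could be checked for and appended, assuming it is run on the primary master against its pg_hba.conf file. The function name ensure_replication_entry and its arguments are hypothetical and not part of any Greenplum utility:

```shell
# Hypothetical helper: append the replication entry only if it is missing.
# Usage: ensure_replication_entry <pg_hba.conf path> <standby IP> [user]
ensure_replication_entry() {
    hba_file="$1"
    standby_ip="$2"
    db_user="${3:-gpadmin}"

    # Compare field by field so that extra spaces or tabs do not matter.
    # Commented-out lines do not match because their first field is not "host".
    if awk -v u="$db_user" -v a="${standby_ip}/32" '
            $1 == "host" && $2 == "replication" && $3 == u && $4 == a && $5 == "trust" { found = 1 }
            END { exit !found }' "$hba_file"; then
        echo "already present"
    else
        printf 'host replication %s %s/32 trust\n' "$db_user" "$standby_ip" >> "$hba_file"
        echo "added"
    fi
}
```

For example, ensure_replication_entry "$MASTER_DATA_DIRECTORY/pg_hba.conf" 172.28.8.251, assuming MASTER_DATA_DIRECTORY points at the primary master data directory. Back up the file before changing it.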
Cause 2: ERROR: could not receive data from WAL stream
2016-05-07 15:59:01.824158 EDT,,,p36393,th-890198240,,,,0,,,seg-1,,,,,"ERROR","XX000","could not receive data from WAL stream: ERROR: requested WAL segment 00000001000001A600000003 has already been removed (gp_libpqwalreceiver.c:399)",,,,,,,0,,"gp_libpqwalreceiver.c",399,"Stack trace: 1 0xb0591e postgres errstart (elog.c:502)
or
2017-08-10 16:55:41.201193 GMT,"gpadmin",,p47942,th-1481238752,"192.168.11.6","23573",2017-08-10 16:55:41 GMT,0,con189,,seg-1,,,,,"ERROR","58P01","requested WAL segment 000000010000000000000026 has already been removed",,,,,,,0,,"walsender.c",881,"Stack trace: 1 0xb0a80e postgres errstart + 0x4de
If the connection between the master and the standby is lost (as in the first error), the master continues to perform operations. When the standby comes up again, it tries to catch up with the master. This error appears if the WAL (xlog) segment that the standby requests has already been removed from the master. The only way to fix this is to reinitialize the standby master.
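The two causes can be told apart by the error text in the log. A minimal sketch, assuming the log contains the message strings quoted above; the function name classify_standby_error and its output labels are hypothetical:

```shell
# Hypothetical helper: report which of the two causes a log file shows.
# Usage: classify_standby_error <log file>
classify_standby_error() {
    log_file="$1"
    found=0

    # Cause 1: the primary master rejected the replication connection.
    if grep -q 'no pg_hba.conf entry for replication connection' "$log_file"; then
        echo "Cause 1: missing pg_hba.conf replication entry"
        found=1
    fi

    # Cause 2: the standby asked for a WAL segment the master no longer has.
    if grep -q 'requested WAL segment .* has already been removed' "$log_file"; then
        echo "Cause 2: requested WAL segment removed"
        found=1
    fi

    [ "$found" -eq 1 ] || echo "No known cause found"
}
```

It could be run over each file in the standby master's pg_log directory, for example in a loop over "$MASTER_DATA_DIRECTORY"/pg_log/*.csv. That path and file extension are assumptions about the local installation.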
Resolution 1
Add the missing entry in the pg_hba.conf of the primary master and start the database again:
"host replication gpadmin 172.28.8.251/32 trust" <<<<<<< where 172.28.8.251 is the standby IP address
Resolution 2
Remove the standby master and add it again. This is an online activity and does not need any downtime.
To remove, use the following:
gpinitstandby -r
To add, use the following:
gpinitstandby -s smdw <<<<< where smdw is the standby master to be configured
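The remove and add steps can be wrapped in a small script. This is a sketch only: the helper names run and reinit_standby and the DRY_RUN variable are hypothetical, and only the two gpinitstandby invocations come from the steps above.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove the standby master, then add it back on the given host.
# Usage: reinit_standby <standby host>
reinit_standby() {
    standby_host="$1"
    if [ -z "$standby_host" ]; then
        echo "usage: reinit_standby <standby host>" >&2
        return 1
    fi
    run gpinitstandby -r || return 1                  # remove the existing standby master
    run gpinitstandby -s "$standby_host" || return 1  # add it again on the given host
}
```

Running DRY_RUN=1 reinit_standby smdw prints the two commands without executing them, which allows a review before the real run as gpadmin on the master. gpinitstandby may still prompt for confirmation when executed.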