Greenplum appears to hang during gpstart while a segment recovery is in progress

Article ID: 295444


Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Symptoms:
When running gpstart, the utility appears to hang for some time and the command prompt is not returned.

The following is an example output from gpstart:
20180803:20:58:55:020360 gpstart:mdw1:gpadmin-[WARNING]:-****************************************************************************
20180803:20:58:55:020360 gpstart:mdw1:gpadmin-[WARNING]:-There are 2 segment(s) marked down in the database
20180803:20:58:55:020360 gpstart:mdw1:gpadmin-[WARNING]:-To recover from this current state, review usage of the gprecoverseg
20180803:20:58:55:020360 gpstart:mdw1:gpadmin-[WARNING]:-management utility which will recover failed segment instance databases.
20180803:20:58:55:020360 gpstart:mdw1:gpadmin-[WARNING]:-****************************************************************************
20180803:20:58:55:020360 gpstart:mdw1:gpadmin-[INFO]:-Starting Master instance mdw1 directory /data/master/gpseg-1 
20180803:20:58:56:020360 gpstart:mdw1:gpadmin-[INFO]:-Command pg_ctl reports Master mdw1 instance active
20180803:20:58:56:020360 gpstart:mdw1:gpadmin-[DEBUG]:-Connecting to dbname='template1'
The command prompt does not return even after gpstart has actually completed its run. psql also hangs, and startup connections are visible when checked with the Process Status (ps) command.

Example:
[gpadmin@mdw ~]$ ps -ef | grep postgres | grep -i startup
gpadmin   83129  82178  0 20:58 ?        00:00:00 postgres: port  5432, gpadmin template1 host(port) con5 host(port) startup
Checking with gpstate shows that one or more segment pairs are trying to recover or resynchronize.

Example:
20180803:21:06:55:020928 gpstate:mdw1:gpadmin-[INFO]:-Segment Pairs in Resynchronization
20180803:21:06:55:020928 gpstate:mdw1:gpadmin-[INFO]:-   Current Primary   Port   Resync mode   Est. resync progress   Total resync objects   Objects to resync   Data synced   Est. total to sync   Est. resync end time   Change tracking size   Mirror   Port
20180803:21:06:55:020928 gpstate:mdw1:gpadmin-[INFO]:-   sdw1              1153   Incremental   Not Available          0                      0                   0 bytes       Not Available        Not Available          518 GB                 sdw2     1025
...
20180803:21:06:55:020928 gpstate:mdw1:gpadmin-[INFO]:-   sdw4              1156   Incremental   Not Available          0                      0                   0 bytes       Not Available        Not Available          538 GB                 sdw3     1028

Environment


Cause

This can occur when a Greenplum restart is attempted while an existing segment recovery, such as one started by gprecoverseg, is still running.

This is a known scenario and is by design.

By design, gpstop does not check whether a segment recovery is in progress. During gpstart, if segments are in resync mode, the recovery is resumed and the steps that gprecoverseg had already completed are verified. Progress cannot be checked in this case, which can result in extended time spent by gpstart.

The database does not accept connections until the recovery is completed. To avoid this, do not restart Greenplum while any recovery is running; make sure the recovery has completed before restarting the server.
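For example, one way to confirm that no recovery is still in progress before restarting is to check for a running gprecoverseg process and for segment pairs that are not yet synchronized. This is a minimal sketch, run as gpadmin on the Master host with greenplum_path.sh sourced; the catalog query assumes the database is up and accepting connections:

[gpadmin@mdw ~]$ ps -ef | grep [g]precoverseg
[gpadmin@mdw ~]$ gpstate -e
[gpadmin@mdw ~]$ gpstate -m
[gpadmin@mdw ~]$ psql -d template1 -c "SELECT dbid, content, role, mode, status FROM gp_segment_configuration WHERE mode <> 's';"

If the catalog query returns any rows, or gpstate -e still reports segment pairs in resynchronization or change tracking, let the recovery finish before running gpstop and gpstart.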

Resolution

While the gpstart attempt is still running and appears to hang, check for segment pairs in resynchronization with:

gpstate -e
  • From the output, identify all of the primary and mirror segment pairs in resynchronization.
Example output:
Current Primary   Port   Resync mode   Est. resync progress   Total resync objects   Objects to resync   Data synced   Est. total to sync   Est. resync end time   Change tracking size   Mirror   Port

sdw1              1153   Incremental   Not Available          0                      0                   0 bytes       Not Available        Not Available          518 GB                 sdw2     1025
sdw4              1156   Incremental   Not Available          0                      0                   0 bytes       Not Available        Not Available          538 GB                 sdw3     1028
  • Identify the mirror hosts and mirror ports
Example of mirror hosts and mirror ports from the above:
sdw2:1025
sdw3:1028
  • As the gpadmin user, log in to each of the hosts shown and identify the mirror segment postgres process so it can be brought down (a scripted sketch of this and the following steps appears after this list).
Example:
$ ssh -l gpadmin sdw2

[gpadmin@sdw2 ~]$ ps -ef | grep silent | grep 1025
gpadmin  29979     1  0 Aug05 ?        00:00:01 /usr/local/greenplum-db/bin/postgres -D /data1/mirror/gpseg0 -p 1025 -b 5 -z 3 --silent-mode=true -i -M quiescent -C 0
  • Source the current greenplum_path.sh on the segment host.
Example:
[gpadmin@sdw2 ~]$ source /usr/local/greenplum-db/greenplum_path.sh
  • Using the Greenplum pg_ctl, stop the Mirror Segments identified:
    [gpadmin@sdw2 ~]$ pg_ctl -D /data1/mirror/gpseg0 stop -m fast
  • Once the mirror segment of every pair in resynchronization has been stopped, verify that gpstart has completed.
  • Then, run a full recovery using gprecoverseg from the Master host:
    [gpadmin@mdw ~]$ gprecoverseg -F
  • Wait for all the segments to complete their resynchronization and monitor until they finish this process with gpstate -e and gpstate -m (see the monitoring example below).
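
The per-host steps above can also be scripted. The following is a rough sketch only, assuming passwordless SSH as gpadmin between the hosts; the host:port:directory entries in MIRRORS are placeholders that must be filled in by hand from the gpstate -e output and the ps output on each mirror host (the sdw3 data directory shown is illustrative):

MIRRORS="sdw2:1025:/data1/mirror/gpseg0 sdw3:1028:/data1/mirror/gpseg3"

for m in $MIRRORS; do
    host=$(echo $m | cut -d: -f1)
    port=$(echo $m | cut -d: -f2)
    dir=$(echo $m | cut -d: -f3)
    echo "Stopping mirror segment on $host port $port ($dir)"
    ssh -l gpadmin $host "source /usr/local/greenplum-db/greenplum_path.sh && pg_ctl -D $dir stop -m fast"
done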
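
For the final monitoring step, the status utilities can simply be polled from the Master host. A basic example (the 300-second interval is arbitrary):

[gpadmin@mdw ~]$ watch -n 300 'gpstate -e; gpstate -m'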