gpstart segment error: "failed to connect: Connection refused"


Article ID: 295358


Products

VMware Tanzu Greenplum

Issue/Introduction

Symptoms:

gpstart fails to connect to the segments, returning the errors below:

[INFO]:-DBID:62  FAILED  host:' datadir:'/data1/primary/gpseg60' with reason:'Start failed; check segment logfile.  "failed to connect: Connection refused (errno: 111)  failed to connect: Connection refused (errno: 111)  Retrying no 1  failure: timeout  Retrying no 2  failure: OtherTransitionInProgress failure: OtherTransitionInProgress"'exit
20131121:01:55:38:001215 gpstart:mdw:gpadmin-[WARNING]:-FATAL:  DTM initialization: failure during startup recovery, retry failed, check segment status (cdbtm.c:1534)

DTM initialization: failure during startup recovery

Environment


Resolution

1. Log in to the failed segments to check whether their postmaster and utility processes exist, for example:
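A minimal check, using the data directory from the error above (substitute the failed segment's own data directory and port; 40000 is only an example port):
ps -ef | grep postgres | grep '/data1/primary/gpseg60'
ps -ef | grep 40000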
2. If they do, run the command below on the master. It generates a shell script, test.sh, which connects to each primary segment in utility mode to find out which segments cannot currently accept connections.
PGOPTIONS='-c gp_session_role=utility' psql -d template1 -Atc "copy (select dbid, hostname, port from gp_segment_configuration where role = 'p' and content != -1) to stdout delimiter ' '" | while read dbid host port; do
    echo "echo DBID: $dbid"
    echo "PGOPTIONS='-c gp_session_role=utility' psql -h $host -p $port -d template1 -c 'select 1;'"
done > test.sh
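The generated test.sh contains one pair of lines per primary segment, similar to the following (the hostname and port shown here are only illustrative):
echo DBID: 2
PGOPTIONS='-c gp_session_role=utility' psql -h sdw1 -p 40000 -d template1 -c 'select 1;'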
3. Execute test.sh on the master and capture its output in test.out:
chmod 755 test.sh
./test.sh > test.out 2>&1
4. Check which DBIDs cannot be connected to:
grep starting test.out
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
(...)
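Because test.sh echoes the DBID before each connection attempt, the failing DBIDs themselves can be identified by also printing the line preceding each match, for example:
grep -B 1 "starting up" test.out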
5. Log in to the host of a DBID that cannot be connected to and check whether a "startup pass 2" process exists:
gpadmin@sdw5:~> ps -ef|grep 40000
gpadmin 24956 1 0 03:12 ? 00:00:10 /usr/local/greenplum-db-4.2.5.2/bin/postgres -D /data1/primary/gpseg24 -p 40000 -b 26 -z 96 --silent-mode=true -i -M quiescent -C 24
gpadmin 25029 24956 0 03:12 ? 00:00:00 postgres: port 40000, logger process 
gpadmin 25068 24956 0 03:12 ? 00:00:18 postgres: port 40000, filerep transition process 
gpadmin 25069 24956 0 03:12 ? 00:00:05 postgres: port 40000, primary process 
gpadmin 25070 25069 0 03:12 ? 00:00:14 postgres: port 40000, primary receiver ack process 
gpadmin 25071 25069 0 03:12 ? 00:01:14 postgres: port 40000, primary sender process 
gpadmin 25072 25069 0 03:12 ? 00:00:19 postgres: port 40000, primary consumer ack process 
gpadmin 25073 25069 0 03:12 ? 00:00:06 postgres: port 40000, primary recovery process 
gpadmin 25074 25069 0 03:12 ? 00:00:03 postgres: port 40000, primary verification process 
gpadmin 25095 24956 2 03:14 ? 00:04:30 postgres: port 40000, startup pass 2 process 
gpadmin 29266 29231 0 06:31 pts/2 00:00:00 grep 40000
6. strace the "startup pass 2" process to confirm that it is still doing work:
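For example, attaching to the "startup pass 2" PID shown in the ps output above (25095 in this case):
strace -p 25095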
Process 25095 attached - interrupt to quit
semop(1508639228, 0x7fffddb442d0, 1)    = 0
open("global/5090", O_RDWR)             = 9
semop(1508639228, 0x7fffddb41590, 1)    = 0
semop(1508639228, 0x7fffddb41590, 1)    = 0
semop(1508540921, 0x7fffddb41590, 1)    = 0
semop(1508639228, 0x7fffddb45500, 1)    = 0
semop(1508639228, 0x7fffddb45500, 1)    = 0
semop(1508540921, 0x7fffddb45500, 1)    = 0
lseek(9, 848723968, SEEK_SET)           = 848723968
write(9, "\274\0\0\0PkUQ\1\0\0\0\364\3\0\4\0\200\4\200\200\377\374\0\0\377\374\0\200\376\374\0"..., 32768) = 32768
lseek(9, 653361152, SEEK_SET)           = 653361152
read(9, "\274\0\0\0\260a\363P\1\0\0\0\364\3\0\4\0\200\4\200\200\377\374\0\0\377\374\0\200\376\374\0"..., 32768) = 32768

In the case above, the segment is rolling back a large transaction: before GPDB was shut down, a "DROP DATABASE" statement had been killed, and because that database contained a very large number of files, recovery took several hours to complete. Wait for the recovery process to finish; only then can this primary segment be connected to in utility mode. In the meantime, keep running "test.sh" to confirm that the number of segments reporting "starting up" is decreasing (a monitoring loop sketch follows the sample output below).


gpadmin@mdw:~/> ./test.sh 2>&1 | grep starting
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up

gpadmin@mdw:~/> ./test.sh 2>&1 | grep starting
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
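Between the two runs above, the count of "starting up" segments drops from 8 to 5. A small loop can track the count automatically (a sketch only; adjust the 300-second interval as needed):

while true; do
    echo "$(date): $(./test.sh 2>&1 | grep -c 'starting up') segments still starting up"
    sleep 300
done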

7. Eventually all segments finish recovery and no more "starting up" segments are reported. Once that is the case, Greenplum can be restarted using the following commands:


gpstop -af
gpstart
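
After the restart, cluster and segment status can be verified with gpstate, for example the detailed per-segment view:

gpstate -s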
 

Note: While primary segments are in recovery, do NOT repeatedly restart Greenplum after seeing the gpstart errors; the primary segments need to finish recovery first.