gpstart fails to connect to some segments and returns the errors below:
[INFO]:-DBID:62 FAILED host:' datadir:'/data1/primary/gpseg60' with reason:'Start failed; check segment logfile. "failed to connect: Connection refused (errno: 111) failed to connect: Connection refused (errno: 111) Retrying no 1 failure: timeout Retrying no 2 failure: OtherTransitionInProgress failure: OtherTransitionInProgress"'exit
20131121:01:55:38:001215 gpstart:mdw:gpadmin-[WARNING]:-FATAL: DTM initialization: failure during startup recovery, retry failed, check segment status (cdbtm.c:1534)
DTM initialization: failure during startup recovery
Generate a script named test.sh. It pings each primary segment to find out which segments cannot currently accept connections:
PGOPTIONS='-c gp_session_role=utility' psql -d template1 -Atc "copy (select dbid, hostname, port from gp_segment_configuration where role = 'p' and content != -1) to stdout delimiter ' '" | while read dbid host port; do
  echo "echo DBID: $dbid"
  echo "PGOPTIONS='-c gp_session_role=utility' psql -h $host -p $port -d template1 -c 'select 1;'"
done > test.sh

3. Execute test.sh on the master and write the output to test.out:

chmod 755 test.sh
./test.sh > test.out 2>&1

4. Check which DBIDs cannot be connected to:
grep starting test.out
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
(...)

5. Log in to the segment host of a DBID that cannot be connected to and check whether there is a "startup pass 2" utility process:
gpadmin@sdw5:~> ps -ef|grep 40000
gpadmin 24956 1 0 03:12 ? 00:00:10 /usr/local/greenplum-db-4.2.5.2/bin/postgres -D /data1/primary/gpseg24 -p 40000 -b 26 -z 96 --silent-mode=true -i -M quiescent -C 24
gpadmin 25029 24956 0 03:12 ? 00:00:00 postgres: port 40000, logger process
gpadmin 25068 24956 0 03:12 ? 00:00:18 postgres: port 40000, filerep transition process
gpadmin 25069 24956 0 03:12 ? 00:00:05 postgres: port 40000, primary process
gpadmin 25070 25069 0 03:12 ? 00:00:14 postgres: port 40000, primary receiver ack process
gpadmin 25071 25069 0 03:12 ? 00:01:14 postgres: port 40000, primary sender process
gpadmin 25072 25069 0 03:12 ? 00:00:19 postgres: port 40000, primary consumer ack process
gpadmin 25073 25069 0 03:12 ? 00:00:06 postgres: port 40000, primary recovery process
gpadmin 25074 25069 0 03:12 ? 00:00:03 postgres: port 40000, primary verification process
gpadmin 25095 24956 2 03:14 ? 00:04:30 postgres: port 40000, startup pass 2 process
gpadmin 29266 29231 0 06:31 pts/2 00:00:00 grep 40000

6. Strace the process to make sure it is functioning:
Process 25095 attached - interrupt to quit
semop(1508639228, 0x7fffddb442d0, 1) = 0
open("global/5090", O_RDWR) = 9
semop(1508639228, 0x7fffddb41590, 1) = 0
semop(1508639228, 0x7fffddb41590, 1) = 0
semop(1508540921, 0x7fffddb41590, 1) = 0
semop(1508639228, 0x7fffddb45500, 1) = 0
semop(1508639228, 0x7fffddb45500, 1) = 0
semop(1508540921, 0x7fffddb45500, 1) = 0
lseek(9, 848723968, SEEK_SET) = 848723968
write(9, "\274\0\0\0PkUQ\1\0\0\0\364\3\0\4\0\200\4\200\200\377\374\0\0\377\374\0\200\376\374\0"..., 32768) = 32768
lseek(9, 653361152, SEEK_SET) = 653361152
read(9, "\274\0\0\0\260a\363P\1\0\0\0\364\3\0\4\0\200\4\200\200\377\374\0\0\377\374\0\200\376\374\0"..., 32768) = 32768
In the case above, the segment is rolling back some large transactions: a "DROP DATABASE" statement was killed before GPDB was shut down. Because that database contained a very large number of files, recovery took several hours to complete. Wait for the recovery process to finish, that is, until this primary segment accepts connections in utility mode. In the meantime, keep running "test.sh" to check that the number of segments reporting "starting up" is decreasing:
gpadmin@mdw:~/> ./test.sh |grep starting
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
gpadmin@mdw:~/> ./test.sh |grep starting
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
psql: FATAL: the database system is starting up
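The grep above shows how many segments are still recovering, but not which ones. A minimal sketch of how the DBIDs could be recovered from test.out is given below. It assumes the layout that test.sh produces (a "DBID: <n>" line followed by the psql output for that segment); recovering_dbids is a hypothetical helper name, not a Greenplum utility, and the polling interval in the commented loop is arbitrary.

```shell
#!/usr/bin/env bash
# Print the DBID of every segment whose connection attempt reported
# "the database system is starting up".
# Assumption: the input file has the layout written by test.sh, i.e. a
# "DBID: <n>" line followed by the psql output for that segment.
recovering_dbids() {
  awk '
    /^DBID: / { dbid = $2; next }                         # remember the current DBID
    /the database system is starting up/ { print dbid }   # this DBID is still recovering
  ' "$1"
}

# Example polling loop (commented out because it needs a live cluster):
# while :; do
#   ./test.sh > test.out 2>&1
#   echo "$(date '+%H:%M:%S') still recovering: $(recovering_dbids test.out | wc -l)"
#   sleep 300
# done
```

Running recovering_dbids test.out after each pass of test.sh prints one DBID per line, so a shrinking list indicates that recovery is progressing.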
7. Eventually, all segments should finish recovery and none should report "starting up" any more. Once that is the case, Greenplum can be restarted using the following commands:
gpstop -af
gpstart
Note: When primary segments are in recovery, do NOT restart Greenplum immediately after you see the gpstart errors. The primary segments need to recover first.
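The rule in the note can be turned into a simple guard. The sketch below is one possible way to do it, not an official procedure: all_segments_recovered is a hypothetical helper that inspects a test.out file written by test.sh and succeeds only when no segment reports "starting up". It checks only for that message, so other connection failures (for example "Connection refused") still need to be reviewed by hand.

```shell
#!/usr/bin/env bash
# Return 0 when no segment in the given test.out reports "starting up",
# 1 when at least one segment is still recovering, and 2 when the file is
# missing or contains no "DBID:" lines (nothing was actually tested).
# Assumption: the file was produced by "./test.sh > test.out 2>&1".
all_segments_recovered() {
  [ -r "$1" ] || return 2
  grep -q '^DBID: ' "$1" || return 2
  if grep -q 'the database system is starting up' "$1"; then
    return 1
  fi
  return 0
}

# Example use on the master (commented out because it needs a live cluster):
# ./test.sh > test.out 2>&1
# if all_segments_recovered test.out; then
#   gpstop -af
#   gpstart
# else
#   echo "Segments are still recovering; not restarting." >&2
# fi
```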