This article applies to all versions of VMware Tanzu Greenplum Database (GPDB).
This article covers how the parameter gp_segment_connect_timeout works.
For more information, see "How Greenplum Database Detects a Failed Segment."
gp_segment_connect_timeout is the allotted time for the GPDB interconnect to connect to a segment instance over the network before timing out. This parameter controls the network connection timeout between the master and the primary segments, and also between each primary segment and its mirror for the replication processes.
To check the current value of the parameter, use the following command:

gpconfig -s gp_segment_connect_timeout

Note: The default value for gp_segment_connect_timeout is 10 minutes.
[gpadmin@mdw ~]$ gpconfig -s gp_segment_connect_timeout
Values on all segments are consistent
GUC          : gp_segment_connect_timeout
Master  value: 10min
Segment value: 10min
The current cluster configuration is:
flightdata=# select * from gp_segment_configuration order by 2;
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    1 |      -1 | p    | p              | s    | u      |  4340 | mdw      | mdw     |                  |
   10 |      -1 | m    | m              | s    | u      |  4340 | smdw     | smdw    |                  |
    2 |       0 | p    | p              | s    | u      | 43400 | sdw3     | sdw3    |            49340 |
    6 |       0 | m    | m              | s    | u      | 59340 | sdw4     | sdw4    |            56340 |
    7 |       1 | m    | m              | s    | u      | 59341 | sdw4     | sdw4    |            56341 |
    3 |       1 | p    | p              | s    | u      | 43401 | sdw3     | sdw3    |            49341 |
    4 |       2 | p    | p              | s    | u      | 43400 | sdw4     | sdw4    |            49340 |
    8 |       2 | m    | m              | s    | u      | 59340 | sdw3     | sdw3    |            56340 |
    5 |       3 | p    | p              | s    | u      | 43401 | sdw4     | sdw4    |            49341 |
    9 |       3 | m    | m              | s    | u      | 59341 | sdw3     | sdw3    |            56341 |
(10 rows)
To reproduce a timeout, pick the mirror segment for content 3 at sdw3:59341 and put this mirror segment's processes to sleep:
[gpadmin@sdw3 ~]$ ps -ef | grep 59341
gpadmin  31307      1  0 07:07 ?  00:00:00 /usr/local/greenplum-db/bin/postgres -D /data2/mirror/gpseg32 -p 59341 -b 9 -z 4 --silent-mode=true -i -M quiescent -C 3
gpadmin  31308  31307  0 07:07 ?  00:00:00 postgres: port 59341, logger process
gpadmin  31312  31307  0 07:07 ?  00:00:00 postgres: port 59341, mirror process
gpadmin  31313  31312  0 07:07 ?  00:00:02 postgres: port 59341, mirror receiver process
gpadmin  31314  31312  0 07:07 ?  00:00:02 postgres: port 59341, mirror consumer process
gpadmin  31315  31312  0 07:07 ?  00:00:00 postgres: port 59341, mirror consumer writer process
gpadmin  31316  31312  0 07:07 ?  00:00:00 postgres: port 59341, mirror consumer append only process
gpadmin  31317  31312  0 07:07 ?  00:00:00 postgres: port 59341, mirror sender ack process
gpadmin  31318  31312  0 07:07 ?  00:00:00 postgres: port 59341, mirror verification process
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31312
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31313
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31314
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31315
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31316
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31317
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31318
The primary segment checks the health of its mirror segment, so the primary segment's logs need to be checked to understand why the mirror is unreachable. A warning is logged once 75% of the timeout has elapsed; after the full 10 minutes, the primary segment's logs report that the mirror segment is unable to communicate with the primary and that failover is requested.
2015-02-18 04:34:20.206197 PST,,,p30264,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","threshold '75' percent of 'gp_segment_connect_timeout=600' is reached, mirror may not be able to keep up with primary, primary may transition to change tracking",,"increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'",,,,,0,,"cdbfilerepprimaryack.c",860,
2015-02-18 04:36:55.113403 PST,,,p30264,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'heartBeat' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' process name(pid) 'primary recovery process(30264)' filerep state 'up and running' position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3962' ",,0,,"cdbfilerepprimaryack.c",898,
2015-02-18 04:36:55.113494 PST,,,p30264,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete operation on mirror, failover requested","identifier 'heartBeat' operation 'heart beat' relation type 'control message' message count '-1'","run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' process name(pid) 'primary recovery process(30264)' filerep state 'up and running' position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3962' position begin '0x2baa5cdfb040' position end '0x2baa5d5fb100' position insert '0x2baa5d0b8528' position consume '0x2baa5d0b8528' position wraparound '0x2baa5d5fb100' insert count '39006' consume count '39006' ",,0,,"cdbfilerepprimary.c",1432,
2015-02-18 04:37:21.830764 PST,,,p30260,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'shutdown' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in shutdown' process name(pid) 'filerep main process(30260)' filerep state 'not initialized' position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3963' ",,0,,"cdbfilerepprimaryack.c",898,
2015-02-18 04:37:21.830834 PST,,,p30260,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete operation on mirror, failover requested","identifier 'shutdown' operation 'shutdown' relation type 'control message' message count '-1'","run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in shutdown' process name(pid) 'filerep main process(30260)' filerep state 'not initialized' position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3963' position begin '0x2baa5cdfb040' position end '0x2baa5d5fb100' position insert '0x2baa5d0ba240' position consume '0x2baa5d0b8528' position wraparound '0x2baa5d5fb100' insert count '39007' consume count '39007' ",,0,,"cdbfilerepprimary.c",1311,
2015-02-18 04:37:21.885026 PST,,,p28586,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","filerep main process (PID 30260) exited with exit code 0",,,,,,,0,,"postmaster.c",5876,
2015-02-18 04:37:21.891763 PST,,,p31001,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","mirror transition, primary address(port) 'sdw4(49341)' mirror address(port) 'sdw3(56341)'",,,,,"mirroring role 'primary role' mirroring state 'change tracking' segment state 'not initialized' process name(pid) 'filerep main process(31001)' filerep state 'not initialized' ",,0,,"cdbfilerep.c",3466,
2015-02-18 04:37:21.893572 PST,,,p31002,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","CHANGETRACKING: ChangeTracking_RetrieveIsTransitionToInsync() found insync_transition_completed:'true' full resync:'false'",,,,,,,0,,"cdbresynchronizechangetracking.c",2545,
2015-02-18 04:37:21.894052 PST,,,p31002,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","last checkpoint location for generating initial changetracking log 1/38000490",,,,,,,0,,"xlog.c",11560,
2015-02-18 04:37:21.894123 PST,,,p31002,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","scanned through 1 initial xlog records since last checkpoint for writing into the resynchronize change log",,,,,,,0,,"cdbresynchronizechangetracking.c",206,
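When reproducing this, the relevant entries can be located by searching the primary segment's log files. The sketch below assumes the primary's data directory path; the real location depends on your cluster layout and should be looked up on the segment host.

```shell
# Path to the failed mirror's primary segment data directory -- an
# assumption for illustration; look up the real path on the segment host.
SEG_DATA_DIR=/data1/primary/gpseg3

# Search the segment's CSV logs for timeout and mirror-failure messages.
grep -E "gp_segment_connect_timeout|mirror failure" "$SEG_DATA_DIR"/pg_log/*.csv
```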
The master's ftsprobe process checks the primary segment's health. It determines that the primary segment for content 3 is healthy, but that the primary cannot communicate with its mirror. The primary segment for content 3 reports the mirroring fault to ftsprobe, which marks the mirror segment as down. The logs associated with this process are displayed below:
2015-02-18 04:33:31.855817 PST,,,p21713,th2056423744,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: segment (dbid=5, content=3) reported fault FaultMirror segmentstatus 11 to the prober.",,,,,,,0,,,,
2015-02-18 04:33:31.855864 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=5) reported mirroring fault with mirror (dbid=9), mirror considered to be down.",,,,,,,0,,"ftsfilerep.c",358,
2015-02-18 04:33:31.855883 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=5, content=3) from ('u','p') to ('u','p')",,,,,,,0,,"fts.c",1157,
2015-02-18 04:33:31.855891 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=9, content=3) from ('u','m') to ('d','m')",,,,,,,0,,"fts.c",1157,
2015-02-18 04:33:31.855901 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,x21888,sx1,"LOG","00000","probeUpdateConfig called for 2 changes",,,,,,,0,,"fts.c",976,
2015-02-18 04:33:31.856719 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=5) on sdw4:43401 transitioning to change-tracking mode, mirror marked as down.",,,,,,,0,,"ftsfilerep.c",498,
2015-02-18 04:33:32.032252 PST,,,p21706,th1770300528,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log: Success:",,,,,,,,"SysLoggerMain","syslogger.c",552,
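These FTS state-change messages can be located in the master's log with a simple grep. A sketch; a standard Greenplum environment sets MASTER_DATA_DIRECTORY, and the fallback path shown is only an example.

```shell
# Search the master's CSV logs for FTS fault detection and state changes.
# MASTER_DATA_DIRECTORY is set by a standard Greenplum environment; the
# fallback path here is an assumption.
grep "FTS" "${MASTER_DATA_DIRECTORY:-/data/master/gpseg-1}"/pg_log/*.csv
```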
The master then updates the mirror segment's status in its configuration.
flightdata=# select * from gp_segment_configuration where status='d';
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    9 |       3 | m    | m              | s    | d      | 59341 | sdw3     | sdw3    |            56341 |
(1 row)
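Because this failure was only simulated by pausing the mirror's processes, recovery is a matter of resuming those processes and then re-establishing mirroring, as the primary's log messages suggest ("run gprecoverseg to re-establish mirror connectivity"). A sketch; the PIDs are the ones from the ps output earlier and will differ on your system:

```shell
# On sdw3: resume the mirror processes that were paused with SIGSTOP.
# These PIDs come from the earlier ps output; adjust for your system.
for pid in 31312 31313 31314 31315 31316 31317 31318; do
    kill -s SIGCONT "$pid"
done

# On the master: recover the downed mirror and monitor resynchronization.
gprecoverseg -a
gpstate -m
```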
If a mirror segment is marked down, check the corresponding primary segment's log to find out why the mirror was unreachable. If the mirror was marked down because gp_segment_connect_timeout was exceeded, the timeout can be increased; however, it is important to first understand why the timeout was hit. For example, the system may be overloaded and need more time to keep the mirror segments in sync.
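If, after investigation, raising the timeout is appropriate, the change can be applied with gpconfig and picked up with a configuration reload, as the warning in the primary log suggests ("increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'"). A sketch, using 1200 seconds (20 minutes) as an example value:

```shell
# Set gp_segment_connect_timeout cluster-wide (value in seconds;
# 1200 is only an example -- choose a value suited to your workload).
gpconfig -c gp_segment_connect_timeout -v 1200

# Reload the configuration without restarting the cluster.
gpstop -u

# Verify the new value on the master and all segments.
gpconfig -s gp_segment_connect_timeout
```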
Please check the "System Administration" guide for the relevant Greenplum version at https://docs.vmware.com/en/VMware-Greenplum/index.html for more details on how to change this parameter.