gp_segment_connect_timeout - explained

Article ID: 295164

Products

VMware Tanzu Greenplum

Issue/Introduction

This article applies to all versions of VMware Tanzu Greenplum Database (GPDB).

This article covers how the parameter gp_segment_connect_timeout works.

For more information, see the documentation topic How Greenplum Database Detects a Failed Segment.

Resolution

gp_segment_connect_timeout is the amount of time the GPDB interconnect will try to connect to a segment instance over the network before timing out.

This parameter controls the network connection timeout between the master and the primary segments. It also determines the network connection timeout for the replication processes between a primary segment and its mirror.

To check the current value of the parameter, use the following command:

gpconfig -s gp_segment_connect_timeout

Note: The default value for gp_segment_connect_timeout is 10 minutes. 

[gpadmin@mdw ~]$ gpconfig -s gp_segment_connect_timeout
Values on all segments are consistent
GUC          : gp_segment_connect_timeout
Master  value: 10min
Segment value: 10min
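
The value can also be checked from a psql session. A minimal sketch; the flightdata database and the 10min value are simply the ones used in this example:

flightdata=# SHOW gp_segment_connect_timeout;
 gp_segment_connect_timeout
-----------------------------
 10min
(1 row)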

The current cluster configuration is:

flightdata=# select * from gp_segment_configuration order by 2;
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    1 |      -1 | p    | p              | s    | u      |  4340 | mdw      | mdw     |                  |
   10 |      -1 | m    | m              | s    | u      |  4340 | smdw     | smdw    |                  |
    2 |       0 | p    | p              | s    | u      | 43400 | sdw3     | sdw3    |            49340 |
    6 |       0 | m    | m              | s    | u      | 59340 | sdw4     | sdw4    |            56340 |
    7 |       1 | m    | m              | s    | u      | 59341 | sdw4     | sdw4    |            56341 |
    3 |       1 | p    | p              | s    | u      | 43401 | sdw3     | sdw3    |            49341 |
    4 |       2 | p    | p              | s    | u      | 43400 | sdw4     | sdw4    |            49340 |
    8 |       2 | m    | m              | s    | u      | 59340 | sdw3     | sdw3    |            56340 |
    5 |       3 | p    | p              | s    | u      | 43401 | sdw4     | sdw4    |            49341 |
    9 |       3 | m    | m              | s    | u      | 59341 | sdw3     | sdw3    |            56341 |
(10 rows)
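
This walkthrough uses the mirror for content 3. To confirm which instance that is, the configuration shown above can be filtered; a minimal sketch:

flightdata=# select * from gp_segment_configuration where content = 3 and role = 'm';
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    9 |       3 | m    | m              | s    | u      | 59341 | sdw3     | sdw3    |            56341 |
(1 row)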

Pick the mirror segment for content 3 (dbid 9) at sdw3:59341 and suspend its processes with SIGSTOP to simulate an unresponsive mirror:

[gpadmin@sdw3 ~]$ ps -ef | grep 59341
gpadmin  31307     1  0 07:07 ?        00:00:00 /usr/local/greenplum-db/bin/postgres -D /data2/mirror/gpseg32 -p 59341 -b 9 -z 4 --silent-mode=true -i -M quiescent -C 3
gpadmin  31308 31307  0 07:07 ?        00:00:00 postgres: port 59341, logger process
gpadmin  31312 31307  0 07:07 ?        00:00:00 postgres: port 59341, mirror process
gpadmin  31313 31312  0 07:07 ?        00:00:02 postgres: port 59341, mirror receiver process
gpadmin  31314 31312  0 07:07 ?        00:00:02 postgres: port 59341, mirror consumer process
gpadmin  31315 31312  0 07:07 ?        00:00:00 postgres: port 59341, mirror consumer writer process
gpadmin  31316 31312  0 07:07 ?        00:00:00 postgres: port 59341, mirror consumer append only process
gpadmin  31317 31312  0 07:07 ?        00:00:00 postgres: port 59341, mirror sender ack process
gpadmin  31318 31312  0 07:07 ?        00:00:00 postgres: port 59341, mirror verification process

[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31312
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31313
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31314
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31315
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31316
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31317
[gpadmin@sdw3 ~]$ kill -s SIGSTOP 31318

The primary segment checks the health of its mirror segment, so the primary segment's logs need to be checked to understand why the mirror segment is unreachable.
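
One way to look for these messages on the primary's host (sdw4 for content 3 in this example); the data directory path and log file name below are illustrative, so substitute the actual ones for your cluster:

[gpadmin@sdw4 ~]$ grep -i "mirror failure" /data1/primary/gpseg3/pg_log/gpdb-2015-02-18_000000.csv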


A warning is logged once 75% of gp_segment_connect_timeout has elapsed, and after the full 10 minutes the primary segment logs report that the mirror is not acknowledging requests and request a failover:

2015-02-18 04:34:20.206197 PST,,,p30264,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","threshold '75' percent of 'gp_segment_connect_timeout=600' is reached, mirror may not be able to keep up with primary, primary may transition to change tracking",,"increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'",,,,,0,,"cdbfilerepprimaryack.c",860,
2015-02-18 04:36:55.113403 PST,,,p30264,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'heartBeat' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' process name(pid) 'primary recovery process(30264)' filerep state 'up and running'
position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3962' ",,0,,"cdbfilerepprimaryack.c",898,
2015-02-18 04:36:55.113494 PST,,,p30264,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete operation on mirror, failover requested","identifier 'heartBeat' operation 'heart beat' relation type 'control message' message count '-1'","run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' process name(pid) 'primary recovery process(30264)' filerep state 'up and running'
position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3962'
position begin '0x2baa5cdfb040' position end '0x2baa5d5fb100' position insert '0x2baa5d0b8528' position consume '0x2baa5d0b8528' position wraparound '0x2baa5d5fb100' insert count '39006' consume count '39006' ",,0,,"cdbfilerepprimary.c",1432,
2015-02-18 04:37:21.830764 PST,,,p30260,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'shutdown' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in shutdown' process name(pid) 'filerep main process(30260)' filerep state 'not initialized'
position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3963' ",,0,,"cdbfilerepprimaryack.c",898,
2015-02-18 04:37:21.830834 PST,,,p30260,th1291128944,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete operation on mirror, failover requested","identifier 'shutdown' operation 'shutdown' relation type 'control message' message count '-1'","run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in shutdown' process name(pid) 'filerep main process(30260)' filerep state 'not initialized'
position ack begin '0x2baa5d5fc040' position ack end '0x2baa5d67c040' position ack insert '0x2baa5d648630' position ack consume '0x2baa5d648630' position ack wraparound '0x2baa5d67c040' insert count ack '3962' consume count ack '3963'
position begin '0x2baa5cdfb040' position end '0x2baa5d5fb100' position insert '0x2baa5d0ba240' position consume '0x2baa5d0b8528' position wraparound '0x2baa5d5fb100' insert count '39007' consume count '39007' ",,0,,"cdbfilerepprimary.c",1311,
2015-02-18 04:37:21.885026 PST,,,p28586,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","filerep main process (PID 30260) exited with exit code 0",,,,,,,0,,"postmaster.c",5876,
2015-02-18 04:37:21.891763 PST,,,p31001,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","mirror transition, primary address(port) 'sdw4(49341)' mirror address(port) 'sdw3(56341)'",,,,,"mirroring role 'primary role' mirroring state 'change tracking' segment state 'not initialized' process name(pid) 'filerep main process(31001)' filerep state 'not initialized' ",,0,,"cdbfilerep.c",3466,
2015-02-18 04:37:21.893572 PST,,,p31002,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","CHANGETRACKING: ChangeTracking_RetrieveIsTransitionToInsync() found insync_transition_completed:'true' full resync:'false'",,,,,,,0,,"cdbresynchronizechangetracking.c",2545,
2015-02-18 04:37:21.894052 PST,,,p31002,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","last checkpoint location for generating initial changetracking log 1/38000490",,,,,,,0,,"xlog.c",11560,
2015-02-18 04:37:21.894123 PST,,,p31002,th1291128944,,,,0,,,seg-1,,,,,"LOG","00000","scanned through 1 initial xlog records since last checkpoint for writing into the resynchronize change log",,,,,,,0,,"cdbresynchronizechangetracking.c",206,

The master ftsprobe process checks the primary segment's health. It determines that the primary segment for content 3 (dbid 5) is healthy, but that this primary cannot communicate with its mirror (dbid 9). The primary reports the mirroring fault to ftsprobe, which marks the mirror segment as down. The related master log entries are shown below:

2015-02-18 04:33:31.855817 PST,,,p21713,th2056423744,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: segment (dbid=5, content=3) reported fault FaultMirror segmentstatus 11 to the prober.",,,,,,,0,,,,
2015-02-18 04:33:31.855864 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=5) reported mirroring fault with mirror (dbid=9), mirror considered to be down.",,,,,,,0,,"ftsfilerep.c",358,
2015-02-18 04:33:31.855883 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=5, content=3) from ('u','p') to ('u','p')",,,,,,,0,,"fts.c",1157,
2015-02-18 04:33:31.855891 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=9, content=3) from ('u','m') to ('d','m')",,,,,,,0,,"fts.c",1157,
2015-02-18 04:33:31.855901 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,x21888,sx1,"LOG","00000","probeUpdateConfig called for 2 changes",,,,,,,0,,"fts.c",976,
2015-02-18 04:33:31.856719 PST,,,p21713,th1770300528,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=5) on sdw4:43401 transitioning to change-tracking mode, mirror marked as down.",,,,,,,0,,"ftsfilerep.c",498,
2015-02-18 04:33:32.032252 PST,,,p21706,th1770300528,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
Success:",,,,,,,,"SysLoggerMain","syslogger.c",552,
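
These FTS entries can be located in the master instance's log; a minimal sketch (the log file name is illustrative, and $MASTER_DATA_DIRECTORY is assumed to be set in the gpadmin environment):

[gpadmin@mdw ~]$ grep "FTS:" $MASTER_DATA_DIRECTORY/pg_log/gpdb-2015-02-18_000000.csv | tail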

The master then updates the mirror segment's status to down ('d') in the gp_segment_configuration catalog:

flightdata=# select * from gp_segment_configuration where status='d';
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    9 |       3 | m    | m              | s    | d      | 59341 | sdw3     | sdw3    |            56341 |
(1 row)
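
To finish the exercise, the suspended mirror processes can be resumed and the mirror recovered, as the primary segment log itself suggests ("run gprecoverseg to re-establish mirror connectivity"). A minimal sketch, assuming the PIDs from the ps output above are still valid:

[gpadmin@sdw3 ~]$ kill -s SIGCONT 31312 31313 31314 31315 31316 31317 31318
[gpadmin@mdw ~]$ gprecoverseg -a    # re-establish mirror connectivity
[gpadmin@mdw ~]$ gpstate -m         # confirm the mirror returns to a synchronized state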

Additional Information

If the mirror segment is down, check the primary segment log. If the mirror went down because gp_segment_connect_timeout expired, the cause is usually one of the following (basic checks for each are sketched after the list):

  • The network is slow
  • The load on the server is high
  • The server is unreachable
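
A few quick checks that can help narrow down the cause. This is only a sketch; the host names are the ones from this example cluster:

[gpadmin@sdw4 ~]$ ping -c 3 sdw3                           # reachability from the primary's host to the mirror's host
[gpadmin@sdw4 ~]$ ssh sdw3 uptime                          # load average on the mirror's host
[gpadmin@mdw ~]$ gpcheckperf -h sdw3 -h sdw4 -r N -d /tmp  # network performance between the two segment hosts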

The timeout can be increased (see the sketch below); however, it is important to understand why the timeout was reached in the first place. It is possible that the system is overloaded and needs more time to keep the mirror segments in sync.
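
Following the hint in the WARNING shown earlier ("increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'"), the parameter can be raised and reloaded without a restart. A minimal sketch; the 1200-second (20-minute) value is only an illustration:

[gpadmin@mdw ~]$ gpconfig -c gp_segment_connect_timeout -v 1200   # set the new value (in seconds) on the master and all segments
[gpadmin@mdw ~]$ gpstop -u                                        # reload the configuration without restarting the cluster
[gpadmin@mdw ~]$ gpconfig -s gp_segment_connect_timeout           # verify the change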

For more details on how to change this parameter, refer to the "System Administration" guide for the relevant Greenplum version at https://docs.vmware.com/en/VMware-Greenplum/index.html.