This article explains how the parameters gp_fts_probe_timeout
and gp_fts_probe_retries
work in all versions of Pivotal Greenplum Database (GPDB).
Refer to this article for more information about how fault tolerance server (FTS) works.
gp_fts_probe_timeout
is the allowed timeout for the fault detection process (ftsprobe
) to establish a connection to a segment before declaring it down.
gp_fts_probe_retries
is the number of attempts the segments are probed.
For example, assume the default timeout parameter value is:
[gpadmin@###~]$ gpconfig -s gp_fts_probe_timeout Values on all segments are consistent GUC : gp_fts_probe_timeout Master value: 20s Segment value: 20s
The cluster information with primary segments is:
flightdata=# select * from gp_segment_configuration where role='p'; dbid | content | role | preferred_role | mode | status | port | hostname | address | replication_port | san_mounts ------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------ 1 | -1 | p | p | s | u | 4340 | mdw | mdw | | 2 | 0 | p | p | s | u | 43400 | sdw3 | sdw3 | 49340 | 3 | 1 | p | p | s | u | 43401 | sdw3 | sdw3 | 49341 | 4 | 2 | p | p | s | u | 43400 | sdw4 | sdw4 | 49340 | 5 | 3 | p | p | s | u | 43401 | sdw4 | sdw4 | 49341 | (5 rows)
The ftsprobe
of the master only communicates to the primary segments and not to the mirror. Therefore, the gp_fts_probe_timeout
is the time allowed for a primary segment to respond back to the master call. The mirror is contacted by the primary segment for its health.
The FTS probe that processes information for a cluster is running on port 4340
is:
[gpadmin@### ~]$ ps -ef | grep -i fts | grep 4340 gpadmin 16695 16687 0 Feb12 ? 00:00:00 postgres: port 4340, ftsprobe process
Interrupt a primary segment process communication by asking the primary process postmaster to sleep:
[gpadmin@### ~]$ ssh sdw3 Last login: Tue Feb 17 10:45:54 2015 from mdw [gpadmin@### ~]$ ps -ef | grep 43400 | grep silent gpadmin 10769 1 0 Feb12 ? 00:00:05 /usr/local/greenplum-db-4.3.4.0/bin/postgres -D /data1/primary/fai_43400 -p 43400 -b 2 -z 4 --silent-mode=true -i -M quiescent -C 0 [gpadmin@### ~]$ kill -s SIGSTOP 10769 [gpadmin@### ~]$
When on master, the FTS probe sometimes fails based on the parameter gp_fts_probe_interval
. Turn on the strace
for the ftsprobe
process. Notice that the ftsprobe
process goes into FUTEX WAIT
mode during this particular ping.
[gpadmin@### ~]$ strace -p 16695 Process 16695 attached - interrupt to quit select(0, NULL, NULL, NULL, {36, 403000} ) = 0 (Timeout) getppid() = 16687 lseek(12, 0, SEEK_END) = 32768 lseek(13, 0, SEEK_END) = 32768 clone(child_stack=0x2b2f68969230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f689699d0, tls=0x2b2f68969940, child_tidptr=0x2b2f689699d0) = 18012 clone(child_stack=0x2b2f689aa230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f689aa9d0, tls=0x2b2f689aa940, child_tidptr=0x2b2f689aa9d0) = 18013 clone(child_stack=0x2b2f689eb230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f689eb9d0, tls=0x2b2f689eb940, child_tidptr=0x2b2f689eb9d0) = 18014 clone(child_stack=0x2b2f68a2c230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68a2c9d0, tls=0x2b2f68a2c940, child_tidptr=0x2b2f68a2c9d0) = 18015 clone(child_stack=0x2b2f68a6d230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68a6d9d0, tls=0x2b2f68a6d940, child_tidptr=0x2b2f68a6d9d0) = 18016 clone(child_stack=0x2b2f68aae230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68aae9d0, tls=0x2b2f68aae940, child_tidptr=0x2b2f68aae9d0) = 18017 clone(child_stack=0x2b2f68aef230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68aef9d0, tls=0x2b2f68aef940, child_tidptr=0x2b2f68aef9d0) = 18018 clone(child_stack=0x2b2f68b30230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68b309d0, tls=0x2b2f68b30940, child_tidptr=0x2b2f68b309d0) = 18019 clone(child_stack=0x2b2f68b71230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68b719d0, tls=0x2b2f68b71940, child_tidptr=0x2b2f68b719d0) = 18020 clone(child_stack=0x2b2f68bb2230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68bb29d0, tls=0x2b2f68bb2940, child_tidptr=0x2b2f68bb29d0) = 18021 clone(child_stack=0x2b2f68bf3230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68bf39d0, tls=0x2b2f68bf3940, child_tidptr=0x2b2f68bf39d0) = 18022 clone(child_stack=0x2b2f68c34230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68c349d0, tls=0x2b2f68c34940, child_tidptr=0x2b2f68c349d0) = 18023 clone(child_stack=0x2b2f68c75230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68c759d0, tls=0x2b2f68c75940, child_tidptr=0x2b2f68c759d0) = 18024 clone(child_stack=0x2b2f68cb6230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68cb69d0, tls=0x2b2f68cb6940, child_tidptr=0x2b2f68cb69d0) = 18025 clone(child_stack=0x2b2f68cf7230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68cf79d0, tls=0x2b2f68cf7940, child_tidptr=0x2b2f68cf79d0) = 18026 clone(child_stack=0x2b2f68d38230, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b2f68d389d0, tls=0x2b2f68d38940, child_tidptr=0x2b2f68d389d0) = 18027 futex(0x2b2f689699d0, FUTEX_WAIT, 18012, NULL) = 0
After 20 seconds of not being able to communicate with the primary segments, the master logs emit the information below:
2015-02-18 03:06:44.095779 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 21000 ms.",,,,,,,0,,,, 2015-02-18 03:06:45.095721 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: retry 1 to probe segment (content=0, dbid=2).",,,,,,,0,,,, 2015-02-18 03:07:05.096777 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 20001 ms.",,,,,,,0,,,, 2015-02-18 03:07:06.096726 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: retry 2 to probe segment (content=0, dbid=2).",,,,,,,0,,,, 2015-02-18 03:07:26.097782 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 20001 ms.",,,,,,,0,,,, 2015-02-18 03:07:27.097728 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: retry 3 to probe segment (content=0, dbid=2).",,,,,,,0,,,, 2015-02-18 03:07:47.098792 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 20001 ms.",,,,,,,0,,,, 2015-02-18 03:07:48.098726 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: retry 4 to probe segment (content=0, dbid=2).",,,,,,,0,,,, 2015-02-18 03:08:09.098705 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) due to timeout expiration, probe elapsed time: 20999 ms.",,,,,,,0,,,, 2015-02-18 03:08:09.098740 PST,,,p16695,th1754700096,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: failed to probe segment (content=0, dbid=2) after trying 5 time(s), maximum number of retries reached.",,,,,,,0,,,, 2015-02-18 03:08:09.103340 PST,,,p16695,th1470440560,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=2, content=0) from ('u','p') to ('d','m')",,,,,,,0,,"fts.c",1157, 2015-02-18 03:08:09.103407 PST,,,p16695,th1470440560,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=6, content=0) from ('u','m') to ('u','p')",,,,,,,0,,"fts.c",1157, 2015-02-18 03:08:09.103504 PST,,,p16695,th1470440560,,,,0,con3,,seg-1,,,x21745,sx1,"LOG","00000","probeUpdateConfig called for 2 changes",,,,,,,0,,"fts.c",976, 2015-02-18 03:08:09.123765 PST,,,p16695,th1470440560,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: mirror (dbid=6) on sdw4:59340 taking over as primary in change-tracking mode.",,,,,,,0,,"ftsfilerep.c",492, 2015-02-18 03:08:09.130391 PST,,,p16695,th1758693696,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: segment (dbid=6, content=0) has not reached new state yet, expected state: ('p','c'), reported state: ('m','s').",,,,,,,0,,,, 2015-02-18 03:08:09.248675 PST,,,p16688,th1470440560,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log: Success:",,,,,,,,"SysLoggerMain","syslogger.c",552,
The default value of the gp_fts_probe_retries
parameter is 5. The ftsprobe
retries 5 times as per parameter gp_fts_probe_retries
before marking them down.
Note: gp_fts_probe_retries
cannot be modified.
[gpadmin@### ~]$ gpconfig -s gp_fts_probe_retries; Values on all segments are consistent GUC : gp_fts_probe_retries Master value: 5 Segment value: 5
The timeout ends and the segments are marked down. Then, the ftsprobe
updates the postmaster based on changes.
flightdata=# select * from gp_segment_configuration where status='d'; dbid | content | role | preferred_role | mode | status | port | hostname | address | replication_port | san_mounts ------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------ 2 | 0 | m | p | s | d | 43400 | sdw3 | sdw3 | 49340 |
The remaining trace information of the ftsprobe
after the timeout is:
select(3, NULL, [2], NULL, {1, 0}) = 1 (out [2], left {1, 0}) write(2, "\0\0\0\0\240\0\0\0007A\0\0p$\245W\0\0\0\0\0\0\0\0tcf\t\0\0\0\0"..., 208) = 208 select(3, NULL, [2], NULL, {1, 0}) = 1 (out [2], left {1, 0}) write(2, "\0\0\0\0\240\0\0\0007A\0\0p$\245W\0\0\0\0\0\0\0\0tcf\t\0\0\0\0"..., 208) = 208 select(3, NULL, [2], NULL, {1, 0}) = 1 (out [2], left {1, 0}) write(2, "\0\0\0\0y\0\0\0007A\0\0p$\245W\0\0\0\0\0\0\0\0tcf\t\0\0\0\0"..., 169) = 169 brk(0x949b000) = 0x949b000 brk(0x9494000) = 0x9494000 open("base/10899/5095", O_RDWR) = 15 read(15, "\0\0\0\0\310J\2\4\1\0\0\0,\0\360\177\360\177\4\200b1\5\0\2\0\0\0\1\0\0\0"..., 32768) = 32768 read(15, "\0\0\0\0\210\211x\4\1\0\0\0\324\3\360p\360\177\4\200\340\377 \0\320\377 \0\300\377 \0"..., 32768) = 32768 open("base/10899/5094", O_RDWR) = 16 read(16, "\0\0\0\0H\216\214\4\1\0\1\0\354\3\200K\0\200\4\200\310\377p\0\220\377p\0X\377p\0"..., 32768) = 32768 open("global/5006", O_RDWR) = 17 lseek(17, 0, SEEK_END) = 32768 open("global/6107", O_RDWR) = 18 read(18, "\0\0\0\0\340&\216\4\1\0\0\0,\0\360\177\360\177\4\200b1\5\0\2\0\0\0\1\0\0\0"..., 32768) = 32768 read(18, "\0\0\0\0H\222\372\20\1\0\0\0<\0P\177\360\177\4\200\340\377 \0\320\377 \0\260\377 \0"..., 32768) = 32768 brk(0x94b5000) = 0x94b5000 brk(0x94ae000) = 0x94ae000 open("base/10899/2678", O_RDWR) = 19 read(19, "\0\0\0\0 \0\0\0\1\0\0\0,\0\360\177\360\177\4\200b1\5\0\2\0\0\0\1\0\0\0"..., 32768) = 32768 read(19, "\0\0\0\0\220vr\4\1\0\0\0008\2`w\360\177\4\200\340\377 \0\320\377 \0\300\377 \0"..., 32768) = 32768 open("global/6106", O_RDWR) = 20 read(20, "\0\0\0\0@&\216\4\1\0\0\0,\0\360\177\360\177\4\200b1\5\0\2\0\0\0\1\0\0\0"..., 32768) = 32768 read(20, "\0\0\0\0\0\221\372\20\1\0\0\0<\0P\177\360\177\4\200P\377 \0\340\377 \0\300\377 \0"..., 32768) = 32768 open("pg_xlog/000000010000000000000005", O_RDWR) = 21 lseek(21, 40337408, SEEK_SET) = 40337408 write(21, "^\320\1\0\1\0\0\0\0\0\0\0\0\200g\26\364\17\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768 fsync(21) = 0 kill(16730, SIGUSR1) = 0 getppid() = 16687 read(3, 0x7fff10ca6300, 16) = -1 EAGAIN (Resource temporarily unavailable) poll([{fd=3, events=POLLIN}, {fd=-1}], 2, -1) = -1 EINTR (Interrupted system call) --- SIGUSR1 (User defined signal 1) @ 0 (0) --- write(4, "\0", 1) = 1 rt_sigreturn(0x2) = -1 EINTR (Interrupted system call) read(3, "\0", 16) = 1 semctl(390955089, 7, SETVAL, 0) = 0 select(3, NULL, [2], NULL, {1, 0}) = 1 (out [2], left {1, 0}) write(2, "\0\0\0\0\254\0\0\0007A\0\0p$\245W\0\0\0\0\0\0\0\0tcf\0\0\0\0\0"..., 220) = 220 rt_sigaction(SIGINT, {0x1, [], SA_RESTORER, 0x3bd50302d0}, {0x784f10, [], SA_RESTORER|SA_RESTART, 0x3bd5c0eca0}, 8) = 0 rt_sigaction(SIGQUIT, {0x1, [], SA_RESTORER, 0x3bd50302d0}, {0x95c5a0, [], SA_RESTORER|SA_RESTART, 0x3bd5c0eca0}, 8) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7fff10ca54b8) = 18211 wait4(18211, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 18211 rt_sigaction(SIGINT, {0x784f10, [], SA_RESTORER|SA_RESTART, 0x3bd50302d0}, NULL, 8) = 0 rt_sigaction(SIGQUIT, {0x95c5a0, [], SA_RESTORER|SA_RESTART, 0x3bd50302d0}, NULL, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 -- The primary segment that was stopped from communication was continued and recovered using gprecoverseg. [gpadmin@### ~]$ kill -s SIGCONT 10769 [gpadmin@### ~]$ gprecoverseg -a
When the master log reports segments being down due to a timeout expiration, it is a result of the primary postgres process not communicating with the ftsprobe
pings.
A few reasons why this happens are listed below:
1. The network or server is not reachable.
2. The server has a high load, delaying the response (Check Sar, IOSTAT etc.).
3. The postmaster or the primary segments process is stuck or unresponsive. This a result of the sudden termination of the child process or anything related (check the primary logs for more information like OOM, PANIC, FATAL etc.).
4. Other unknown factors.