There are several articles that go through how the Fault Tolerance Server (FTS) process in the master works to make sure there is a timely response from the primary and mirror segments to the master; see those articles for more background on the FTS process itself.
This article will expand on FTS and focus more on primary and mirror fault detection using the filerep process, and on what to check on the network side if you are facing issues with segment-level connectivity between primary and mirror segments.
When a segment goes down, you can usually check for FTS probe messages in the master log. If there is no such message in the master log and there are no other obvious symptoms, then you may be facing an issue with connectivity between the primary and mirror segments. You will generally see the following types of messages in the master and primary segment log files:
2019-07-10 13:40:29.571788 EDT,,,p13044,th1321363200,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: segment (dbid=2, content=0) reported fault FaultMirror segmentstatus 11 to the prober.",,,,,,,0,,,,
2019-07-10 13:40:29.575816 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=2) reported mirroring fault with mirror (dbid=6), mirror considered to be down.",,,,,,,0,,"ftsfilerep.c",371,
2019-07-10 13:40:29.575859 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=2, content=0) from ('u','p') to ('u','p')",,,,,,,0,,"fts.c",975,
2019-07-10 13:40:29.575878 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=6, content=0) from ('u','m') to ('d','m')",,,,,,,0,,"fts.c",975,
2019-07-10 13:40:29.575895 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,sx1,"LOG","00000","probeUpdateConfig called for 2 changes",,,,,,,0,,"fts.c",835,
2019-07-10 13:40:29.604825 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=2) on sdw1:40000 transitioning to change-tracking mode, mirror marked as down.",,,,,,,0,,"ftsfilerep.c",509,
2019-07-10 13:40:06.611704 EDT,,,p8887,th907028288,,,1999-12-31 19:00:00 EST,0,,,seg-1,,,,,"WARNING","XX000","receive close on connection: Success (cdbfilerepconnserver.c:343)",,,,,,,0,,"cdbfilerepconnserver.c",343,
2019-07-10 13:40:06.748137 EDT,,,p8887,th907028288,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' filerep state 'up and running' process name(pid) 'primary receiver ack process( 8887)' 'cdbfilerep.c' 'L2444' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1824,"Stack trace: 1 0x96016b postgres errstart (elog.c:521) 2 0xa12054 postgres FileRep_InsertConfigLogEntryInternal (cdbfilerep.c:1806) 3 0xa240e7 postgres FileRepAckPrimary_StartReceiver (cdbfilerepprimaryack.c:159) 4 0xa15ff5 postgres FileRepSubProcess_Main (cdbfilerepservice.c:833) 5 0xa10486 postgres <symbol not found> (cdbfilerep.c:2667) 6 0xa1511c postgres FileRep_Main (cdbfilerep.c:3549) 7 0x588336 postgres AuxiliaryProcessMain (bootstrap.c:503) 8 0x7d302b postgres <symbol not found> (postmaster.c:7395) 9 0x7d79ac postgres StartFilerepProcesses (postmaster.c:1622) 10 0x7e12a9 postgres doRequestedPrimaryMirrorModeTransitions (primary_mirror_mode.c:1760) 11 0x7dc071 postgres <symbol not found> (postmaster.c:2465) 12 0x7de2ba postgres PostmasterMain (postmaster.c:1533) 13 0x4cb1b7 postgres main (main.c:206) 14 0x7f523162eb15 libc.so.6 __libc_start_main + 0xf5 15
2019-07-10 13:40:06.755898 EDT,,,p8890,th907028288,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'heartBeat' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' process name(pid) 'primary recovery process(8890)' filerep state 'fault' position ack begin '0x7f5226a92040' position ack end '0x7f5226b12040' position ack insert '0x7f5226a95118' position ack consume '0x7f5226a95118' position ack wraparound '0x7f5226b12040' insert count ack '128' consume count ack '128' ",,0,,"cdbfilerepprimaryack.c",875,
2019-07-10 13:40:06.755949 EDT,,,p8890,th907028288,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete operation on mirror, failover requested","identifier 'heartBeat' operation 'heart beat' relation type 'control message' message count '-1'","run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' process name(pid) 'primary recovery process(8890)' filerep state 'fault' position ack begin '0x7f5226a92040' position ack end '0x7f5226b12040' position ack insert '0x7f5226a95118' position ack consume '0x7f5226a95118' position ack wraparound '0x7f5226b12040' insert count ack '128' consume count ack '128' position begin '0x7f5226291040' position end '0x7f5226a91100' position insert '0x7f52265cc348' position consume '0x7f52265cc348' position wraparound '0x7f5226a91100' insert count '637' consume count '637' ",,0,,"cdbfilerepprimary.c",1371,
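To find messages like these quickly, you can grep the master and segment logs for FTS and filerep entries. The log file names and paths below are only examples; adjust them to your environment and the date of the incident:

[gpadmin@mdw ~]$ grep "FTS:" $MASTER_DATA_DIRECTORY/pg_log/gpdb-2019-07-10_000000.csv
[gpadmin@sdw1 ~]$ grep -E "mirror failure|filerep" /data1/primary/gpseg0/pg_log/gpdb-2019-07-10_000000.csv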
You can run gprecoverseg as the message suggests to re-establish mirror connectivity; however, it is important to check why the segment went down in the first place, otherwise the recovery may not complete if the issue is still there. You can run into this situation when there are no problems with the network connection between the master and segment hosts, which means the FTS probe process will not detect the issue. Instead, there is a separate File Replication (filerep) process that ensures the primary and mirror segment pair stays in sync.
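Once the root cause has been addressed, a typical recovery pass looks something like the following (a sketch only; options can vary by situation). gpstate -e lists segments that still have mirroring issues:

[gpadmin@mdw ~]$ gpstate -e          # show segments with mirroring/fault status
[gpadmin@mdw ~]$ gprecoverseg -a     # incremental recovery of the downed mirror
[gpadmin@mdw ~]$ gpstate -m          # confirm the mirrors are synchronized again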
There are two things to consider when determining whether a primary/mirror pair is no longer in sync, and they are controlled by two separate groups of GUCs.
gp_filerep_tcp_keepalives_count: How many keepalives may be lost before the connection is considered dead. (Default is 2)
gp_filerep_tcp_keepalives_idle: Number of seconds between sending keepalives on an otherwise idle connection. (Default is 1 minute)
gp_filerep_tcp_keepalives_interval: How many seconds to wait for a response to a keepalive before retransmitting. (Default is 30 seconds)

Segment failover time = (gp_filerep_tcp_keepalives_idle) + (gp_filerep_tcp_keepalives_interval * gp_filerep_tcp_keepalives_count)

If we go with the default values, then segment failover will occur after (1 minute + (30 seconds * 2)) = 2 minutes.
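You can confirm the values in effect on your cluster with gpconfig and plug them into the formula above:

[gpadmin@mdw ~]$ gpconfig -s gp_filerep_tcp_keepalives_idle
[gpadmin@mdw ~]$ gpconfig -s gp_filerep_tcp_keepalives_interval
[gpadmin@mdw ~]$ gpconfig -s gp_filerep_tcp_keepalives_count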
This generally occurs when a TCP connection is unable to receive any data from the destination host, for example when the replication port is blocked or busy. The connection will go into a "keepalive" state, which means:
1. It will check the current TCP connection to make sure it is still operating.
2. If the current connection does not respond, the client will try to keep the connection alive by sending a keepalive probe.
3. It will look for a response within the time defined by gp_filerep_tcp_keepalives_interval. If there is still no response, it will retry steps 2 and 3 up to the number of times defined by gp_filerep_tcp_keepalives_count.
4. If the keepalive count is reached without a response, the TCP connection is dropped, FTS on the master is informed, and the mirror is marked down.
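If you want to watch this happening, ss can display the keepalive timer counting down on the filerep connection itself (the replication port 51000 below is from the example cluster used later in this article; substitute your own):

sdw1:~# ss -tno state established '( dport = :51000 )'

When keepalives are enabled you should see a timer:(keepalive,...) entry for the connection.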
If TCP keepalive is causing the segment to go down, you will see this in the primary segment log:
2019-07-10 13:40:06.611704 EDT,,,p8887,th907028288,,,1999-12-31 19:00:00 EST,0,,,seg-1,,,,,"WARNING","XX000","receive close on connection: Success (cdbfilerepconnserver.c:343)",,,,,,,0,,"cdbfilerepconnserver.c",343,
If the gp_filerep GUCs are set to 0, the connection will default to the operating system settings, which you can check with this command:

sdw1:~# grep -H . /proc/sys/net/ipv4/tcp_keep*
/proc/sys/net/ipv4/tcp_keepalive_intvl:30
/proc/sys/net/ipv4/tcp_keepalive_probes:8
/proc/sys/net/ipv4/tcp_keepalive_time:120
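If you need to set the GUCs back to explicit values (or tune them), gpconfig can change them. The values below are simply the defaults described above, and depending on your Greenplum version a restart may be required instead of a reload:

[gpadmin@mdw ~]$ gpconfig -c gp_filerep_tcp_keepalives_idle -v 60
[gpadmin@mdw ~]$ gpconfig -c gp_filerep_tcp_keepalives_interval -v 30
[gpadmin@mdw ~]$ gpconfig -c gp_filerep_tcp_keepalives_count -v 2
[gpadmin@mdw ~]$ gpstop -u           # reload the configuration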
gp_segment_connect_timeout
The gp_segment_connect_timeout GUC can also be used to determine primary/mirror connectivity. However, by default this value is set higher than the filerep GUCs, so it usually only comes into play if it is set to a lower value or during recovery operations (gprecoverseg).
[gpadmin@mdw ~]$ gpconfig -s gp_segment_connect_timeout
Values on all segments are consistent
GUC          : gp_segment_connect_timeout
Master  value: 10min
Segment value: 10min
Ten minutes is the default value. Usually, if there is a network issue, the filerep timeout will be hit first, but if you set gp_segment_connect_timeout to a lower value than the filerep timeout (or if TCP keepalive is disabled), then gp_segment_connect_timeout will take precedence. You will see messages such as these in the primary segment log:
2019-07-10 17:14:52.304062 EDT,,,p703,th-24942784,,,,0,,,seg-1,,,,,"WARNING","01000","threshold '75' percent of 'gp_segment_connect_timeout=600' is reached, mirror may not be able to keep up with primary, primary may transition to change tracking",,"increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'",,,,,0,,"cdbfilerepprimaryack.c",837,
2019-07-10 17:17:27.445597 EDT,,,p703,th-24942784,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'heartBeat' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' process name(pid) 'primary recovery process(703)' filerep state 'up and running' position ack begin '0x7fceef1c6040' position ack end '0x7fceef246040' position ack insert '0x7fceef1e9a18' position ack consume '0x7fceef1e9a18' position ack wraparound '0x7fceef246040' insert count ack '410' consume count ack '410' ",,0,,"cdbfilerepprimaryack.c",875
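Following the hint in the first message, you can raise the timeout with gpconfig and reload. The value is in seconds, and 900 here is only an example; pick something appropriate for your environment:

[gpadmin@mdw ~]$ gpconfig -c gp_segment_connect_timeout -v 900
[gpadmin@mdw ~]$ gpstop -u
[gpadmin@mdw ~]$ gpconfig -s gp_segment_connect_timeout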
Segment failover time = (heartBeat) + (gp_segment_connect_timeout)
The next section will show what the heartBeat is. The segments in a pair send a heartBeat to each other about once per minute. If the sender doesn't hear back from the receiver, then the gp_segment_connect_timeout timer starts.
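With the defaults, that works out to roughly 1 minute (the heartBeat interval) + 10 minutes (gp_segment_connect_timeout), so about 11 minutes before failover is requested.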
If the segment goes down due to hitting gp_segment_connect_timeout, in most cases it is because the segment data directory file system is busy or inaccessible and the segment is not able to send a proper response back. Other possibilities include high load or busy I/O.
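A couple of quick checks on the segment host can tell you whether the data directory file system is the problem. The paths below come from the example cluster used later in this article; substitute your own data directory and mount point:

sdw1:~# df -h /data1/primary/gpseg0                              # file system mounted and not full?
sdw1:~# time touch /data1/write_test && rm /data1/write_test     # crude write-latency check on the same mount
sdw1:~# iostat -x 5 3                                            # look for saturated devices and high await/util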
On the network side, here are some things to check:

1. Check the output of netstat -i on the primary and mirror segment hosts. If RX-ERR, RX-DRP, TX-ERR, or TX-DRP is showing increases, then it's best to check with your network team:
segment/097084cd-1f8c-4f21-bee5-099986e44491:/tmp$ netstat -i
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 34443562 0 2204 0 19468133 0 0 0 BMRU
lo 65536 10080380 0 0 0 10080380 0 0 0 LRU
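Because netstat -i shows cumulative counters, sample them twice to confirm the errors or drops are still increasing (assuming eth0 is the interface carrying the replication traffic):

sdw1:~# ip -s link show eth0; sleep 60; ip -s link show eth0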
2. Connecting from the primary host to the mirror host (or vice versa) through the replication port gives a "Connection refused" or "No route to host" error. This could indicate that a firewall is blocking the port, or some other cause is preventing the connection from being established. You can use telnet or ncat to check TCP connectivity.

Success:
segment/097084cd-1f8c-4f21-bee5-099986e44491:/tmp$ nc -v sdw2 51000
Ncat: Version 6.40 ( http://nmap.org/ncat )
Ncat: Connected to 10.193.102.205:51000.
Failure:
segment/097084cd-1f8c-4f21-bee5-099986e44491:/tmp$ nc -v sdw2 51000
Ncat: Version 6.40 ( http://nmap.org/ncat )
Ncat: No route to host.
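On the mirror host you can also verify that the mirror receiver process is actually listening on the replication port and that no firewall rule is rejecting it (iptables is shown here as an example; use whatever firewall tooling your OS runs):

sdw2:~# netstat -lnp | grep 51000                 # should show postgres LISTENing on the replication port
sdw2:~# iptables -L -n | grep -E "51000|REJECT|DROP"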
If the firewall is not enabled, then sometimes a reboot of the host will solve the issue. If not, it needs further investigation with the network team.

3. Check the gp_segment_connect_timeout GUC value.

4. Check the pg_xlog, pg_clog, pg_distributedlog and/or global directory in the segment data directory.

To see how the heartBeat works, let's look at the primary/mirror pair for content 0 in the example cluster:

gpadmin=# select * from gp_segment_configuration where content = 0 order by 1;
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port
------+---------+------+----------------+------+--------+-------+----------+---------+------------------
    2 |       0 | p    | p              | s    | u      | 40000 | sdw1     | sdw1    | 41000
    6 |       0 | m    | m              | s    | u      | 50000 | sdw2     | sdw2    | 51000

The primary segment will have the following processes:
segment/097084cd-1f8c-4f21-bee5-099986e44491:~$ ps -ef | grep 40000
gpadmin 20533 20485 0 14:55 pts/0 00:00:00 grep --color=auto 40000
gpadmin 26623 1 0 Jun14 ? 00:00:13 /var/vcap/store/gpadmin/greenplum-db-5.17.0/bin/postgres -D /data1/primary/gpseg0 -p 40000 --gp_dbid=2 --gp_num_contents_in_cluster=4 --silent-mode=true -i -M quiescent --gp_contentid=0
gpadmin 26627 26623 0 Jun14 ? 00:00:06 postgres: 40000, logger process
gpadmin 26648 26623 0 Jun14 ? 00:00:06 postgres: 40000, primary process
gpadmin 26659 26648 0 Jun14 ? 00:00:02 postgres: 40000, primary receiver ack process
gpadmin 26660 26648 0 Jun14 ? 00:00:04 postgres: 40000, primary sender process
gpadmin 26661 26648 0 Jun14 ? 00:00:01 postgres: 40000, primary consumer ack process
gpadmin 26662 26648 0 Jun14 ? 00:02:08 postgres: 40000, primary recovery process
gpadmin 26673 26623 0 Jun14 ? 00:00:04 postgres: 40000, stats collector process
gpadmin 26674 26623 0 Jun14 ? 00:00:32 postgres: 40000, writer process
gpadmin 26675 26623 0 Jun14 ? 00:00:10 postgres: 40000, checkpointer process
gpadmin 26676 26623 0 Jun14 ? 00:00:11 postgres: 40000, sweeper process
gpadmin 26677 26623 0 Jun14 ? 00:02:37 postgres: 40000, stats sender process
gpadmin 26678 26623 0 Jun14 ? 00:00:27 postgres: 40000, wal writer process

The one to pay attention to here is the primary sender process (PID 26660). If you check its connection, you'll see that there is a connection established from the primary to the mirror segment using the mirror's replication port:
segment/097084cd-1f8c-4f21-bee5-099986e44491:~$ netstat -lanp | grep 26660
tcp 0 0 10.193.102.204:35939 10.193.102.205:51000 ESTABLISHED 26660/postgres: 400
segment/097084cd-1f8c-4f21-bee5-099986e44491:~$ cd /proc/26660/fd
segment/097084cd-1f8c-4f21-bee5-099986e44491:/proc/26660/fd$ ls -l
total 0
lr-x------ 1 gpadmin gpadmin 64 Jun 18 14:55 0 -> /dev/null
l-wx------ 1 gpadmin gpadmin 64 Jun 18 14:55 1 -> pipe:[19339852]
l-wx------ 1 gpadmin gpadmin 64 Jun 18 14:55 2 -> pipe:[19339852]
lr-x------ 1 gpadmin gpadmin 64 Jun 18 14:55 3 -> pipe:[19339890]
l-wx------ 1 gpadmin gpadmin 64 Jun 18 14:55 4 -> pipe:[19339890]
lrwx------ 1 gpadmin gpadmin 64 Jun 18 14:55 5 -> socket:[19339902]

A strace of the Primary Sender Process will show the following entries:
26660 15:07:11.071428 sendto(5, "1\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000051>
26660 15:07:11.123031 sendto(5, "3\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000028>
26660 15:07:11.174542 sendto(5, "2\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000033>
26660 15:08:11.292755 sendto(5, "1\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000044>
26660 15:08:11.344231 sendto(5, "3\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000027>
26660 15:08:11.395684 sendto(5, "2\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000024>
26660 15:09:11.516287 sendto(5, "1\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000035>
26660 15:09:11.567447 sendto(5, "3\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000027>
26660 15:09:11.618616 sendto(5, "2\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000032>

It appears this process sends a burst of 3 heartbeats to the mirror every 60 seconds. On the flip side, the mirror receiver process reads these heartbeats off the socket (the recvfrom calls shown below), and the acknowledgment is sent back to the primary by the mirror sender ack process:
gpadmin 21798 1 0 15:56 ? 00:00:00 /var/vcap/store/gpadmin/greenplum-db-5.17.0/bin/postgres -D /data1/mirror/gpseg0 -p 50000 --gp_dbid=6 --gp_num_contents_in_cluster=4 --silent-mode=true -i -M quiescent --gp_contentid=0
gpadmin 21803 21798 0 15:56 ? 00:00:00 postgres: 50000, logger process
gpadmin 21817 21798 0 15:56 ? 00:00:00 postgres: 50000, mirror process
gpadmin 21821 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror receiver process
gpadmin 21822 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror consumer process
gpadmin 21823 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror consumer writer process
gpadmin 21824 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror consumer append only process
gpadmin 21825 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror sender ack process
gpadmin 22354 22307 0 16:05 pts/0 00:00:00 grep --color=auto 50000
segment/1f9ec963-286e-410f-bff1-47da3428fb3e:~$ netstat -lanp | grep 21821
tcp 0 0 10.193.102.205:51000 0.0.0.0:* LISTEN 21821/postgres: 500
tcp 0 0 10.193.102.205:51000 10.193.102.204:59457 ESTABLISHED 21821/postgres: 500
segment/1f9ec963-286e-410f-bff1-47da3428fb3e:~$ cd /proc/21821/fd
segment/1f9ec963-286e-410f-bff1-47da3428fb3e:/proc/21821/fd$ ls -l
total 0
lr-x------ 1 gpadmin gpadmin 64 Jun 18 16:06 0 -> /dev/null
l-wx------ 1 gpadmin gpadmin 64 Jun 18 16:06 1 -> pipe:[23443481]
l-wx------ 1 gpadmin gpadmin 64 Jun 18 16:06 2 -> pipe:[23443481]
lr-x------ 1 gpadmin gpadmin 64 Jun 18 16:06 3 -> pipe:[23443509]
l-wx------ 1 gpadmin gpadmin 64 Jun 18 16:06 4 -> pipe:[23443509]
lrwx------ 1 gpadmin gpadmin 64 Jun 18 16:06 5 -> socket:[23443512]
lrwx------ 1 gpadmin gpadmin 64 Jun 18 16:06 6 -> socket:[23443535]

A strace of the Mirror Receiver Process shows the heartbeat was successfully received:
21821 16:15:24.019757 recvfrom(6, "1\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 8192, 0, NULL, NULL) = 4161 <0.000012>
21821 16:15:24.071073 recvfrom(6, "3\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 8192, 0, NULL, NULL) = 4161 <0.000009>
21821 16:15:24.122254 recvfrom(6, "2\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 8192, 0, NULL, NULL) = 4161 <0.000011>

You will see the same type of activity between the "mirror sender ack process" (mirror) and the "primary receiver ack process".