There are several articles that go through how the Fault Tolerance Server (FTS) process in the master works to make sure there is a timely response from the primary and mirror segments to the master; see those articles for more background on the FTS process itself.
This article will expand on FTS and focus more on primary and mirror fault detection using the filerep process, and on what to check on the network side if you are facing issues with segment-level connectivity between primary and mirror segments.
When a segment goes down, you can usually check for FTS probe messages in the master log. If there is no such message in the master log and there are no other obvious symptoms, then you may be facing an issue with connectivity between the primary and mirror segments. You will generally see the following types of messages in the master and primary segment log files:
2019-07-10 13:40:29.571788 EDT,,,p13044,th1321363200,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: segment (dbid=2, content=0) reported fault FaultMirror segmentstatus 11 to the prober.",,,,,,,0,,,,
2019-07-10 13:40:29.575816 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=2) reported mirroring fault with mirror (dbid=6), mirror considered to be down.",,,,,,,0,,"ftsfilerep.c",371,
2019-07-10 13:40:29.575859 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=2, content=0) from ('u','p') to ('u','p')",,,,,,,0,,"fts.c",975,
2019-07-10 13:40:29.575878 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: change state for segment (dbid=6, content=0) from ('u','m') to ('d','m')",,,,,,,0,,"fts.c",975,
2019-07-10 13:40:29.575895 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,sx1,"LOG","00000","probeUpdateConfig called for 2 changes",,,,,,,0,,"fts.c",835,
2019-07-10 13:40:29.604825 EDT,,,p13044,th1760114496,,,,0,con2,,seg-1,,,,,"LOG","00000","FTS: primary (dbid=2) on sdw1:40000 transitioning to change-tracking mode, mirror marked as down.",,,,,,,0,,"ftsfilerep.c",509,
2019-07-10 13:40:06.611704 EDT,,,p8887,th907028288,,,1999-12-31 19:00:00 EST,0,,,seg-1,,,,,"WARNING","XX000","receive close on connection: Success (cdbfilerepconnserver.c:343)",,,,,,,0,,"cdbfilerepconnserver.c",343,
2019-07-10 13:40:06.748137 EDT,,,p8887,th907028288,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' filerep state 'up and running' process name(pid) 'primary receiver ack process( 8887)' 'cdbfilerep.c' 'L2444' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1824,"Stack trace: 1 0x96016b postgres errstart (elog.c:521) 2 0xa12054 postgres FileRep_InsertConfigLogEntryInternal (cdbfilerep.c:1806) 3 0xa240e7 postgres FileRepAckPrimary_StartReceiver (cdbfilerepprimaryack.c:159) 4 0xa15ff5 postgres FileRepSubProcess_Main (cdbfilerepservice.c:833) 5 0xa10486 postgres <symbol not found> (cdbfilerep.c:2667) 6 0xa1511c postgres FileRep_Main (cdbfilerep.c:3549) 7 0x588336 postgres AuxiliaryProcessMain (bootstrap.c:503) 8 0x7d302b postgres <symbol not found> (postmaster.c:7395) 9 0x7d79ac postgres StartFilerepProcesses (postmaster.c:1622) 10 0x7e12a9 postgres doRequestedPrimaryMirrorModeTransitions (primary_mirror_mode.c:1760) 11 0x7dc071 postgres <symbol not found> (postmaster.c:2465) 12 0x7de2ba postgres PostmasterMain (postmaster.c:1533) 13 0x4cb1b7 postgres main (main.c:206) 14 0x7f523162eb15 libc.so.6 __libc_start_main + 0xf5 15
2019-07-10 13:40:06.755898 EDT,,,p8890,th907028288,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'heartBeat' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' process name(pid) 'primary recovery process(8890)' filerep state 'fault' position ack begin '0x7f5226a92040' position ack end '0x7f5226b12040' position ack insert '0x7f5226a95118' position ack consume '0x7f5226a95118' position ack wraparound '0x7f5226b12040' insert count ack '128' consume count ack '128' ",,0,,"cdbfilerepprimaryack.c",875,
2019-07-10 13:40:06.755949 EDT,,,p8890,th907028288,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete operation on mirror, failover requested","identifier 'heartBeat' operation 'heart beat' relation type 'control message' message count '-1'","run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' process name(pid) 'primary recovery process(8890)' filerep state 'fault' position ack begin '0x7f5226a92040' position ack end '0x7f5226b12040' position ack insert '0x7f5226a95118' position ack consume '0x7f5226a95118' position ack wraparound '0x7f5226b12040' insert count ack '128' consume count ack '128' position begin '0x7f5226291040' position end '0x7f5226a91100' position insert '0x7f52265cc348' position consume '0x7f52265cc348' position wraparound '0x7f5226a91100' insert count '637' consume count '637' ",,0,,"cdbfilerepprimary.c",1371,
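To find messages like these quickly, you can grep the master and segment logs for FTS and filerep entries. The log file names and paths below are only examples; adjust them to your environment and the date of the incident:

[gpadmin@mdw ~]$ grep "FTS:" $MASTER_DATA_DIRECTORY/pg_log/gpdb-2019-07-10_000000.csv
[gpadmin@sdw1 ~]$ grep -E "mirror failure|filerep" /data1/primary/gpseg0/pg_log/gpdb-2019-07-10_000000.csv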
You can run gprecoverseg as the message suggests to re-establish mirror connectivity; however, it is important to check why the segment went down in the first place, otherwise the recovery may not complete if the issue is still there. You can run into this situation when there are no problems with the network connection between the master and segment hosts, which means the FTS probe process will not detect the issue. Instead, there is a separate File Replication (filerep) process that ensures the primary and mirror segment pair stays in sync.
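Once the root cause has been addressed, a typical recovery pass looks something like the following (a sketch only; options can vary by situation). gpstate -e lists segments that still have mirroring issues:

[gpadmin@mdw ~]$ gpstate -e          # show segments with mirroring/fault status
[gpadmin@mdw ~]$ gprecoverseg -a     # incremental recovery of the downed mirror
[gpadmin@mdw ~]$ gpstate -m          # confirm the mirrors are synchronized again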
There are two things to consider when determining whether a primary/mirror pair is no longer in sync, and they are controlled by two separate groups of GUCs.
gp_filerep_tcp_keepalives_count: How many keepalives may be lost before the connection is considered dead. (Default is 2)
gp_filerep_tcp_keepalives_idle: Number of seconds between sending keepalives on an otherwise idle connection. (Default is 1 minute)
gp_filerep_tcp_keepalives_interval: How many seconds to wait for a response to a keepalive before retransmitting. (Default is 30 seconds)

Segment failover time = (gp_filerep_tcp_keepalives_idle) + (gp_filerep_tcp_keepalives_interval * gp_filerep_tcp_keepalives_count)

If we go with the default values, then segment failover will occur after (1 minute + (30 seconds * 2)) = 2 minutes.
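You can confirm the values in effect on your cluster with gpconfig and plug them into the formula above:

[gpadmin@mdw ~]$ gpconfig -s gp_filerep_tcp_keepalives_idle
[gpadmin@mdw ~]$ gpconfig -s gp_filerep_tcp_keepalives_interval
[gpadmin@mdw ~]$ gpconfig -s gp_filerep_tcp_keepalives_count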
This generally occurs when a TCP connection is unable to receive any data from the destination host, for example when the replication port is blocked or busy. The connection will go into a "keepalive" state, which means:
1. It will check the current TCP connection to make sure it is still operating.
2. If the current connection does not respond, the client will try to keep the connection alive by sending a keepalive probe.
3. It will look for a response within the time defined by gp_filerep_tcp_keepalives_interval. If there is still no response, it will retry steps 2 and 3 up to the number of times defined by gp_filerep_tcp_keepalives_count.
4. If the keepalive count is reached without a response, the TCP connection is dropped, FTS on the master is informed, and the mirror is marked down.
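If you want to watch this happening, ss can display the keepalive timer counting down on the filerep connection itself (the replication port 51000 below is from the example cluster used later in this article; substitute your own):

sdw1:~# ss -tno state established '( dport = :51000 )'

When keepalives are enabled you should see a timer:(keepalive,...) entry for the connection.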
If TCP keepalive is causing the segment to go down, you will see this in the primary segment log:
2019-07-10 13:40:06.611704 EDT,,,p8887,th907028288,,,1999-12-31 19:00:00 EST,0,,,seg-1,,,,,"WARNING","XX000","receive close on connection: Success (cdbfilerepconnserver.c:343)",,,,,,,0,,"cdbfilerepconnserver.c",343,
If the gp_filerep GUCs are set to 0, the connection will default to the operating system settings, which you can check with this command:

sdw1:~# grep -H . /proc/sys/net/ipv4/tcp_keep*
/proc/sys/net/ipv4/tcp_keepalive_intvl:30
/proc/sys/net/ipv4/tcp_keepalive_probes:8
/proc/sys/net/ipv4/tcp_keepalive_time:120
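If you need to set the GUCs back to explicit values (or tune them), gpconfig can change them. The values below are simply the defaults described above, and depending on your Greenplum version a restart may be required instead of a reload:

[gpadmin@mdw ~]$ gpconfig -c gp_filerep_tcp_keepalives_idle -v 60
[gpadmin@mdw ~]$ gpconfig -c gp_filerep_tcp_keepalives_interval -v 30
[gpadmin@mdw ~]$ gpconfig -c gp_filerep_tcp_keepalives_count -v 2
[gpadmin@mdw ~]$ gpstop -u           # reload the configuration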
gp_segment_connect_timeout
The gp_segment_connect_timeout GUC can also be used to determine primary/mirror connectivity. However, by default this value is set higher than the filerep GUCs, so it usually only comes into play if it is set to a lower value or during recovery operations (gprecoverseg).
[gpadmin@mdw ~]$ gpconfig -s gp_segment_connect_timeout
Values on all segments are consistent
GUC          : gp_segment_connect_timeout
Master  value: 10min
Segment value: 10min
Ten minutes is the default value. Usually, if there is a network issue, the filerep timeout will be hit first, but if you set gp_segment_connect_timeout to a lower value than the filerep timeout (or if TCP keepalive is disabled), then gp_segment_connect_timeout will take precedence. You will see messages such as these in the primary segment log:
2019-07-10 17:14:52.304062 EDT,,,p703,th-24942784,,,,0,,,seg-1,,,,,"WARNING","01000","threshold '75' percent of 'gp_segment_connect_timeout=600' is reached, mirror may not be able to keep up with primary, primary may transition to change tracking",,"increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'",,,,,0,,"cdbfilerepprimaryack.c",837,
2019-07-10 17:17:27.445597 EDT,,,p703,th-24942784,,,,0,,,seg-1,,,,,"WARNING","01000","mirror failure, could not complete mirrored request identifier 'heartBeat' ack state 'waiting for ack', failover requested",,"run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'sync' segment state 'up and running' process name(pid) 'primary recovery process(703)' filerep state 'up and running' position ack begin '0x7fceef1c6040' position ack end '0x7fceef246040' position ack insert '0x7fceef1e9a18' position ack consume '0x7fceef1e9a18' position ack wraparound '0x7fceef246040' insert count ack '410' consume count ack '410' ",,0,,"cdbfilerepprimaryack.c",875
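Following the hint in the first message, you can raise the timeout with gpconfig and reload. The value is in seconds, and 900 here is only an example; pick something appropriate for your environment:

[gpadmin@mdw ~]$ gpconfig -c gp_segment_connect_timeout -v 900
[gpadmin@mdw ~]$ gpstop -u
[gpadmin@mdw ~]$ gpconfig -s gp_segment_connect_timeout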
Segment failover time = (heartBeat) + (gp_segment_connect_timeout)
The next section will show what the heartBeat is. The segments in a pair send a heartBeat to each other about once per minute. If the sender doesn't hear back from the receiver, then the gp_segment_connect_timeout timer starts.
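With the defaults, that works out to roughly 1 minute (the heartBeat interval) + 10 minutes (gp_segment_connect_timeout), so about 11 minutes before failover is requested.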
If the segment goes down due to hitting gp_segment_connect_timeout, in most cases it is because the segment data directory file system is busy or inaccessible and the segment is not able to send a proper response back. Other possibilities include high load or busy I/O.
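A couple of quick checks on the segment host can tell you whether the data directory file system is the problem. The paths below come from the example cluster used later in this article; substitute your own data directory and mount point:

sdw1:~# df -h /data1/primary/gpseg0                              # file system mounted and not full?
sdw1:~# time touch /data1/write_test && rm /data1/write_test     # crude write-latency check on the same mount
sdw1:~# iostat -x 5 3                                            # look for saturated devices and high await/util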
On the network side, here are some things to check:

1. Check the output of netstat -i on the primary and mirror segment hosts. If RX-ERR, RX-DRP, TX-ERR, or TX-DRP is showing increases, then it's best to check with your network team:
segment/097084cd-1f8c-4f21-bee5-099986e44491:/tmp$ netstat -i
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 34443562 0 2204 0 19468133 0 0 0 BMRU
lo 65536 10080380 0 0 0 10080380 0 0 0 LRU
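Because netstat -i shows cumulative counters, sample them twice to confirm the errors or drops are still increasing (assuming eth0 is the interface carrying the replication traffic):

sdw1:~# ip -s link show eth0; sleep 60; ip -s link show eth0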
2. Connecting from the primary host to the mirror host (or vice versa) through the replication port gives a "Connection refused" or "No route to host" error. This could indicate that a firewall is blocking the port, or some other cause is preventing the connection from being established. You can use telnet or ncat to check TCP connectivity.

Success:
segment/097084cd-1f8c-4f21-bee5-099986e44491:/tmp$ nc -v sdw2 51000
Ncat: Version 6.40 ( http://nmap.org/ncat )
Ncat: Connected to 10.193.102.205:51000.
Failure:
segment/097084cd-1f8c-4f21-bee5-099986e44491:/tmp$ nc -v sdw2 51000
Ncat: Version 6.40 ( http://nmap.org/ncat )
Ncat: No route to host.
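On the mirror host you can also verify that the mirror receiver process is actually listening on the replication port and that no firewall rule is rejecting it (iptables is shown here as an example; use whatever firewall tooling your OS runs):

sdw2:~# netstat -lnp | grep 51000                 # should show postgres LISTENing on the replication port
sdw2:~# iptables -L -n | grep -E "51000|REJECT|DROP"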
If the firewall is not enabled, then sometimes a reboot of the host will solve the issue. If not, it needs further investigation with the network team.

3. Check the gp_segment_connect_timeout GUC value.

4. Check the pg_xlog, pg_clog, pg_distributedlog and/or global directory in the segment data directory.

To see how the heartBeat works, let's look at the primary/mirror pair for content 0 in the example cluster:

gpadmin=# select * from gp_segment_configuration where content = 0 order by 1;
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port
------+---------+------+----------------+------+--------+-------+----------+---------+------------------
    2 |       0 | p    | p              | s    | u      | 40000 | sdw1     | sdw1    | 41000
    6 |       0 | m    | m              | s    | u      | 50000 | sdw2     | sdw2    | 51000

The primary segment will have the following processes:
segment/097084cd-1f8c-4f21-bee5-099986e44491:~$ ps -ef | grep 40000
gpadmin 20533 20485 0 14:55 pts/0 00:00:00 grep --color=auto 40000
gpadmin 26623 1 0 Jun14 ? 00:00:13 /var/vcap/store/gpadmin/greenplum-db-5.17.0/bin/postgres -D /data1/primary/gpseg0 -p 40000 --gp_dbid=2 --gp_num_contents_in_cluster=4 --silent-mode=true -i -M quiescent --gp_contentid=0
gpadmin 26627 26623 0 Jun14 ? 00:00:06 postgres: 40000, logger process
gpadmin 26648 26623 0 Jun14 ? 00:00:06 postgres: 40000, primary process
gpadmin 26659 26648 0 Jun14 ? 00:00:02 postgres: 40000, primary receiver ack process
gpadmin 26660 26648 0 Jun14 ? 00:00:04 postgres: 40000, primary sender process
gpadmin 26661 26648 0 Jun14 ? 00:00:01 postgres: 40000, primary consumer ack process
gpadmin 26662 26648 0 Jun14 ? 00:02:08 postgres: 40000, primary recovery process
gpadmin 26673 26623 0 Jun14 ? 00:00:04 postgres: 40000, stats collector process
gpadmin 26674 26623 0 Jun14 ? 00:00:32 postgres: 40000, writer process
gpadmin 26675 26623 0 Jun14 ? 00:00:10 postgres: 40000, checkpointer process
gpadmin 26676 26623 0 Jun14 ? 00:00:11 postgres: 40000, sweeper process
gpadmin 26677 26623 0 Jun14 ? 00:02:37 postgres: 40000, stats sender process
gpadmin 26678 26623 0 Jun14 ? 00:00:27 postgres: 40000, wal writer process

The one to pay attention to here is the primary sender process (PID 26660). If you check its connection, you'll see that there is a connection established from the primary to the mirror segment using the mirror's replication port:
segment/097084cd-1f8c-4f21-bee5-099986e44491:~$ netstat -lanp | grep 26660
tcp 0 0 10.193.102.204:35939 10.193.102.205:51000 ESTABLISHED 26660/postgres: 400
segment/097084cd-1f8c-4f21-bee5-099986e44491:~$ cd /proc/26660/fd
segment/097084cd-1f8c-4f21-bee5-099986e44491:/proc/26660/fd$ ls -l
total 0
lr-x------ 1 gpadmin gpadmin 64 Jun 18 14:55 0 -> /dev/null
l-wx------ 1 gpadmin gpadmin 64 Jun 18 14:55 1 -> pipe:[19339852]
l-wx------ 1 gpadmin gpadmin 64 Jun 18 14:55 2 -> pipe:[19339852]
lr-x------ 1 gpadmin gpadmin 64 Jun 18 14:55 3 -> pipe:[19339890]
l-wx------ 1 gpadmin gpadmin 64 Jun 18 14:55 4 -> pipe:[19339890]
lrwx------ 1 gpadmin gpadmin 64 Jun 18 14:55 5 -> socket:[19339902]

A strace of the Primary Sender Process will show the following entries:
26660 15:07:11.071428 sendto(5, "1\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000051>
26660 15:07:11.123031 sendto(5, "3\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000028>
26660 15:07:11.174542 sendto(5, "2\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000033>
26660 15:08:11.292755 sendto(5, "1\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000044>
26660 15:08:11.344231 sendto(5, "3\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000027>
26660 15:08:11.395684 sendto(5, "2\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000024>
26660 15:09:11.516287 sendto(5, "1\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000035>
26660 15:09:11.567447 sendto(5, "3\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000027>
26660 15:09:11.618616 sendto(5, "2\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0\340\16\333\337\377\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 4161, MSG_NOSIGNAL, NULL, 0) = 4161 <0.000032>

It appears this process sends a burst of 3 heartbeats to the mirror every 60 seconds. On the flip side, the mirror receiver process reads these heartbeats off the socket (the recvfrom calls shown below), and the acknowledgment is sent back to the primary by the mirror sender ack process:
gpadmin 21798 1 0 15:56 ? 00:00:00 /var/vcap/store/gpadmin/greenplum-db-5.17.0/bin/postgres -D /data1/mirror/gpseg0 -p 50000 --gp_dbid=6 --gp_num_contents_in_cluster=4 --silent-mode=true -i -M quiescent --gp_contentid=0
gpadmin 21803 21798 0 15:56 ? 00:00:00 postgres: 50000, logger process
gpadmin 21817 21798 0 15:56 ? 00:00:00 postgres: 50000, mirror process
gpadmin 21821 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror receiver process
gpadmin 21822 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror consumer process
gpadmin 21823 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror consumer writer process
gpadmin 21824 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror consumer append only process
gpadmin 21825 21817 0 15:56 ? 00:00:00 postgres: 50000, mirror sender ack process
gpadmin 22354 22307 0 16:05 pts/0 00:00:00 grep --color=auto 50000
segment/1f9ec963-286e-410f-bff1-47da3428fb3e:~$ netstat -lanp | grep 21821
tcp 0 0 10.193.102.205:51000 0.0.0.0:* LISTEN 21821/postgres: 500
tcp 0 0 10.193.102.205:51000 10.193.102.204:59457 ESTABLISHED 21821/postgres: 500
segment/1f9ec963-286e-410f-bff1-47da3428fb3e:~$ cd /proc/21821/fd
segment/1f9ec963-286e-410f-bff1-47da3428fb3e:/proc/21821/fd$ ls -l
total 0
lr-x------ 1 gpadmin gpadmin 64 Jun 18 16:06 0 -> /dev/null
l-wx------ 1 gpadmin gpadmin 64 Jun 18 16:06 1 -> pipe:[23443481]
l-wx------ 1 gpadmin gpadmin 64 Jun 18 16:06 2 -> pipe:[23443481]
lr-x------ 1 gpadmin gpadmin 64 Jun 18 16:06 3 -> pipe:[23443509]
l-wx------ 1 gpadmin gpadmin 64 Jun 18 16:06 4 -> pipe:[23443509]
lrwx------ 1 gpadmin gpadmin 64 Jun 18 16:06 5 -> socket:[23443512]
lrwx------ 1 gpadmin gpadmin 64 Jun 18 16:06 6 -> socket:[23443535]

A strace of the Mirror Receiver Process shows the heartbeat was successfully received:
21821 16:15:24.019757 recvfrom(6, "1\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 8192, 0, NULL, NULL) = 4161 <0.000012>
21821 16:15:24.071073 recvfrom(6, "3\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 8192, 0, NULL, NULL) = 4161 <0.000009>
21821 16:15:24.122254 recvfrom(6, "2\0\0\20@\0\0\0\0\5\0\0\0\f\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0\0`\25\354\275\374\177\0\0\200\277\241\0\0\0\0\0heartBeat\0\0\0\0\0\0"..., 8192, 0, NULL, NULL) = 4161 <0.000011>

You will see the same type of activity between the "mirror sender ack process" (mirror) and the "primary receiver ack process".