Mirror segments are continually failing or "interconnect encountered a network error"


Article ID: 296550


Products

VMware Tanzu Greenplum

Issue/Introduction

The error message "interconnect encountered a network error" is reported frequently

and/or

mirror segments keep failing, especially under heavy load.
The mirror log reports:
2022-05-16 16:39:23.898024 +03,,,p36705,th-1798047872,,,,0,,,seg-1,,,,,"WARNING","08006","mirror failure, could not sent message to mirror msg header count '657925' local count '657925' : Connection reset by peer, failover requested","identifier 'base/16346565/16333637.2' operation 'write' relation type 'buffer pool' message count '657925'","run gprecoverseg to re-establish mirror connectivity",,,"mirroring role 'primary role' mirroring state 'resync' segment state 'up and running' process name(pid) 'primary sender process(36705)' filerep state 'up and running' position begin '0x7f6a84c15040' position end '0x7f6a85415100' position insert '0x7f6a85189dd0' position consume '0x7f6a8518abd8' position wraparound '0x7f6a8540d060' insert count '658159' consume count '657925' ",,0,,"cdbfilerepprimary.c",1772,
2022-05-16 16:39:23.968007 +03,,,p36704,th-1798047872,,,2000-01-01 03:00:00 +03,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'resync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary receiver ack process(36704)' 'cdbfilerep.c' 'L2444' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1824,"Stack trace:
1    0x9659eb postgres errstart (elog.c:521)
2    0xa178d4 postgres FileRep_InsertConfigLogEntryInternal (cdbfilerep.c:1806)
3    0xa1af5c postgres FileRepSubProcess_SetState (cdbfilerepservice.c:555)
4    0xa1b232 postgres FileRepSubProcess_ProcessSignals (cdbfilerepservice.c:281)
5    0xa2952d postgres <symbol not found> (cdbfilerepprimaryack.c:260)
6    0xa299c9 postgres FileRepAckPrimary_StartReceiver (cdbfilerepprimaryack.c:232)
7    0xa1b875 postgres FileRepSubProcess_Main (cdbfilerepservice.c:834)
8    0xa15d06 postgres <symbol not found> (cdbfilerep.c:2667)
9    0xa1a99c postgres FileRep_Main (cdbfilerep.c:3549)
10   0x58a733 postgres AuxiliaryProcessMain (bootstrap.c:513)
11   0x7d841b postgres <symbol not found> (postmaster.c:7395)
12   0x7dcd9c postgres StartFilerepProcesses (postmaster.c:1622)
13   0x7e6699 postgres doRequestedPrimaryMirrorModeTransitions (primary_mirror_mode.c:1760)
14   0x7e1461 postgres <symbol not found> (postmaster.c:2465)
15   0x7e36aa postgres PostmasterMain (postmaster.c:1533)
16   0x4cdbe7 postgres main (main.c:206)
17   0x7f6a90243545 libc.so.6 __libc_start_main + 0xf5
18   0x4ce19c postgres <symbol not found> + 0x4ce19c
"
2022-05-16 16:39:23.968217 +03,,,p36704,th-1798047872,,,2000-01-01 03:00:00 +03,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'resync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary receiver ack process(36704)' 'cdbfilerepservice.c' 'L574' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1824,
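
To gauge how often this failure occurs, and whether it clusters around busy periods, the matching log entries can be counted per hour. The following is a minimal sketch: the function name mirror_failures_by_hour is hypothetical, and the log files passed to it are an assumption that depends on where the segment data directories live on each host.

```shell
# Sketch: count "mirror failure" log entries per hour.
# Usage: mirror_failures_by_hour LOGFILE...
# Relies on each log entry starting with "YYYY-MM-DD HH:MM:SS",
# so the first 13 characters identify the date and hour.
mirror_failures_by_hour() {
    awk '/mirror failure/ { count[substr($0, 1, 13)]++ }
         END { for (hour in count) print hour, count[hour] }' "$@" | sort
}
```

Each output line holds the date, the hour, and the number of "mirror failure" entries logged in that hour.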


Environment

Product Version: 5.28

Resolution

Note: The following changes can usually be made while the database is up; however, it is advised to shut the database down before making any of them. Perform the steps below on ALL hosts in the cluster.
 

LRO and GRO settings for bond0

Check the lro (large-receive-offload) and gro (generic-receive-offload) settings for the interface used for the interconnect. The examples below assume "bond0" as the interconnect interface; the name may differ between clusters and possibly between hosts in the same cluster.
ethtool -k bond0
ethtool -k bond0 | egrep 'generic-receive-offload|large-receive-offload'
The lro (large-receive-offload) setting should be "on" and the gro (generic-receive-offload) setting should be "off".
To change the settings live on the host, log in as the "root" user and run:
ethtool -K bond0 lro on gro off
To make the settings persistent after a reboot, add the following line to the file /etc/sysconfig/network-scripts/ifcfg-bond0:
ETHTOOL_OPTS="-K ${DEVICE} lro on gro off"
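
The LRO/GRO check can be scripted so that it is easy to repeat on every host. This is a minimal sketch, assuming the usual "ethtool -k" output of one "feature: state" pair per line; the function name check_offload is hypothetical. It reads the ethtool output on standard input, prints the two states, and returns non-zero unless lro is "on" and gro is "off".

```shell
# Sketch: verify the recommended offload settings from `ethtool -k` output.
# Reads the output on stdin; exit status 0 means lro=on and gro=off.
check_offload() {
    awk -F': *' '
        $1 ~ /large-receive-offload/   { lro = $2 }
        $1 ~ /generic-receive-offload/ { gro = $2 }
        END {
            # Drop any trailing annotation such as " [fixed]".
            sub(/ .*/, "", lro); sub(/ .*/, "", gro)
            printf "lro=%s gro=%s\n", lro, gro
            exit !(lro == "on" && gro == "off")
        }'
}
```

For example, "ethtool -k bond0 | check_offload" prints the current states, and the exit status can be used to decide whether "ethtool -K bond0 lro on gro off" still needs to be run.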

Ring buffer sizes on NICs for bond0

Find the NICs used in bond0:
cd /etc/sysconfig/network-scripts/
egrep 'MASTER=bond0' ifcfg-*
or run
ifconfig -a | egrep -B3 $(ifconfig bond0 | egrep ether | awk '{print $2}')
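
On hosts that use the Linux bonding driver, the kernel also lists the member NICs of a bond in sysfs, which avoids parsing the ifcfg files or the ifconfig output. A small sketch follows; the function name bond_slaves and its optional second argument (an alternative sysfs root, useful only for testing) are hypothetical.

```shell
# Sketch: print the member NICs of a bond, one per line.
# $1 = bond name (e.g. bond0)
# $2 = optional sysfs root, defaults to /sys/class/net
bond_slaves() {
    # The bonding driver stores the members as a space-separated list.
    tr ' ' '\n' < "${2:-/sys/class/net}/$1/bonding/slaves"
}
```

Running "bond_slaves bond0" on a host prints each NIC that belongs to bond0 on its own line.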
Once the NICs are known, check the ring buffer sizes with
ethtool -g <NIC>
Note: replace <NIC> with the appropriate NIC name.
Verify the "Current hardware settings" in the output. RX and TX should be set to the "Pre-set maximums".
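
Comparing the two sections of the "ethtool -g" output by eye is error-prone across many hosts, so the comparison can be automated. This is a minimal sketch, assuming the common output layout with a "Pre-set maximums:" section followed by a "Current hardware settings:" section; the function name ring_check is hypothetical.

```shell
# Sketch: compare current RX/TX ring sizes with the pre-set maximums.
# Reads `ethtool -g <NIC>` output on stdin; prints "rx cur/max tx cur/max".
# Exit status 0 means both current values equal their maximums.
ring_check() {
    awk '
        /^Pre-set maximums/          { section = "max" }
        /^Current hardware settings/ { section = "cur" }
        # Only the plain "RX:" and "TX:" rows; "RX Mini:" etc. are skipped.
        $1 == "RX:" || $1 == "TX:"   { val[section, $1] = $2 }
        END {
            printf "rx %s/%s tx %s/%s\n",
                val["cur", "RX:"], val["max", "RX:"],
                val["cur", "TX:"], val["max", "TX:"]
            exit !(val["max", "RX:"] != "" &&
                   val["cur", "RX:"] == val["max", "RX:"] &&
                   val["cur", "TX:"] == val["max", "TX:"])
        }'
}
```

For example, "ethtool -g <NIC> | ring_check" returns non-zero when a NIC still needs its ring buffers raised.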

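To change the settings live on the NIC, log in as the "root" user and run the following command. The example assumes pre-set maximums of 4096; use the maximums that ethtool -g reports for the NIC.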
ethtool -G <NIC> tx 4096 rx 4096
Note: replace <NIC> with the appropriate NIC name.

To make the settings persistent after a reboot, add the following line to the file /etc/sysconfig/network-scripts/ifcfg-<NIC> for each NIC:
ETHTOOL_OPTS="-G ${DEVICE} rx 4096 tx 4096"
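
When the ifcfg files are edited on many hosts, it helps to make the change repeatable so that running it twice does not leave duplicate lines. The following is a minimal sketch of one way to do that; the function name set_ethtool_opts is hypothetical. It replaces any existing ETHTOOL_OPTS line, so options already stored there would need to be merged by hand first.

```shell
# Sketch: set ETHTOOL_OPTS in an ifcfg file without creating duplicates.
# $1 = ifcfg file, $2 = ethtool options to store
# Note: an existing ETHTOOL_OPTS line is discarded, not merged.
set_ethtool_opts() {
    _tmp=$(mktemp) || return 1
    # Keep every line except a previous ETHTOOL_OPTS entry
    # (grep exits 1 when it keeps no line, which is not an error here).
    grep -v '^ETHTOOL_OPTS=' "$1" > "$_tmp" || true
    printf 'ETHTOOL_OPTS="%s"\n' "$2" >> "$_tmp"
    # Write back into the original file to keep its ownership and mode.
    cat "$_tmp" > "$1"
    rm -f "$_tmp"
}
```

An example call, run as "root", would be: set_ethtool_opts /etc/sysconfig/network-scripts/ifcfg-<NIC> '-G ${DEVICE} rx 4096 tx 4096'. The single quotes keep ${DEVICE} literal so that the network scripts expand it later.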