Error "could not receive data from WAL stream, server closed the connection unexpectedly"

Article ID: 295723

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Symptoms:

The following error message appears in the standby master pg_log:

2017-12-07 16:19:21.179732 GMT+5,,,p82774,th-1443289280,,,,0,,,seg-1,,,,,"ERROR","XX000","could not receive data from WAL stream: server closed the connection unexpectedly (gp_libpqwalreceiver.c:371)"," 
This probably means the server terminated abnormally 
before or while processing the request.",,,,,,0,,"gp_libpqwalreceiver.c",371,"Stack trace: 
1 0xb0a80e postgres errstart + 0x4de 
2 0x931354 postgres walrcv_receive + 0x2b4 
3 0x936d5b postgres WalReceiverMain + 0x45b 
4 0x5f7068 postgres AuxiliaryProcessMain + 0x818 
5 0x8ef2f4 postgres <symbol not found> + 0x8ef2f4 
6 0x8f315c postgres <symbol not found> + 0x8f315c 
7 0x7fe2a64537e0 libpthread.so.0 <symbol not found> + 0xa64537e0 
8 0x7fe2a59eb503 libc.so.6 __select + 0x13 
9 0x8f9178 postgres <symbol not found> + 0x8f9178 
10 0x8fc4c0 postgres PostmasterMain + 0xff0 
11 0x7fdc9f postgres main + 0x44f 
12 0x7fe2a5928d1d libc.so.6 __libc_start_main + 0xfd 
13 0x4bfeb9 postgres <symbol not found> + 0x4bfeb9

The following error message appears in the master pg_log:

2017-12-08 14:14:17.582087 EST,"gpadmin",,p480717,th1272289088,"172.28.8.251","30785",2017-12-08 14:10:17 EST,0,con85,,seg-1,,,,,"LOG","00000","terminating walsender process due to replication timeout",,,,,,,0,,"walsender.c",703,
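To confirm that both symptoms are present, you can search the recent pg_log files on each host. The paths below are illustrative; log file locations and names vary by installation, so adjust the data directory for your cluster:

```shell
# Illustrative only: adjust $MASTER_DATA_DIRECTORY and log file names
# to match your environment before running.

# On the active master, look for the walsender timeout:
grep "terminating walsender process due to replication timeout" \
    "$MASTER_DATA_DIRECTORY"/pg_log/*.csv

# On the standby master, look for the dropped WAL stream:
grep "could not receive data from WAL stream" \
    "$MASTER_DATA_DIRECTORY"/pg_log/*.csv
```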

Environment


Cause

For Greenplum Database (GPDB) master mirroring, the replication_timeout GUC sets the maximum time, in milliseconds, that the walsender process on the active master waits for a status message from the walreceiver process on the standby master.

If no status message arrives within that interval, the walsender terminates the connection and logs the "terminating walsender process due to replication timeout" message; on the standby side, the dropped connection surfaces as "could not receive data from WAL stream: server closed the connection unexpectedly".
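The timeout mechanism can be illustrated with a toy shell sketch (this is not Greenplum code): a reader waits up to a fixed interval for a status message, and gives up when none arrives in time, analogous to the walsender terminating the connection. Here the "status message" is deliberately delayed past the timeout:

```shell
# Toy illustration only: the walsender waits up to replication_timeout
# for a status message from the walreceiver. Here the timeout is 1
# second, but the message only arrives after 2 seconds, so the wait
# fails, mimicking the timeout path.
REPLICATION_TIMEOUT=1   # seconds here; the actual GUC is in milliseconds

if sleep 2 | read -t "$REPLICATION_TIMEOUT" status; then
  RESULT="received"
else
  RESULT="timeout"
  echo "terminating walsender process due to replication timeout"
fi
echo "$RESULT"
```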

Resolution

In some cases, a viable workaround is to increase the value of the replication_timeout GUC (specified in milliseconds). Note that the default value for this GUC is one minute.


1. Check the current value of the GUC:

$ gpconfig -s replication_timeout
Values on all segments are consistent
GUC          : replication_timeout
Master  value: 1min
Segment value: 1min

2. Increase the value. For example, set it to 90 seconds (90000ms):

$ gpconfig -c replication_timeout -v 90000ms
20171208:15:26:18:014230 gpconfig:mdw:gpadmin-[INFO]:-completed successfully with parameters '-c replication_timeout -v 90000ms'

3. Issue gpstop -u so the running system reloads the new setting:

$ gpstop -u
20171208:15:26:23:014549 gpstop:mdw:gpadmin-[INFO]:-Starting gpstop with args: -u
20171208:15:26:23:014549 gpstop:mdw:gpadmin-[INFO]:-Gathering information and validating the environment...
20171208:15:26:23:014549 gpstop:mdw:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20171208:15:26:23:014549 gpstop:mdw:gpadmin-[INFO]:-Obtaining Segment details from master...
20171208:15:26:23:014549 gpstop:mdw:gpadmin-[INFO]:-Greenplum Version: 'postgres (Greenplum Database) 4.3.18.0 build 1'
20171208:15:26:23:014549 gpstop:mdw:gpadmin-[INFO]:-Signalling all postmaster processes to reload

4. Finally, verify that the change took effect:

$ gpconfig -s replication_timeout
Values on all segments are consistent
GUC          : replication_timeout
Master  value: 90s
Segment value: 90s
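After reloading, you can also confirm that master mirroring is healthy again with gpstate; the -f option displays standby master details, including the replication state (run this on the master host; output fields vary by GPDB version):

```shell
# Display standby master details; a healthy standby typically reports
# the WAL sender in a streaming/synchronized state.
gpstate -f
```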

Note: The above workaround may not resolve every occurrence of this error. If increasing replication_timeout does not help, please reach out to Pivotal Support for further assistance.