gprecoverseg timing out after 10 mins in Greenplum

Article ID: 295377

Updated On:

Products

VMware Tanzu Greenplum, Greenplum, Pivotal Data Suite Non Production Edition, VMware Tanzu Data Suite

Issue/Introduction

Symptoms:
When gprecoverseg is run to recover a down segment, it can time out and fail at the "Commencing parallel primary conversion" stage:
20190522:13:14:41:016319 gprecoverseg:OS2DRHGPLUM03:gpadmin2-[INFO]:-Updating primaries
20190522:13:14:41:016319 gprecoverseg:OS2DRHGPLUM03:gpadmin2-[INFO]:-Commencing parallel primary conversion of 4 segments, please wait...
20190522:13:24:43:016319 gprecoverseg:OS2DRHGPLUM03:gpadmin2-[INFO]:-Process results...
20190522:13:24:43:016319 gprecoverseg:OS2DRHGPLUM03:gpadmin2-[WARNING]:-Failed to inform primary segment of updated mirroring state.  Segment: OS2DRHGPLUM02:/DATA1/greenplumdb/mirror/gpseg5:content=5:dbid=15:mode=r:status=u: REASON: Conversion failed.  stdout:""  stderr:"failure: Error: MirroringFailure failure: Error: MirroringFailure "

Environment


Cause

The log shows gprecoverseg failing about 10 minutes after the conversion starts (13:14:41 to 13:24:43). This happens because the operation reaches the limit set by the gp_segment_connect_timeout GUC, which is set to 10 minutes by default.

[gpadmin@mdw ~]$ gpconfig -s gp_segment_connect_timeout
Values on all segments are consistent
GUC          : gp_segment_connect_timeout
Master  value: 10min
Segment value: 10min
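
To confirm that a failure lines up with this timeout, the gap between the "Commencing parallel primary conversion" line and the "Process results..." line can be computed from the log prefixes. The following is a minimal sketch, assuming both timestamps fall on the same day; the function names to_seconds and elapsed_seconds are hypothetical helpers, not part of Greenplum.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compute the time between two gprecoverseg log lines.
# Log prefixes look like YYYYMMDD:HH:MM:SS, optionally followed by :PID.

# Convert the HH:MM:SS part of a log prefix to seconds since midnight.
to_seconds() {
    local day h m s rest
    IFS=: read -r day h m s rest <<< "$1"
    # 10# forces base 10 so values such as "08" are not parsed as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Elapsed seconds between two prefixes (assumes the same calendar day).
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Timestamps taken from the log excerpt in this article.
elapsed_seconds "20190522:13:14:41" "20190522:13:24:43"   # prints 602
```

A result close to 600 seconds matches the 10min value reported by gpconfig above.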

For a more detailed explanation on the GUC and possible causes for why this timeout value can be reached, you can refer to these articles:

Resolution

In general, increasing gp_segment_connect_timeout is not recommended. Instead, find the underlying reason the timeout is being reached. Possible causes include:

1. The network is slow or unreachable. Check that a TCP connection from the primary segment host to the mirror segment host over the replication port succeeds. You can use the telnet or ncat utility:
nc -vz <host> <port>
2. The server has high load and is not responding.

3. The segment data directory filesystem is unreachable or unresponsive.
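
The three checks above can be sketched as a small set of shell functions to run on the affected segment hosts. This is an illustrative sketch only: the function names and the 5 and 10 second limits are assumptions, /proc/loadavg and nproc assume a Linux host, and the host, port, and data directory must be replaced with the values for the failing segment.

```shell
#!/usr/bin/env bash
# Hypothetical helper functions; names and time limits are illustrative.

# 1. Network: can a TCP connection to host:port be opened within 5 seconds?
check_port() {
    local host="$1" port="$2"
    nc -vz -w 5 "$host" "$port"
}

# 2. Load: print the 1, 5 and 15 minute load averages next to the CPU count.
check_load() {
    echo "load averages: $(cut -d' ' -f1-3 /proc/loadavg)  cpus: $(nproc)"
}

# 3. Filesystem: does the data directory respond within 10 seconds?
#    Returns non-zero if the path is missing or the command times out.
check_datadir() {
    local dir="$1"
    timeout 10 ls -d "$dir" > /dev/null
}

# Example usage (placeholders, not real values):
#   check_port <host> <port>
#   check_load
#   check_datadir /DATA1/greenplumdb/mirror/gpseg5
```

A load average well above the CPU count, a refused or timed-out connection, or a data directory check that does not return promptly each point to one of the causes listed above.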