gprecoverseg fails to start segment with "ssh_exchange_identification: Connection closed by remote host" in Tanzu Greenplum
Article ID: 296703


Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

gprecoverseg fails to start a segment with stderr='ssh_exchange_identification: Connection closed by remote host':
[gpadmin@mdw 285185]$  gprecoverseg -B 1  -a -i ../rec.out
20210723:13:59:49:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Starting gprecoverseg with args: -B 1 -a -i ../rec.out
20210723:13:59:49:052349 gprecoverseg:mdw:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.17.0 build commit:9b887d27cef94c03ce3a3e63e4f6eefb9204631b'
20210723:13:59:49:052349 gprecoverseg:mdw:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.17.0 build commit:9b887d27cef94c03ce3a3e63e4f6eefb9204631b) on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jul  7 2021 03:04:37'
20210723:13:59:49:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Obtaining Segment details from master...
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Heap checksum setting is consistent between master and the segments that are candidates for recoverseg
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Greenplum instance recovery parameters
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Recovery from configuration -i option supplied
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Recovery 1 of 1
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Synchronization mode                 = Incremental
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance host                 = sdw1
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance address              = sdw1-2
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance directory            = /data10/mirror/gpseg9
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance port                 = 41009
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance host        = sdw2
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance address     = sdw2-1
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance directory   = /data10/primary/gpseg9
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance port        = 40009
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Target                      = in-place
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
20210723:14:00:18:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Starting to create new pg_hba.conf on primary segments
=========================================================================
Use of this computer system is for authorized and management approved use
only. All usage is subject to monitoring. Unauthorized use is strictly
prohibited and subject to prosecution and/or corrective action up to and
including termination of employment.
=========================================================================
=========================================================================
Use of this computer system is for authorized and management approved use
only. All usage is subject to monitoring. Unauthorized use is strictly
prohibited and subject to prosecution and/or corrective action up to and
including termination of employment.
=========================================================================
20210723:14:00:19:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Successfully modified pg_hba.conf on primary segments to allow replication connections
20210723:14:00:19:052349 gprecoverseg:mdw:gpadmin-[INFO]:-1 segment(s) to recover
20210723:14:00:19:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Ensuring 1 failed segment(s) are stopped
20210723:14:00:20:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20210723:14:00:20:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Updating configuration with new mirrors
20210723:14:00:20:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Updating mirrors
20210723:14:00:20:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Running pg_rewind on failed segments
sdw1 (dbid 779): skipping pg_rewind on mirror as recovery.conf is present
20210723:14:00:22:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Starting mirrors
20210723:14:00:22:052349 gprecoverseg:mdw:gpadmin-[INFO]:-era is e285c4edd87917b7_210719221035
20210723:14:00:22:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Commencing parallel segment instance startup, please wait...
.................................
20210723:14:00:55:052349 gprecoverseg:mdw:gpadmin-[INFO]:-Process results...
20210723:14:00:55:052349 gprecoverseg:mdw:gpadmin-[WARNING]:-Failed to start segment.  The fault prober will shortly mark it as down. Segment: sdw1:/data10/mirror/gpseg9:content=9:dbid=779:role=m:preferred_role=m:mode=n:status=d: REASON: cmd had rc=255 completed=True halted=False
  stdout=''
  stderr='ssh_exchange_identification: Connection closed by remote host'
You have new mail in /var/spool/mail/gpadmin


Environment

Product Version: 6.16

Resolution

When running gprecoverseg with verbose (-v) mode, you see the same error at the end:
20210723:13:58:13:001436 gprecoverseg:mdw:gpadmin-[DEBUG]:-Running Command: $GPHOME/sbin/gpsegstart.py -M mirrorless -V 'postgres (Greenplum Database) 6.17.0 build commit:9b887d27cef94c03ce3a3e63e4f6eefb9204631b' -n 768 --era e285c4edd87917b7_210719221035 -t 600 -v -D '779|9|m|m|n|d|sdw1|sdw1-2|41009|/data10/mirror/gpseg9' -b 64
.................................20210723:13:58:46:001436 gprecoverseg:mdw:gpadmin-[DEBUG]:-[worker0] finished cmd: remote segment starts on host 'sdw1' cmdStr='ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 sdw1-2 ". /usr/local/greenplum-db-6.17.0/greenplum_path.sh; $GPHOME/sbin/gpsegstart.py -M mirrorless -V 'postgres (Greenplum Database) 6.17.0 build commit:9b887d27cef94c03ce3a3e63e4f6eefb9204631b' -n 768 --era e285c4edd87917b7_210719221035 -t 600 -v -D '779|9|m|m|n|d|sdw1|sdw1-2|41009|/data10/mirror/gpseg9' -b 64"'  had result: cmd had rc=255 completed=True halted=False
  stdout=''
  stderr='ssh_exchange_identification: Connection closed by remote host
'

20210723:13:58:46:001436 gprecoverseg:mdw:gpadmin-[INFO]:-Process results...
20210723:13:58:46:001436 gprecoverseg:mdw:gpadmin-[WARNING]:-Failed to start segment.  The fault prober will shortly mark it as down. Segment: sdw1:/data10/mirror/gpseg9:content=9:dbid=779:role=m:preferred_role=m:mode=n:status=d: REASON: cmd had rc=255 completed=True halted=False
  stdout=''
  stderr='ssh_exchange_identification: Connection closed by remote host
'
20210723:13:58:46:001436 gprecoverseg:mdw:gpadmin-[DEBUG]:-WorkerPool haltWork()
20210723:13:58:46:001436 gprecoverseg:mdw:gpadmin-[DEBUG]:-[worker0] haltWork
20210723:13:58:46:001436 gprecoverseg:mdw:gpadmin-[DEBUG]:-[worker0] got a halt cmd

However, when you test connectivity to the hostname (sdw1) directly, it works as expected: 
[gpadmin@mdw]$  ping sdw1 -c 4
PING sdw1 (10.130.2.2) 56(84) bytes of data.
64 bytes from sdw1 (10.130.2.2): icmp_seq=1 ttl=64 time=0.134 ms
64 bytes from sdw1 (10.130.2.2): icmp_seq=2 ttl=64 time=0.109 ms
64 bytes from sdw1 (10.130.2.2): icmp_seq=3 ttl=64 time=0.095 ms
64 bytes from sdw1 (10.130.2.2): icmp_seq=4 ttl=64 time=0.108 ms

--- sdw1 ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms

When checking the gprecoverseg verbose output, you see that the ssh command used by gprecoverseg actually targets the address column from gp_segment_configuration (sdw1-2 for the failed mirror), not the hostname column (sdw1).
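To confirm which name gprecoverseg will use, you can query the catalog directly. A minimal sketch against a live cluster (the WHERE clause assumes content 9, as in this example):

```shell
# hostname vs. address for the affected content ID; gprecoverseg
# connects over ssh using the "address" column
psql -d postgres -c "
SELECT dbid, content, role, status, hostname, address, port
FROM   gp_segment_configuration
WHERE  content = 9;"
```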

If you test the address from gp_segment_configuration (sdw1-2), the ping fails; the address resolves to a different subnet (10.130.4.x instead of 10.130.2.x).
[gpadmin@mdw]$  ping  sdw1-2  -c 4
PING sdw1-2 (10.130.4.2) 56(84) bytes of data.
From sdw1-2 (10.130.4.2) icmp_seq=1 Destination Host Unreachable
From sdw1-2 (10.130.4.2) icmp_seq=2 Destination Host Unreachable
From sdw1-2 (10.130.4.2) icmp_seq=3 Destination Host Unreachable
From sdw1-2 (10.130.4.2) icmp_seq=4 Destination Host Unreachable

--- sdw1-2 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 2999ms
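You can also reproduce the failure outside of gprecoverseg by running ssh against the address column, with the same options gpsegstart.py is invoked with in the verbose log above (the remote `hostname` command is just a cheap connectivity test):

```shell
# Fails with "ssh_exchange_identification: Connection closed by remote host"
ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 sdw1-2 hostname

# Succeeds, because the hostname resolves on the reachable subnet
ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 sdw1 hostname
```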


To resolve this issue, fix the address recorded in gp_segment_configuration so that it resolves and routes correctly, and confirm the change with ssh. Once ssh works against the correct address, gprecoverseg should complete successfully as well.
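A hedged outline of the verification steps. Whether the right fix is to repair routing to the 10.130.4.x subnet or to repoint sdw1-2 at a reachable IP depends on your intended network layout; the IP below is illustrative only:

```shell
# 1. Correct name resolution for the failing address on every host,
#    e.g. an /etc/hosts entry such as:  10.130.2.2  sdw1-2  (illustrative)
getent hosts sdw1-2     # confirm the new resolution

# 2. Confirm passwordless ssh to the corrected address
ssh sdw1-2 hostname

# 3. Re-run the recovery
gprecoverseg -a -i ../rec.out
```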