When running a Greenplum utility (such as gpstart, gpstop, gprecoverseg, or gpexpand), the tool fails because one or more of the ssh/scp commands it issues returns a non-zero code. It may be necessary to run the utility with the -v (verbose) option to identify the command that failed.
Example as seen in gprecoverseg:
gprecoverseg:sdw1:gpadmin-[DEBUG]:-[worker11] finished cmd: Get segment status cmdStr='ssh -o 'StrictHostKeyChecking no' sdw1 ". /greenplum/greenplum-db/./greenplum_path.sh; $GPHOME/bin/gp_primarymirror -h sdw1 -p 40003"' had result: cmd had rc=255 completed=True halted=False stdout='' stderr=''
Example as seen in gpexpand:
gpexpand:mdw:gpadmin-[ERROR]:-gpexpand failed. exiting... Traceback (most recent call last): File "/greenplum/greenplum-db/./bin/gpexpand", line 3088, in <module> gp_expand.update_original_segments() File "/greenplum/greenplum-db/./bin/gpexpand", line 1543, in update_original_segments raise ExpansionError('Failed to configure original segments: %s' % msg) ExpansionError: Failed to configure original segments: ExecutionError: 'Error Executing Command: ' occured. Details: 'GPSTART_INTERNAL_MASTER_ONLY=1 /usr/bin/scp -o 'StrictHostKeyChecking no' -r /data/master/gpexpand_DDMMYYYY/pg_hba.conf sdwX:/data/gpsegY' cmd had rc=1 completed=True halted=False stdout='' stderr='lost connection
In both cases above, an ssh or scp command returned a non-zero code. In the first case, an ssh command returned rc=255 with an empty stderr; in the second case, an scp command returned rc=1 with stderr='lost connection...'.
An ssh command exits with the exit status of the remote command (0 if successful), or with 255 if an error occurred in the ssh session itself [1]. Similarly, scp returns a code greater than 0 if the operation was not successful [2].
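The rules above can be sketched as a small shell helper. This is only an illustration: classify_rc is a hypothetical name, not part of Greenplum or OpenSSH, and it assumes the caller already knows which tool (ssh or scp) produced the code. Note that a remote command that itself exits with 255 is indistinguishable from an ssh session error.

```shell
# Hypothetical helper: map a tool name and its return code to a short diagnosis.
classify_rc() {
    local tool="$1" rc="$2"
    if [ "$rc" -eq 0 ]; then
        echo "ok"
    elif [ "$tool" = "ssh" ] && [ "$rc" -eq 255 ]; then
        # 255 is what ssh itself returns on a connection/session error
        echo "ssh session error"
    elif [ "$tool" = "ssh" ]; then
        # any other non-zero code is the remote command's own exit status
        echo "remote command failed"
    else
        # scp reports any failure with a code greater than 0
        echo "scp failed"
    fi
}
```

For example, running "ssh sdw1 true; classify_rc ssh $?" prints "ssh session error" when the connection to sdw1 could not be established.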
When these return codes appear, check /var/log/secure for more detail on what caused the problem. Even when /var/log/secure shows little evidence, the cause is in many cases the number of parallel ssh/scp sessions opened by the Greenplum tools, combined with timeouts that are too short.
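One way to narrow the search in /var/log/secure is to keep only the sshd entries logged around the time of the failure. The sketch below assumes the traditional syslog line format ("Mon DD HH:MM:SS host sshd[pid]: message"); sshd_events_at is a hypothetical helper name, and reading /var/log/secure usually requires root privileges.

```shell
# Hypothetical helper: print sshd lines from a log file whose timestamp
# starts with the given prefix (for example "Jan  5 10:15").
sshd_events_at() {
    local logfile="$1" prefix="$2"
    # index(...) == 1 means the line begins with the prefix;
    # /sshd\[/ keeps only entries written by the sshd daemon
    awk -v p="$prefix" 'index($0, p) == 1 && /sshd\[/' "$logfile"
}
```

For example, "sshd_events_at /var/log/secure 'Jan  5 10:15'" lists the sshd entries for that minute, which can then be compared with the timestamp of the failed command in the utility's verbose output.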
There are two ways to resolve this issue:
(Recommended) Pass a smaller -B value to the utility to reduce the number of parallel processes it spawns. The default can be quite large (e.g. 60 in gpstate or 16 in gpexpand).
gpexpand -B 8 -i input_file (This reduces the number of parallel processes to 8 in gpexpand)
gpstate -B 30 (This reduces the number of parallel processes to 30 in gpstate)
Modify the system sshd settings to increase the maximum number of concurrent unauthenticated sessions, using the MaxStartups option in sshd_config. MaxStartups can take the form XX:YY:ZZ, where XX is the number of unauthenticated connections at which sshd starts dropping new connections, YY is the percentage chance of dropping a connection once XX is reached, and ZZ is the number of connections at which every new connection is dropped [3].
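To get a feel for how a given XX:YY:ZZ value behaves, the drop chance can be estimated with a short shell function. This is a sketch under assumptions: it assumes the drop chance rises linearly from YY at XX connections to 100 at ZZ connections, as the sshd_config manual page describes, and it uses integer arithmetic. maxstartups_drop_percent is a hypothetical name, the value 10:30:100 is only an example, and the single-number form of MaxStartups is not handled.

```shell
# Hypothetical helper: estimate the chance (in percent) that sshd drops a new
# connection when n unauthenticated connections are already open.
# Usage: maxstartups_drop_percent XX:YY:ZZ n
maxstartups_drop_percent() {
    local spec="$1" n="$2"
    local start rate full rest
    start=${spec%%:*}      # XX: connections at which dropping begins
    rest=${spec#*:}
    rate=${rest%%:*}       # YY: drop chance (percent) at XX connections
    full=${spec##*:}       # ZZ: connections at which everything is dropped
    if [ "$n" -lt "$start" ]; then
        echo 0
    elif [ "$n" -ge "$full" ]; then
        echo 100
    else
        # assumed linear increase from YY percent at XX to 100 percent at ZZ
        echo $(( rate + (100 - rate) * (n - start) / (full - start) ))
    fi
}
```

With the example value 10:30:100, "maxstartups_drop_percent 10:30:100 55" prints 65, which suggests why a utility that opens 60 parallel sessions can see many of them refused. After changing MaxStartups in sshd_config, sshd must reload its configuration before the new value takes effect.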