Greenplum Utilities Commands Fail with SSH Non-Zero Return Code

Article ID: 295878

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Symptoms:

When running a Greenplum utility (such as gpstart, gpstop, gprecoverseg, or gpexpand), the utility fails because one or more of the ssh/scp commands it issues returns a non-zero return code. It may be necessary to run the utility with the -v (verbose) option to identify the command that failed.

Example as seen in gprecoverseg:

gprecoverseg:sdw1:gpadmin-[DEBUG]:-[worker11] finished cmd: 
Get segment status cmdStr='ssh -o 'StrictHostKeyChecking no' sdw1 ". /greenplum/greenplum-db/./greenplum_path.sh; 
$GPHOME/bin/gp_primarymirror -h sdw1 -p 40003"'  
had result: cmd had rc=255 completed=True halted=False
  stdout=''
  stderr=''

Example as seen in gpexpand:

gpexpand:mdw:gpadmin-[ERROR]:-gpexpand failed. exiting...
Traceback (most recent call last):
   File "/greenplum/greenplum-db/./bin/gpexpand", line 3088, in <module>
     gp_expand.update_original_segments()
   File "/greenplum/greenplum-db/./bin/gpexpand", line 1543, in update_original_segments
     raise ExpansionError('Failed to configure original segments: %s' % msg)
 ExpansionError: Failed to configure original segments: ExecutionError: 'Error Executing Command: ' occured.  Details: 'GPSTART_INTERNAL_MASTER_ONLY=1 /usr/bin/scp -o 'StrictHostKeyChecking no' -r /data/master/gpexpand_DDMMYYYY/pg_hba.conf sdwX:/data/gpsegY'  cmd had rc=1 completed=True halted=False
   stdout=''
   stderr='lost connection

In both cases above, an ssh/scp command returned a non-zero return code. In the first case, an ssh command returned rc=255 with an empty stderr; in the second case, an scp command returned rc=1 with stderr='lost connection...'.
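To locate the failing command in verbose output or in a saved utility log, one option is to filter for non-zero rc= values. The sketch below is only an illustration: the function name failed_cmds is hypothetical, and the pattern assumes the "cmd had rc=N" wording shown in the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print log lines that report a non-zero return code.
# Assumes the "cmd had rc=N" wording shown in the examples above.
failed_cmds() {
    # Reads the named files, or standard input when no file is given.
    grep -E 'cmd had rc=[1-9][0-9]*' "$@"
}

# Usage (the log path is a placeholder, not a real default):
#   failed_cmds /path/to/saved_utility_output.log
```

Lines reporting rc=0 are skipped, so only the commands that need attention remain.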

 

Environment


Cause

An ssh command exits with the exit status of the remote command (0 if successful), or with 255 if an error occurred in the ssh session itself [1]. Similarly, scp returns a value greater than 0 if the operation was not successful [2].
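The exit-status rule above can be sketched as a small shell helper. The function name describe_ssh_rc is hypothetical. Note that 255 is ambiguous: it can also be the exit status of the remote command itself, so the helper reports both possibilities.

```shell
#!/usr/bin/env bash
# Hypothetical helper: interpret the exit status of an ssh invocation.
describe_ssh_rc() {
    case "$1" in
        0)   echo "ok" ;;
        # 255 normally means ssh itself failed, but a remote command
        # may also exit with 255, so the status alone is not conclusive.
        255) echo "ssh error or remote command exited 255" ;;
        *)   echo "remote command exited $1" ;;
    esac
}

# Usage (sdw1 is the example host name used earlier in this article):
#   ssh sdw1 true; describe_ssh_rc $?
```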

When these return codes appear, check /var/log/secure on the affected hosts for more detail about the cause. Even when /var/log/secure shows little evidence, in many cases the cause is the number of parallel ssh/scp sessions opened by the Greenplum utilities, combined with timeouts that are too short.
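As a starting point for the log check, the following sketch searches a log file for sshd lines that mention MaxStartups. The function name sshd_maxstartups_hits is hypothetical. The exact wording of sshd messages varies by OpenSSH version and log level, and some versions log nothing about dropped connections at the default level, so an empty result does not rule out connection throttling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show sshd log lines that mention MaxStartups.
# Defaults to /var/log/secure; pass another path as the first argument.
sshd_maxstartups_hits() {
    local log="${1:-/var/log/secure}"
    # Case-insensitive match; message wording is an assumption that
    # depends on the OpenSSH version and the configured LogLevel.
    grep -i 'sshd.*MaxStartups' "$log"
}

# Usage (reading /var/log/secure usually requires root):
#   sshd_maxstartups_hits
```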

 

Resolution

There are two ways to resolve this issue:

  1. (Recommended) Use a smaller value for the -B option of the specific utility to decrease the number of parallel processes it spawns. The default can be quite large (e.g. 60 in gpstate or 16 in gpexpand).

    gpexpand -B 8 -i input_file (This reduces the number of parallel processes to 8 in gpexpand)
    gpstate -B 30 (This reduces the number of parallel processes to 30 in gpstate)

  2. Modify the system ssh/scp settings to increase the maximum number of concurrent sessions. MaxStartups accepts the format XX:YY:ZZ, where XX is the number of unauthenticated connections at which sshd starts dropping new connections, YY is the percentage chance of dropping a connection once XX is reached, and ZZ is the number of unauthenticated connections at which every new connection is dropped [3].

    1. Take note of the current value of MaxStartups in /etc/ssh/sshd_config on all segment servers in the cluster.
    2. Increase the MaxStartups value in /etc/ssh/sshd_config on all segment servers in the cluster, following the format explained above.
    3. Restart the ssh daemon on all the servers.
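To get a feel for how a given XX:YY:ZZ value behaves, the sketch below estimates the percentage of new connections refused at a given number of unauthenticated connections. It relies on the behavior described in the sshd_config manual page, where the refusal probability rises linearly from YY at XX connections to 100 at ZZ connections. The function name maxstartups_drop_pct is hypothetical, integer arithmetic makes the result approximate, and only the three-field form is handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate refusal probability (percent) for a
# MaxStartups value "start:rate:full" at n unauthenticated connections.
# Only the three-field form is handled; integer division rounds down.
maxstartups_drop_pct() {
    local spec="$1" n="$2" start rate full
    IFS=: read -r start rate full <<< "$spec"
    if [ "$n" -lt "$start" ]; then
        echo 0          # below the threshold nothing is refused
    elif [ "$n" -ge "$full" ]; then
        echo 100        # at or above "full" everything is refused
    else
        # Linear increase from "rate" at "start" to 100 at "full".
        echo $(( rate + (100 - rate) * (n - start) / (full - start) ))
    fi
}

# Usage (10:30:100 is only an example value):
#   maxstartups_drop_pct 10:30:100 55
```

On each server, the current setting can be read with grep -i '^MaxStartups' /etc/ssh/sshd_config. On systemd-based hosts the daemon is typically restarted with systemctl restart sshd; the service name is an assumption and may differ by distribution.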