gprecoverseg fails with "Error occurred: Failed while trying to remove postmaster.pid."



Article ID: 296610


Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

If a segment host is down, gprecoverseg will fail with:
Error occurred: Failed while trying to remove postmaster.pid
This error occurs even when other segments on reachable hosts could still be recovered.

Extract from the gprecoverseg output in this scenario:
20201005:15:46:43:004172 gprecoverseg:gpdb10:gpadmin-[INFO]:-3 segment(s) to recover
20201005:15:46:43:004172 gprecoverseg:gpdb10:gpadmin-[INFO]:-Ensuring 3 failed segment(s) are stopped
20201005:15:46:46:004172 gprecoverseg:gpdb10:gpadmin-[WARNING]:-Unable to determine if /data6/mirror/gpseg0 is symlink. Assuming it is not symlink
20201005:15:46:52:004172 gprecoverseg:gpdb10:gpadmin-[WARNING]:-Unable to determine if /data6/primary/gpseg1 is symlink. Assuming it is not symlink
20201005:15:46:56:004172 gprecoverseg:gpdb10:gpadmin-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20201005:15:46:58:004172 gprecoverseg:gpdb10:gpadmin-[ERROR]:-ExecutionError: 'non-zero rc: 255' occurred.  Details: 'ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 gpdb12 ". /usr/local/greenplum-db-6.11.1/greenplum_path.sh; $GPHOME/sbin/gpoperation.py"'  cmd had rc=255 completed=True halted=False
  stdout=''
  stderr='ssh: connect to host gpdb12 port 22: No route to host
'
Traceback (most recent call last):
  File "/usr/local/greenplum-db-6.11.1/lib/python/gppylib/commands/base.py", line 278, in run
    self.cmd.run()
  File "/usr/local/greenplum-db-6.11.1/lib/python/gppylib/operations/__init__.py", line 53, in run
    self.ret = self.execute()
  File "/usr/local/greenplum-db-6.11.1/lib/python/gppylib/operations/utils.py", line 50, in execute
    cmd.run(validateAfter=True)
  File "/usr/local/greenplum-db-6.11.1/lib/python/gppylib/commands/base.py", line 561, in run
    self.validate()
  File "/usr/local/greenplum-db-6.11.1/lib/python/gppylib/commands/base.py", line 609, in validate
    raise ExecutionError("non-zero rc: %d" % self.results.rc, self)
ExecutionError: ExecutionError: 'non-zero rc: 255' occurred.  Details: 'ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 gpdb12 ". /usr/local/greenplum-db-6.11.1/greenplum_path.sh; $GPHOME/sbin/gpoperation.py"'  cmd had rc=255 completed=True halted=False
  stdout=''
  stderr='ssh: connect to host gpdb12 port 22: No route to host
'
20201005:15:46:58:004172 gprecoverseg:gpdb10:gpadmin-[WARNING]:-Unable to clean up shared memory for stopped segments on host (gpdb12)
20201005:15:46:58:004172 gprecoverseg:gpdb10:gpadmin-[INFO]:-Updating configuration with new mirrors
20201005:15:46:58:004172 gprecoverseg:gpdb10:gpadmin-[INFO]:-Updating mirrors
20201005:15:46:58:004172 gprecoverseg:gpdb10:gpadmin-[INFO]:-Running pg_rewind on required mirrors
20201005:15:47:01:004172 gprecoverseg:gpdb10:gpadmin-[CRITICAL]:-Error occurred: Failed while trying to remove postmaster.pid.
 Command was: 'ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 gpdb12 ". /usr/local/greenplum-db-6.11.1/greenplum_path.sh; rm -f /data6/primary/gpseg1/postmaster.pid"'
rc=255, stdout='', stderr='ssh: connect to host gpdb12 port 22: No route to host
'
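The "No route to host" messages above show that the segment host itself (gpdb12 in this example) is unreachable over SSH, not just that its segments are down. Before applying the workaround below, it can help to confirm this from the master host. A minimal check, assuming gpdb12 is the failed host as in the log above, run as gpadmin:

ping -c 3 gpdb12        # no reply suggests the host or its network is down
ssh gpdb12 hostname     # expect "ssh: connect to host gpdb12 port 22: No route to host"
gpstate -e              # lists segment pairs with status problems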


Environment

Product Version: 6.10

Resolution

Workaround

  1. Run "gprecoverseg -o /tmp/recover.txt" to create a file listing all failed segments.
  2. Edit /tmp/recover.txt with vi, or another editor, and remove every segment entry that is on the failed host.
  3. Run "gprecoverseg -i /tmp/recover.txt" to recover the remaining segments, which are not on the failed host. (A consolidated sketch of these steps follows this list.)
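
Put together, the workaround looks like the sketch below. It assumes the failed host is gpdb12, as in the log extract above, and uses /tmp/recover.txt as the config file path; adjust both for your environment and review the edited file before running recovery.

# 1. Generate a recovery config file listing all failed segments
gprecoverseg -o /tmp/recover.txt

# 2. Drop the entries that reference the failed host (gpdb12 here);
#    this can be done in vi, or with grep as shown
grep -v gpdb12 /tmp/recover.txt > /tmp/recover_filtered.txt

# 3. Recover only the remaining segments
gprecoverseg -i /tmp/recover_filtered.txt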

Fix

A ticket has been opened with R&D to improve gprecoverseg so that it recovers whichever segments can be recovered and skips those that cannot.