Steps to check when the gprecoverseg utility is terminated on the master host.
Scenario:
Consider a scenario where you are running an incremental recovery (gprecoverseg) or a full recovery (gprecoverseg -F) and you lose the connection to your master host. If you are not running the utility in nohup mode (a sketch is shown below), the gprecoverseg process may be terminated and you will not see it running on the master the next time you log in (verify with: ps -ef | grep gprecoverseg).
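If you want to run the recovery in nohup mode, a minimal sketch is shown below. The log file path is only an example; the -a option suppresses the confirmation prompt, and -F can be added for a full recovery.

```
# Start an incremental recovery in the background so a dropped SSH session does not terminate it.
# -a answers the confirmation prompt automatically; add -F for a full recovery.
nohup gprecoverseg -a > /home/gpadmin/gpAdminLogs/gprecoverseg_nohup.out 2>&1 &

# Follow the progress from the log file.
tail -f /home/gpadmin/gpAdminLogs/gprecoverseg_nohup.out
```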
Confirm that gprecoverseg is not running on the master host. A terminated run can leave behind orphaned sessions (pg_basebackup or pg_rewind) on the segments, which should be checked for before starting another run of gprecoverseg.
[gpadmin@gpdbnew-m gpAdminLogs]$ ps -ef | grep gprecoverseg
gpadmin 25014 24966 0 16:57 pts/1 00:00:00 grep --color=auto gprecoverseg
In this case, gpssh to all the nodes and confirm the following:
[gpadmin@gpdbnew-m ~]$ gpssh -f hostfile
=> ps -ef | grep pg_basebackup
[gpdbnew-1] gpadmin 16297 16263 0 17:00 pts/1 00:00:00 grep --color=auto pg_basebackup
[gpdbnew-2] gpadmin 11844 11819 0 17:00 pts/0 00:00:00 grep --color=auto pg_basebackup
[gpdbnew-m] gpadmin 25263 25229 0 17:00 pts/5 00:00:00 grep --color=auto pg_basebackup
=> ps -ef | grep pg_rewind
[gpdbnew-1] gpadmin 16306 16263 0 17:00 pts/1 00:00:00 grep --color=auto pg_rewind
[gpdbnew-2] gpadmin 11846 11819 0 17:00 pts/0 00:00:00 grep --color=auto pg_rewind
[gpdbnew-m] gpadmin 25272 25229 0 17:00 pts/5 00:00:00 grep --color=auto pg_rewind
Check whether there are any pg_basebackup or pg_rewind processes running, specifically on the nodes where the segments to be recovered are located. The nodes participating in the recovery can be identified by querying the "gp_segment_configuration" table, as shown below.
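For example, a query along the following lines (run from the master; the columns shown are from the Greenplum 6 gp_segment_configuration catalog) lists the segments currently marked down and the hosts they live on, which are the nodes to check for orphaned sessions:

```
# List the segments currently marked down ('d'); their hostnames are the nodes
# to inspect for orphaned pg_basebackup / pg_rewind sessions.
psql -d postgres -c "SELECT dbid, content, role, preferred_role, status, hostname, port, datadir
                     FROM gp_segment_configuration
                     WHERE status = 'd';"
```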
**Note**
If no orphaned pg_basebackup or pg_rewind sessions are running on those nodes (as in the example shown above), it is safe to rerun the gprecoverseg utility.
If there are orphaned pg_basebackup or pg_rewind sessions running on these nodes, you will see something like the following:
[gpadmin@gpdbnew-1 ~]$ ps -ef | grep pg_basebackup
gpadmin 15425 15424 0 16:51 ? 00:00:00 bash -c . /usr/local/6.19.4/greenplum-db-6.19.4/greenplum_path.sh; /usr/local/6.19.4/greenplum-db-6.19.4/bin/lib/gpconfigurenewsegment -c "/data/primary/gp_6.19.4_202204041858361:30007:false:false:3:1:gpdbnew-2:35007:/home/gpadmin/gpAdminLogs/pg_basebackup.20220422_165117.dbid3.out" -l /home/gpadmin/gpAdminLogs -n -b 64 --validation-only --force-overwrite
gpadmin 15439 15425 0 16:51 ? 00:00:00 python /usr/local/6.19.4/greenplum-db-6.19.4/bin/lib/gpconfigurenewsegment -c /data/primary/gp_6.19.4_202204041858361:30007:false:false:3:1:gpdbnew-2:35007:/home/gpadmin/gpAdminLogs/pg_basebackup.20220422_165117.dbid3.out -l /home/gpadmin/gpAdminLogs -n -b 64 --validation-only --force-overwrite
The above is an example of an orphaned pg_basebackup session that was started by the gprecoverseg utility during a full recovery.
Steps to take when there are orphaned 'pg_basebackup' or 'pg_rewind' sessions on the segment nodes:
Terminate the above sessions by issuing a 'kill' on their PIDs. Make sure that these sessions correspond to the segments for which the full or incremental recovery was triggered and whose gprecoverseg run was terminated on the master.
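As an illustration, using the PIDs from the example ps output shown earlier (the wrapper and its child process on gpdbnew-1), the commands would look roughly like this; always substitute the PIDs reported on your own segment host:

```
# On the segment host (gpdbnew-1 in the earlier example): terminate the orphaned
# gpconfigurenewsegment wrapper and its child process.
kill 15425 15439

# Only if the processes remain after a short wait, escalate:
# kill -9 15425 15439
```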
Once the 'pg_basebackup' or 'pg_rewind' sessions have been terminated, run "ps -ef | grep pg_basebackup" or "ps -ef | grep pg_rewind" again to confirm that they are gone from the segment hosts.
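One way to confirm this across all hosts in a single pass is to repeat the gpssh check, for example:

```
# Re-check every host; only the grep processes themselves should remain.
gpssh -f hostfile 'ps -ef | egrep "pg_basebackup|pg_rewind" | grep -v egrep'
```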
If they are gone, ssh to the master and run a full recovery for these segments (gprecoverseg -F). Note that even if the terminated run was an incremental recovery and pg_rewind sessions had to be killed manually on the segments, you still need to run a full recovery for these segments.
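A sketch of this final step, run from the master, is shown below. The log file path is only an example; running under nohup again protects the new attempt from another dropped connection, and gpstate -e can be used afterwards to check the status of the segments.

```
# Rerun the recovery as a full recovery, again under nohup so another lost
# connection cannot terminate it.
nohup gprecoverseg -a -F > /home/gpadmin/gpAdminLogs/gprecoverseg_full.out 2>&1 &

# After it completes, review segment status.
gpstate -e
```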