Full Recovery hangs on the function gp_remove_segment or gp_add_segment because of locks in Tanzu Greenplum
Article ID: 296574


Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

When you run a full recovery (gprecoverseg -F), it hangs at the following stage:
20190523:18:53:52:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-1 segment(s) to recover 
20190523:18:53:52:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Ensuring 1 failed segment(s) are stopped 
20190523:18:53:53:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments 
20190523:18:53:58:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Cleaning files from 1 segment(s) . 
20190523:18:53:59:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Building template directory 
20190523:18:54:01:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Validating remote directories ...... 
20190523:18:54:07:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Copying template directory file . 
20190523:18:54:08:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Configuring new segments ...... 
20190523:18:54:14:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Cleaning files . 
20190523:18:54:15:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Starting file move procedure for multi-1:/data/primary/gp_4.3.30.3_201905212054371:content=1:dbid=3:mode=s:status=d updating flat files 
20190523:18:54:15:005066 gprecoverseg:multi-m:gpadmin-[INFO]:-Updating configuration with new mirrors

Checking the gprecoverseg backend in pg_stat_activity shows that it is waiting (waiting = t):
(gpadmin@[local])[template1]> select * from pg_stat_activity where procpid=232324;
 datid |  datname  | procpid | sess_id | usesysid | usename |               current_query               | waiting |          query_start          |         backend_start         | client_addr | client_port | application_name |          xact_start           | waiting_reason | rsgid | rsgname | rsgqueueduration 
-------+-----------+---------+---------+----------+---------+-------------------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------+------------------+-------------------------------+----------------+-------+---------+------------------
     1 | template1 |  232324 |     133 |       10 | gpadmin | SELECT gp_remove_segment_mirror(45::int2) | t       | 2019-05-18 14:59:54.560121-04 | 2019-05-18 14:58:58.666301-04 | 127.0.0.1   |       38246 |                  | 2019-05-18 14:58:58.725615-04 |                |     0 | unknown | 
(1 row)


Note: Confirm that the waiting column shows "t" for the gprecoverseg session.

Environment

Product Version: 5.28

Resolution

The reason this session is waiting can be found in the pg_locks view.

In this case, the gp_segment_configuration table is locked by a gpbackup process that was already running when the full recovery started:
gpadmin=# select * from pg_locks where relation=5036;
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |      mode       | granted | mppsessionid | mppiswriter | gp_segment_id 
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+-----------------+---------+--------------+-------------+---------------
 relation |        0 |     5036 |      |       |            |               |         |       |          | 3/71               | 4309 | AccessShareLock | t       |           24 | t           |            -1
(1 row)
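
Relation OID 5036 corresponds to gp_segment_configuration. If you want to confirm the OID in your own cluster and see what the lock holder is running, a query along these lines should work (the pid 4309 here is taken from the example output above; substitute the pid reported by your pg_locks query):

gpadmin=# select 'gp_segment_configuration'::regclass::oid;
gpadmin=# select procpid, usename, current_query from pg_stat_activity where procpid = 4309;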

pg_locks also shows that gprecoverseg is waiting on the same relation (granted = f):

Note: The locktype may vary.
gpadmin=# select * from pg_locks where relation=5036;
   locktype    | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |        mode         | granted | mppsessionid | mppiswriter | gp_segment_id 
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------------+---------+--------------+-------------+---------------
 resynchronize |        0 |     5036 |      |       |            |               |         |       |          | 5/0                | 20927 | AccessExclusiveLock | f       |            0 | f           |             0
(1 row)
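
As a general check, listing all ungranted lock requests will surface any session stuck behind another, without needing to know the relation OID up front (a sketch; the locktype and mode returned may vary):

gpadmin=# select locktype, relation, pid, mode, granted from pg_locks where granted = 'f';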

To resolve the issue, either let the backup complete or cancel it so that it releases its lock on gp_segment_configuration. Once the lock is released, gprecoverseg -F will proceed successfully, since it can then acquire the lock it needs on gp_segment_configuration.
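
If you choose to cancel the backup from within the database, you can cancel the holding backend by pid (4309 in the example above; substitute the pid from your pg_locks output). pg_cancel_backend interrupts the current query; pg_terminate_backend, where available in your release, ends the whole session:

gpadmin=# select pg_cancel_backend(4309);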