Resource queue statements hanging below active query limit in Pivotal Greenplum

Article ID: 296104

Products

VMware Tanzu Greenplum

Issue/Introduction

Symptoms:
Consider a resource queue with an active_statements limit of 5. In this example, statements are queueing even though the queue is not constrained by memory, yet the resource queue (RQ) reports only 3 running statements.
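For reference, a queue with this limit could have been defined and assigned along the following lines (a minimal sketch; the role assignment shown here is illustrative and not taken from the affected system):

CREATE RESOURCE QUEUE rq_5_low WITH (ACTIVE_STATEMENTS=5);  -- allow at most 5 concurrently running statements
ALTER ROLE gpadmin RESOURCE QUEUE rq_5_low;                 -- statements from this role are governed by rq_5_low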

1. Determine which resource queues have waiting statements. In this case, rq_5_low has 6 waiting statements and 5 holders. 
select rsqname, rsqcountlimit, rsqcountvalue, rsqholders, rsqwaiters from gp_toolkit.gp_resqueue_status where rsqwaiters<>0;

 rsqname  | rsqcountlimit | rsqcountvalue | rsqholders | rsqwaiters
----------+---------------+---------------+------------+------------
rq_5_low  |             5 |             5 |          5 |          6
2. Check the status of the resource queues with waiting statements. In this example, it is reporting 6 statements waiting as expected, but only 3 statements running.
select resqname,resqlast,resqstatus, resqtotal from gp_toolkit.gp_resq_activity_by_queue 
where resqname='rq_5_low';

 resqname | resqlast                      | resqstatus | resqtotal
----------+-------------------------------+------------+-----------
 rq_5_low | 2019-03-04 09:58:25.604617-05 | running    |         3
 rq_5_low | 2019-03-04 09:58:33.731977-05 | waiting    |         6
3. Check pg_stat_activity for more context on these statements. Again, 6 statements report waiting='t', but only 2 report waiting='f'.

Note: The 2 reporting waiting='f' are '<IDLE> in transaction'.
SELECT
p.query_start,
p.datname,
p.procpid,
p.sess_id,
p.usename,
rq.rsqname,substr(p.current_query,1,50),
p.waiting,current_timestamp-p.query_start as duration
from pg_stat_activity p 
inner join pg_roles r on r.rolname=p.usename 
inner join pg_resqueue rq on rq.oid=r.rolresqueue and rq.rsqname='rq_5_low' 
order by waiting;
 
          query_start          | datname | procpid | sess_id | usename | rsqname  |                       substr                        | waiting |    duration
-------------------------------+---------+---------+---------+---------+----------+-----------------------------------------------------+---------+-----------------
 2019-03-04 09:58:09.682187-05 | testdb  |   72857 | 5867939 | gpadmin | rq_5_low | <IDLE> in transaction                               | f       | 02:52:35.54226
 2019-03-04 09:58:25.604617-05 | testdb  |   72646 | 5867934 | gpadmin | rq_5_low | <IDLE> in transaction                               | f       | 02:52:19.61983
 2019-03-04 09:58:24.991907-05 | testdb  |   72645 | 5867933 | gpadmin | rq_5_low | SELECT * FROM set_config('application_name', $1, F | t       | 02:52:20.23254
 2019-03-04 09:58:24.675358-05 | testdb  |   72853 | 5867938 | gpadmin | rq_5_low | SELECT * FROM set_config('application_name', $1, F | t       | 02:52:20.549089
 2019-03-04 09:58:25.163259-05 | testdb  |  216044 | 5870436 | gpadmin | rq_5_low | select outer_as.target_table, outer_as.target_colu | t       | 02:52:20.061188
 2019-03-04 09:58:33.731977-05 | testdb  |  220457 | 5870510 | gpadmin | rq_5_low | SELECT * FROM set_config('application_name', $1, F | t       | 02:52:11.49247
 2019-03-04 09:58:25.547321-05 | testdb  |   72848 | 5867937 | gpadmin | rq_5_low | SELECT * from temp_dimension_result where delivery | t       | 02:52:19.677126
 2019-03-04 09:58:20.075098-05 | testdb  |   43277 | 5867125 | gpadmin | rq_5_low | SELECT * FROM set_config('application_name', $1, F | t       | 02:52:25.149349
4. Compare the pg_stat_activity output to gp_toolkit.gp_locks_on_resqueue. You will see 3 locks granted and 6 distinct PIDs waiting. Observe that PID 72848 reports as both granted and waiting: the RQ locks are inconsistent.
select * from gp_toolkit.gp_locks_on_resqueue 
where lorrsqname ='rq_5_low';

 lorusename |    lorrsqname    |  lorlocktype   |  lorobjid  | lortransaction | lorpid |    lormode    | lorgranted | lorwaiting
------------+------------------+----------------+------------+----------------+--------+---------------+------------+------------
 gpadmin    | rq_5_low | resource queue | 1414248080 |      893618520 |  72646 | ExclusiveLock | t          | f
 gpadmin    | rq_5_low | resource queue | 1414248080 |      893626203 |  72853 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low | resource queue | 1414248080 |      893626211 | 216044 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low | resource queue | 1414248080 |      893618508 |  72848 | ExclusiveLock | t          | t
 gpadmin    | rq_5_low | resource queue | 1414248080 |      893618508 |  72848 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low | resource queue | 1414248080 |      893626298 | 220457 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low | resource queue | 1414248080 |      893626119 |  43277 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low | resource queue | 1414248080 |      893626210 |  72645 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low | resource queue | 1414248080 |      893618505 |  72857 | ExclusiveLock | t          | f
5. Check the ps output for the processes associated with rq_5_low.
psql -Atc "select lorpid from gp_toolkit.gp_locks_on_resqueue \
where lorrsqname ='rq_5_low';" \
| while read pid; do ps -ef | grep -v grep | grep $pid; done
gpadmin   72857 436754  2  80   0 517608 1253084 sk_wai 09:52 ?     00:07:21 postgres:  con5867939  cmd590 idle in transaction        
gpadmin   43277 436754  0  80   0 248360 407400 sys_se 09:51 ?      00:00:44 postgres:  con5867125  cmd959 con5867125 10.81.27.17(34406) cmd958 BIND queuing
gpadmin   72853 436754  0  80   0 267556 406930 sys_se 09:52 ?      00:00:14 postgres:  con5867938  cmd105 con5867938 10.205.35.162(16530) cmd104 BIND queuing
gpadmin   72645 436754  0  80   0 461340 406082 sys_se 09:52 ?      00:00:14 postgres:  con5867933  cmd109 con5867933 10.205.35.162(16512) cmd108 BIND queuing
gpadmin  216044 436754  0  80   0 121744 234473 sys_se 09:58 ?      00:00:01 postgres:  con5870436  cmd2 con5870436 10.205.35.162(16604) cmd1 BIND queuing
gpadmin   72848 436754  1  80   0 488400 1253083 sys_se 09:52 ?     00:04:27 postgres:  con5867937  cmd557 con5867937 10.205.35.162(16528) cmd556 BIND queuing
gpadmin   72848 436754  1  80   0 488400 1253083 sys_se 09:52 ?     00:04:27 postgres:  con5867937  cmd557 con5867937 10.205.35.162(16528) cmd556 BIND queuing
gpadmin   72646 436754  2  80   0 699072 1253087 sk_wai 09:52 ?     00:07:18 postgres:  con5867934  cmd557 idle in transaction        
gpadmin  220457 436754  0  80   0 18260 150606 sys_se 09:58 ?       00:00:00 postgres:  con5870510  cmd1 con5870510 10.205.35.162(16610) BIND queuing
Internally, this resource queue is holding 5 locks, but only 3 granted locks are reported in pg_locks. There is a discrepancy between the RQ locks actually held and the RQ active statement counter, so new statements continue to queue.
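The mismatch can also be surfaced directly from the catalog views used above. The following is a hedged sketch (not part of the original diagnosis): it compares each queue's counter with the number of granted locks visible through gp_toolkit.gp_locks_on_resqueue; in the hung state shown here it would report rsqcountvalue = 5 against 3 granted locks.

-- Sketch: flag queues whose active statement counter does not match the number
-- of granted resource queue locks reported through gp_locks_on_resqueue.
SELECT s.rsqname,
       s.rsqcountvalue,
       s.rsqholders,
       sum(CASE WHEN l.lorgranted THEN 1 ELSE 0 END) AS granted_locks
FROM gp_toolkit.gp_resqueue_status s
LEFT JOIN gp_toolkit.gp_locks_on_resqueue l ON l.lorrsqname = s.rsqname
GROUP BY s.rsqname, s.rsqcountvalue, s.rsqholders
HAVING s.rsqcountvalue <> sum(CASE WHEN l.lorgranted THEN 1 ELSE 0 END);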

This can be confirmed by debugging one of the waiting processes.
(gdb) bt
#0  0x00002b58768de207 in semop ()
   from /data/logs/199292/analyze_session_con5867933_2019-03-04_15-29-02/analyze_session_con5867933_2019-03-04_15-29-02/analyze_session_idb150.howard.ms.com_con5867933_2019-03-04_15-29-02/tmp/analyze_session_idb150.howard.ms.com_con5867933_2019-03-04_15-29-02/process-collector-2019-03-04_15-29-23/packcore-core.72645/lib64/libc.so.6
#1  0x00000000008de934 in PGSemaphoreLock (sema=0x2b588a16f9e0, interruptOK=1 '\001') at pg_sema.c:437
#2  0x0000000000977d88 in ResProcSleep (lockmode=0, locallock=<optimized out>, incrementSet=<optimized out>) at proc.c:1814
#3  0x0000000000b5645d in ResWaitOnLock (incrementSet=<optimized out>, owner=<optimized out>, locallock=<optimized out>) at resqueue.c:1120
(gdb) f 2
(gdb) info locals
lock = 0x2b5886b1d368
proclock = 0x57e40
partitionLock = 228
selflock = <optimized out>
__func__ = "ResProcSleep"
(gdb) p *lock
$1 = {tag = {locktag_field1 = 1414248080, locktag_field2 = 0, locktag_field3 = 0, locktag_field4 = 0, locktag_type = 8 '\b', locktag_lockmethodid = 3 '\003'}, grantMask = 128, waitMask = 128,
  procLocks = {prev = 159054472, next = 159838120}, waitProcs = {links = {prev = 202822480, next = 202824544}, size = 6}, requested = {0, 0, 0, 0, 0, 0, 0, 11, 0, 0}, nRequested = 11,
  granted = {0, 0, 0, 0, 0, 0, 0, 5, 0, 0}, nGranted = 5}

Note that the in-memory lock reports nGranted = 5 and a wait queue of size 6, matching the queue's rsqholders/rsqwaiters counters rather than the 3 granted locks visible in pg_locks.

Environment

Pivotal Greenplum Database (GPDB) 4.3.x and 5.x

Cause

In certain situations, the RQ's rsqholders counter does not decrease after a statement is killed with pg_terminate_backend().

This is a known issue with the resource queue counter: in certain cases the counter is not reliably maintained. This affects Pivotal Greenplum Database (GPDB) 4.3.x and 5.x.
Note: Resource groups handle these counters correctly.

Resolution

Workaround

  1. Cancel RQ statements that are '<IDLE> in transaction' and holding locks (a query sketch for locating these sessions follows this list). Refer to this KB article for additional information: Query "<IDLE> in transaction" Uses a Resource Queue Slot. This can free a slot for a waiting statement and get the RQ processing again.
  2. The counters are tracked in shared memory; therefore, restarting GPDB will reset them and clear the issue.
Note: There will be a discrepancy in the resource queue counter until GPDB is restarted.
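
A minimal sketch for workaround step 1, assuming the affected queue is rq_5_low as in the example above: it lists sessions that hold a granted lock on the queue while sitting '<IDLE> in transaction', so they can be reviewed before being terminated.

-- Sketch: locate '<IDLE> in transaction' sessions holding a granted lock on the
-- affected resource queue. Review each session before terminating it.
SELECT DISTINCT p.procpid, p.sess_id, p.usename, p.query_start
FROM pg_stat_activity p
JOIN gp_toolkit.gp_locks_on_resqueue l ON l.lorpid = p.procpid
WHERE l.lorrsqname = 'rq_5_low'
  AND l.lorgranted
  AND p.current_query = '<IDLE> in transaction';

-- For each procpid confirmed safe to kill:
-- SELECT pg_terminate_backend(<procpid>);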


Permanent Fix

Use resource groups instead of resource queues for resource management in GPDB to avoid this issue.
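
As a rough illustration only (the group name and limits below are placeholders, not sizing recommendations), moving a role from a resource queue to a resource group under GPDB 5.x looks like this once resource groups are enabled (typically by setting the gp_resource_manager server parameter to 'group', for example via gpconfig, and restarting the cluster):

-- Placeholder example: a resource group allowing 5 concurrent statements,
-- then reassigning the role that previously used the resource queue.
CREATE RESOURCE GROUP rg_5_low WITH (CONCURRENCY=5, CPU_RATE_LIMIT=10, MEMORY_LIMIT=10);
ALTER ROLE gpadmin RESOURCE GROUP rg_5_low;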