1. Check the resource queue status. In this example, the resource queue rq_5_low has 6 waiting statements and 5 holders:

select rsqname, rsqcountlimit, rsqcountvalue, rsqholders, rsqwaiters
from gp_toolkit.gp_resqueue_status
where rsqwaiters <> 0;

 rsqname  | rsqcountlimit | rsqcountvalue | rsqholders | rsqwaiters
----------+---------------+---------------+------------+------------
 rq_5_low |             5 |             5 |          5 |          6

2. Check the activity on the resource queues with waiting statements. In this example, the queue reports 6 statements waiting, as expected, but only 3 statements running:
select resqname, resqlast, resqstatus, resqtotal
from gp_toolkit.gp_resq_activity_by_queue
where resqname = 'rq_5_low';

 resqname | resqlast                      | resqstatus | resqtotal
----------+-------------------------------+------------+-----------
 rq_5_low | 2019-03-04 09:58:25.604617-05 | running    |         3
 rq_5_low | 2019-03-04 09:58:33.731977-05 | waiting    |         6
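To see the individual statements behind these per-queue totals, the per-statement companion view can also be queried. A minimal sketch, assuming the gp_toolkit.gp_resq_activity view and its documented columns (resqprocpid, resqstart, resqstatus, resqname):

-- List each statement in rq_5_low with its backend PID, start time, and state.
select resqprocpid, resqstart, resqstatus
from gp_toolkit.gp_resq_activity
where resqname = 'rq_5_low'
order by resqstatus, resqstart;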
3. Check pg_stat_activity for more context on these statements. Again, 6 statements are waiting='t', but only 2 are waiting='f', and the 2 with waiting='f' are '<IDLE> in transaction':
SELECT p.query_start, p.datname, p.procpid, p.sess_id, p.usename, rq.rsqname,
       substr(p.current_query,1,50), p.waiting,
       current_timestamp - p.query_start AS duration
FROM pg_stat_activity p
INNER JOIN pg_roles r ON r.rolname = p.usename
INNER JOIN pg_resqueue rq ON rq.oid = r.rolresqueue AND rq.rsqname = 'rq_5_low'
ORDER BY waiting;

 query_start                   | datname | procpid | sess_id | usename | rsqname  | substr                                             | waiting | duration
-------------------------------+---------+---------+---------+---------+----------+----------------------------------------------------+---------+-----------------
 2019-03-04 09:58:09.682187-05 | testdb  |   72857 | 5867939 | gpadmin | rq_5_low | <IDLE> in transaction                              | f       | 02:52:35.54226
 2019-03-04 09:58:25.604617-05 | testdb  |   72646 | 5867934 | gpadmin | rq_5_low | <IDLE> in transaction                              | f       | 02:52:19.61983
 2019-03-04 09:58:24.991907-05 | testdb  |   72645 | 5867933 | gpadmin | rq_5_low | SELECT * FROM set_config('application_name', $1, F | t       | 02:52:20.23254
 2019-03-04 09:58:24.675358-05 | testdb  |   72853 | 5867938 | gpadmin | rq_5_low | SELECT * FROM set_config('application_name', $1, F | t       | 02:52:20.549089
 2019-03-04 09:58:25.163259-05 | testdb  |  216044 | 5870436 | gpadmin | rq_5_low | select outer_as.target_table, outer_as.target_colu | t       | 02:52:20.061188
 2019-03-04 09:58:33.731977-05 | testdb  |  220457 | 5870510 | gpadmin | rq_5_low | SELECT * FROM set_config('application_name', $1, F | t       | 02:52:11.49247
 2019-03-04 09:58:25.547321-05 | testdb  |   72848 | 5867937 | gpadmin | rq_5_low | SELECT * from temp_dimension_result where delivery | t       | 02:52:19.677126
 2019-03-04 09:58:20.075098-05 | testdb  |   43277 | 5867125 | gpadmin | rq_5_low | SELECT * FROM set_config('application_name', $1, F | t       | 02:52:25.149349
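The two waiting='f' sessions are idle in transaction yet still occupy queue slots. A minimal sketch to single them out, assuming GPDB 5's pg_stat_activity column names (procpid, current_query):

-- Find sessions that hold a queue slot while sitting idle in a transaction.
select procpid, sess_id, current_timestamp - query_start as idle_for
from pg_stat_activity
where current_query = '<IDLE> in transaction';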
4. Compare the pg_stat_activity output to gp_toolkit.gp_locks_on_resqueue. You will see 3 locks granted and 6 waiting. Observe that PID 72848 is reported as both granted and waiting; the resource queue locks are inconsistent:
select * from gp_toolkit.gp_locks_on_resqueue where lorrsqname = 'rq_5_low';

 lorusename | lorrsqname | lorlocktype    | lorobjid   | lortransaction | lorpid | lormode       | lorgranted | lorwaiting
------------+------------+----------------+------------+----------------+--------+---------------+------------+------------
 gpadmin    | rq_5_low   | resource queue | 1414248080 |      893618520 |  72646 | ExclusiveLock | t          | f
 gpadmin    | rq_5_low   | resource queue | 1414248080 |      893626203 |  72853 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low   | resource queue | 1414248080 |      893626211 | 216044 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low   | resource queue | 1414248080 |      893618508 |  72848 | ExclusiveLock | t          | t
 gpadmin    | rq_5_low   | resource queue | 1414248080 |      893618508 |  72848 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low   | resource queue | 1414248080 |      893626298 | 220457 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low   | resource queue | 1414248080 |      893626119 |  43277 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low   | resource queue | 1414248080 |      893626210 |  72645 | ExclusiveLock | f          | t
 gpadmin    | rq_5_low   | resource queue | 1414248080 |      893618505 |  72857 | ExclusiveLock | t          | f
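A backend should never be reported as both holding and waiting on the same resource queue lock. As a sketch, such duplicates (like PID 72848 above) can be flagged directly from the view:

-- Flag PIDs that appear more than once for the same queue.
select lorpid,
       count(*) as lock_rows,
       bool_or(lorgranted) as any_granted,
       bool_or(lorwaiting) as any_waiting
from gp_toolkit.gp_locks_on_resqueue
where lorrsqname = 'rq_5_low'
group by lorpid
having count(*) > 1;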
5. Check the ps output for the processes associated with rq_5_low:
psql -Atc "select lorpid from gp_toolkit.gp_locks_on_resqueue \
  where lorrsqname ='rq_5_low';" \
  | while read pid; do ps -ef | grep -v grep | grep $pid; done
gpadmin  72857 436754  2 80 0 517608 1253084 sk_wai 09:52 ? 00:07:21 postgres: con5867939 cmd590 idle in transaction
gpadmin  43277 436754  0 80 0 248360  407400 sys_se 09:51 ? 00:00:44 postgres: con5867125 cmd959 con5867125 10.81.27.17(34406) cmd958 BIND queuing
gpadmin  72853 436754  0 80 0 267556  406930 sys_se 09:52 ? 00:00:14 postgres: con5867938 cmd105 con5867938 10.205.35.162(16530) cmd104 BIND queuing
gpadmin  72645 436754  0 80 0 461340  406082 sys_se 09:52 ? 00:00:14 postgres: con5867933 cmd109 con5867933 10.205.35.162(16512) cmd108 BIND queuing
gpadmin 216044 436754  0 80 0 121744  234473 sys_se 09:58 ? 00:00:01 postgres: con5870436 cmd2 con5870436 10.205.35.162(16604) cmd1 BIND queuing
gpadmin  72848 436754  1 80 0 488400 1253083 sys_se 09:52 ? 00:04:27 postgres: con5867937 cmd557 con5867937 10.205.35.162(16528) cmd556 BIND queuing
gpadmin  72848 436754  1 80 0 488400 1253083 sys_se 09:52 ? 00:04:27 postgres: con5867937 cmd557 con5867937 10.205.35.162(16528) cmd556 BIND queuing
gpadmin  72646 436754  2 80 0 699072 1253087 sk_wai 09:52 ? 00:07:18 postgres: con5867934 cmd557 idle in transaction
gpadmin 220457 436754  0 80 0  18260  150606 sys_se 09:58 ? 00:00:00 postgres: con5870510 cmd1 con5870510 10.205.35.162(16610) BIND queuing

(PID 72848 appears twice because it has two lock rows in gp_locks_on_resqueue.) This resource queue is actually holding 5 locks, but only 3 are reported in pg_locks. There is a discrepancy between the counter for resource queue locks held and the queue's active statements, and new statements keep queueing.
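gp_locks_on_resqueue is a gp_toolkit view built over pg_locks, so the granted count can be cross-checked against the catalog directly. A minimal sketch, using the queue's lock objid (1414248080) from the output above:

-- Count granted vs. waiting resource queue locks straight from pg_locks.
-- 'resource queue' is the locktype Greenplum uses for queue slots.
select granted, count(*)
from pg_locks
where locktype = 'resource queue'
  and objid = 1414248080
group by granted;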
A gdb backtrace from a core of one of the queued backends (PID 72645) shows it sleeping in ResProcSleep, waiting on the resource queue lock:

(gdb) bt
#0  0x00002b58768de207 in semop () from /data/logs/199292/analyze_session_con5867933_2019-03-04_15-29-02/analyze_session_con5867933_2019-03-04_15-29-02/analyze_session_idb150.howard.ms.com_con5867933_2019-03-04_15-29-02/tmp/analyze_session_idb150.howard.ms.com_con5867933_2019-03-04_15-29-02/process-collector-2019-03-04_15-29-23/packcore-core.72645/lib64/libc.so.6
#1  0x00000000008de934 in PGSemaphoreLock (sema=0x2b588a16f9e0, interruptOK=1 '\001') at pg_sema.c:437
#2  0x0000000000977d88 in ResProcSleep (lockmode=0, locallock=<optimized out>, incrementSet=<optimized out>) at proc.c:1814
#3  0x0000000000b5645d in ResWaitOnLock (incrementSet=<optimized out>, owner=<optimized out>, locallock=<optimized out>) at resqueue.c:1120
(gdb) f 2
(gdb) info locals
lock = 0x2b5886b1d368
proclock = 0x57e40
partitionLock = 228
selflock = <optimized out>
__func__ = "ResProcSleep"
(gdb) p *lock
$1 = {tag = {locktag_field1 = 1414248080, locktag_field2 = 0, locktag_field3 = 0, locktag_field4 = 0,
      locktag_type = 8 '\b', locktag_lockmethodid = 3 '\003'}, grantMask = 128, waitMask = 128,
  procLocks = {prev = 159054472, next = 159838120}, waitProcs = {links = {prev = 202822480,
      next = 202824544}, size = 6}, requested = {0, 0, 0, 0, 0, 0, 0, 11, 0, 0}, nRequested = 11,
  granted = {0, 0, 0, 0, 0, 0, 0, 5, 0, 0}, nGranted = 5}

The in-memory LOCK structure confirms the inconsistency: nGranted = 5 (matching rsqholders = 5 from step 1), while pg_locks reports only 3 granted entries, and 6 processes sit on the wait queue (waitProcs size = 6).
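A hedged consistency check that compares the queue's holder counter with what pg_locks exposes, per queue, might look like the following sketch (assuming the 'resource queue' locktype and a name join through the pg_resqueue catalog):

-- Compare the queue's holder counter with the granted locks visible in pg_locks.
-- When the two counts disagree (here 5 vs. 3), the queue is in the
-- inconsistent state described above.
select q.rsqname,
       s.rsqholders,
       sum(case when l.granted then 1 else 0 end) as granted_in_pg_locks
from pg_resqueue q
join gp_toolkit.gp_resqueue_status s on s.rsqname = q.rsqname
left join pg_locks l
       on l.locktype = 'resource queue' and l.objid = q.oid
group by q.rsqname, s.rsqholders;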
To recover, terminate the sessions holding the queue with pg_terminate_backend(). To avoid this issue, use resource groups instead of resource queues for resource management in GPDB.
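As a sketch of that workaround, the '<IDLE> in transaction' holders found in step 3 can be fed into pg_terminate_backend(); the column names procpid and current_query are as in GPDB 5's pg_stat_activity, and you should verify which sessions are safe to kill before running this:

-- Terminate the idle-in-transaction sessions pinned to rq_5_low.
select p.procpid, pg_terminate_backend(p.procpid)
from pg_stat_activity p
join pg_roles r on r.rolname = p.usename
join pg_resqueue rq on rq.oid = r.rolresqueue
where rq.rsqname = 'rq_5_low'
  and p.current_query = '<IDLE> in transaction';

Switching the cluster from resource queues to resource groups is controlled by the gp_resource_manager server configuration parameter.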