GPDB queries can be blocked by two main types of locks:
Explicit lock (database lock): blocks reads and/or writes within a transaction to satisfy ACID. Queries obtain locks on objects to keep the data in the database consistent. These locks can be inspected with pg_stat_activity and pg_locks.
Lightweight lock (LWLock): a lock taken in shared memory to maintain read/write consistency of internal data structures. In PostgreSQL 9.6 and later, LWLock waits can be tracked through pg_stat_activity. Since GPDB 6.x is based on PostgreSQL 9.4, it is much harder to track which query is blocking the current one.
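For the first category (explicit locks), a self-join on pg_locks can pair each waiting session with the session holding the conflicting lock. The query below is a hedged sketch using only columns present in stock PostgreSQL 9.4/GPDB 6.x; the join conditions are illustrative and may need tightening (for example on database OID) for your workload, and the file path is arbitrary:

```shell
# Save a blocked-vs-blocker query to a file for reuse with psql.
# (Illustrative path; adjust to taste.)
cat > /tmp/blocked_queries.sql <<'EOF'
-- Pair ungranted lock requests with the granted locks that conflict
-- with them on the same relation or transaction ID.
SELECT blocked.pid AS blocked_pid,
       blocker.pid AS blocker_pid,
       blocked.locktype,
       blocked.relation::regclass AS relation
FROM pg_locks blocked
JOIN pg_locks blocker
  ON  blocker.granted
  AND NOT blocked.granted
  AND blocker.pid <> blocked.pid
  AND blocker.locktype = blocked.locktype
  AND blocker.relation IS NOT DISTINCT FROM blocked.relation
  AND blocker.transactionid IS NOT DISTINCT FROM blocked.transactionid;
EOF
# psql -d db01 -f /tmp/blocked_queries.sql   # run against the database
```

Join each blocker_pid back to pg_stat_activity to see the query text holding the lock.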
In Greenplum, the situation that requires finding the blocking query behind lightweight locks usually involves queries that require many sub-transactions. Sub-transactions take many of these locks, which can slow down catalog queries and affect other queries as well.
For example, select count(*) from pg_class may take much longer than usual to complete. If that is the case, check for the following characteristics:
1.) If a query is taking a long time, check pg_stat_activity:
orange=# select * from pg_stat_activity where pid = 273321;
-[ RECORD 1 ]----+------------------------------------------
datid            | 16384
datname          | db01
pid              | 273321
sess_id          | 322694
usesysid         | 3307799
usename          | gpadmin
application_name |
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 45180
backend_start    | 2022-04-04 11:30:51.647246-04
xact_start       | 2022-04-04 11:33:01.854915-04
query_start      | 2022-04-04 11:33:01.86072-04
state_change     | 2022-04-04 11:33:01.86072-04
waiting          | f
state            | active
backend_xid      | 52657760
backend_xmin     | 52544047
query            | select xxxxxxxxxx
waiting_reason   |
rsgid            | 2637322
rsgname          | rg_customer_rpt
rsgqueueduration |
The key here is that the query is active and waiting = f, so it does not appear to be waiting on a (heavyweight) lock.
However, running strace against the PID (strace -p 273321) shows the query is stuck in the semop() system call, which means it is waiting on a semaphore, i.e. a lock:
[gpadmin@mdw ~]$ strace -p 273321
strace: Process 273321 attached
semop(2457692, [{14, -1, 0}], 1
Run gcore against the PID (gcore 273321) to generate a core file in the current directory.
Then run gdb on the core file to examine it.
[gpadmin@mdw ~]$ gdb `which postgres` core.273321
(gdb) thread 2
[Switching to thread 2 (LWP 273321)]
#0  0x00007fa41fe93307 in semop () from /data/logs/308196/vmware_04042022/packcore-core.273321/lib64/libc.so.6
(gdb) bt
#0  0x00007fa41fe93307 in semop () from /data/logs/308196/vmware_04042022/packcore-core.273321/lib64/libc.so.6
#1  0x00000000009f9b37 in PGSemaphoreLock (sema=sema@entry=0x7fa40e96edd0, interruptOK=interruptOK@entry=0 '\000') at pg_sema.c:422
#2  0x0000000000a74cbd in LWLockAcquireCommon (val=0, valptr=0x0, mode=mode@entry=LW_EXCLUSIVE, l=0x7fa3e4afd1a0) at lwlock.c:679
#3  LWLockAcquire (l=0x7fa3e4afd1a0, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:510
#4  0x0000000000733486 in SubTransSetParent (xid=52658969, parent=52657760, overwriteOK=overwriteOK@entry=0 '\000') at subtrans.c:137
#5  0x000000000072a9e9 in AssignTransactionId (s=s@entry=0x3c40848) at xact.c:633
#6  0x000000000072e60e in GetCurrentTransactionId () at xact.c:486
#7  0x00000000006f5343 in heap_update_internal (relation=relation@entry=0x7fa3e4a39168, otid=otid@entry=0x5c2427c, newtup=newtup@entry=0x5c24278, cid=<optimized out>, crosscheck=crosscheck@entry=0x0, wait=wait@entry=1 '\001', hufd=0x7ffc12561a90, lockmode=0x7ffc12561a8c, simple=1 '\001') at heapam.c:3530
#8  0x00000000006fd599 in simple_heap_update (relation=relation@entry=0x7fa3e4a39168, otid=otid@entry=0x5c2427c, tup=tup@entry=0x5c24278) at heapam.c:4581
#9  0x0000000000bcde17 in RelationSetNewRelfilenode (relation=relation@entry=0x7fa423273488, freezeXid=52544047, minmulti=minmulti@entry=1) at relcache.c:3305
#10 0x0000000000873c1d in ExecuteTruncate (stmt=stmt@entry=0x586dd68) at tablecmds.c:1831
#11 0x0000000000a8d51b in standard_ProcessUtility (parsetree=0x586dd68, queryString=0x586d318 "TRUNCATE TABLE mpo_ms_cust_sales_pre_stg_abu_sales", context=PROCESS_UTILITY_QUERY, params=0x0, dest=0x1605ce0 <spi_printtupDR>, completionTag=0x7ffc12561ff0 "") at utility.c:620
#12 0x00007fa415d3c65a in AdaptorProcessUtility (parsetree=0x586dd68, queryString=0x586d318 "TRUNCATE TABLE mpo_ms_cust_sales_pre_stg_abu_sales", context=PROCESS_UTILITY_QUERY, params=0x0, dest=0x1605ce0 <spi_printtupDR>, completionTag=0x7ffc12561ff0 "") at /tmp/build/65c75183/gpcc/extension/metrics_collector/gp6/metrics_collector_gp.c:420
#13 0x00007fa415d3c99d in EmitUtilityQueryInfo (parsetree=0x586dd68, queryString=0x586d318 "TRUNCATE TABLE mpo_ms_cust_sales_pre_stg_abu_sales", context=PROCESS_UTILITY_QUERY, params=0x0, dest=0x1605ce0 <spi_printtupDR>, completionTag=0x7ffc12561ff0 "") at /tmp/build/65c75183/gpcc/extension/metrics_collector/gp6/metrics_collector_gp.c:507
#14 0x00000000008f84ea in _SPI_execute_plan (plan=plan@entry=0x586d8b8, paramLI=paramLI@entry=0x0, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=0) at spi.c:2319
#15 0x00000000008f8ec8 in SPI_execute_plan_with_paramlist (plan=0x586d8b8, params=params@entry=0x0, read_only=<optimized out>, tcount=0) at spi.c:499
#16 0x00007fa3e41df2dd in exec_stmt_execsql (estate=estate@entry=0x7ffc12562420, stmt=stmt@entry=0x8ae2228) at pl_exec.c:3334
#17 0x00007fa3e41e1a9b in exec_stmt (stmt=0x8ae2228, estate=0x7ffc12562420) at pl_exec.c:1467
#18 exec_stmts (estate=0x7ffc12562420, stmts=<optimized out>) at pl_exec.c:1366
#19 0x00007fa3e41e41d9 in exec_stmt_block (estate=estate@entry=0x7ffc12562420, block=0x5b31ac8) at pl_exec.c:1165
#20 0x00007fa3e41e4a9d in plpgsql_exec_function (func=func@entry=0xc27deb8, fcinfo=fcinfo@entry=0x5a6ce10, simple_eval_estate=simple_eval_estate@entry=0x0) at pl_exec.c:348
#21 0x00007fa3e41d6911 in plpgsql_call_handler (fcinfo=0x5a6ce10) at pl_handler.c:253
#22 0x00000000008c4a3e in ExecMakeFunctionResultNoSets (isDone=<optimized out>, isNull=0x5a6d750 "", econtext=0x42a13f0, fcache=0x5a6cda0) at execQual.c:2161
#23 ExecEvalFunc (fcache=0x5a6cda0, econtext=0x42a13f0, isNull=0x5a6d750 "", isDone=<optimized out>) at execQual.c:2584
#24 0x00000000008ca06d in ExecTargetList (isDone=0x7ffc12562778, itemIsDone=0x5a6d7d8, isnull=0x5a6d750 "", values=0x5a6c8a8, econtext=0x42a13f0, targetlist=0x5a6d7a8) at execQual.c:6337
#25 ExecProject (projInfo=<optimized out>, isDone=isDone@entry=0x7ffc12562778) at execQual.c:6552
#26 0x00000000008ea63d in ExecResult (node=node@entry=0x5a6c8c0) at nodeResult.c:218
#27 0x00000000008bcbf8 in ExecProcNode (node=node@entry=0x5a6c8c0) at execProcnode.c:970
#28 0x00000000008ee603 in ExecSetParamPlan (node=node@entry=0x5a6ef40, econtext=0x42a13f0, queryDesc=queryDesc@entry=0x425a888) at nodeSubplan.c:1065
#29 0x0000000000cd501c in preprocess_initplans (queryDesc=queryDesc@entry=0x425a888) at cdbsubplan.c:161
#30 0x00000000008b7327 in standard_ExecutorStart (queryDesc=0x425a888, eflags=0) at execMain.c:701
#31 0x00000000008f64eb in _SPI_pquery (queryDesc=queryDesc@entry=0x425a888, tcount=tcount@entry=0, fire_triggers=1 '\001') at spi.c:2618
#32 0x00000000008f87b3 in _SPI_execute_plan (plan=plan@entry=0x75548d8, paramLI=paramLI@entry=0x0, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=0) at spi.c:2311
#33 0x00000000008f8ec8 in SPI_execute_plan_with_paramlist (plan=0x75548d8, params=params@entry=0x0, read_only=<optimized out>, tcount=0) at spi.c:499
#34 0x00007fa3e41df2dd in exec_stmt_execsql (estate=estate@entry=0x7ffc12563730, stmt=stmt@entry=0x416e1b8) at pl_exec.c:3334
#35 0x00007fa3e41e1a9b in exec_stmt (stmt=0x416e1b8, estate=0x7ffc12563730) at pl_exec.c:1467
#36 exec_stmts (estate=0x7ffc12563730, stmts=<optimized out>) at pl_exec.c:1366
#37 0x00007fa3e41e4717 in exec_stmt_block (estate=estate@entry=0x7ffc12563730, block=block@entry=0x416e258) at pl_exec.c:1304
#38 0x00007fa3e41e219b in exec_stmt (stmt=0x416e258, estate=0x7ffc12563730) at pl_exec.c:1399
#39 exec_stmts (estate=0x7ffc12563730, stmts=<optimized out>) at pl_exec.c:1366
#40 0x00007fa3e41e33ba in exec_stmt_if (stmt=0x416e2f8, estate=0x7ffc12563730) at pl_exec.c:1664
#41 exec_stmt (stmt=0x416e2f8, estate=0x7ffc12563730) at pl_exec.c:1415
#42 exec_stmts (estate=0x7ffc12563730, stmts=<optimized out>) at pl_exec.c:1366
#43 0x00007fa3e41e4717 in exec_stmt_block (estate=estate@entry=0x7ffc12563730, block=0x3b19a00) at pl_exec.c:1304
#44 0x00007fa3e41e4a9d in plpgsql_exec_function (func=func@entry=0x3c0d6c8, fcinfo=fcinfo@entry=0x5b00ee0, simple_eval_estate=simple_eval_estate@entry=0x0) at pl_exec.c:348
#45 0x00007fa3e41d6911 in plpgsql_call_handler (fcinfo=0x5b00ee0) at pl_handler.c:253
#46 0x00000000008c4a3e in ExecMakeFunctionResultNoSets (isDone=<optimized out>, isNull=0x5b019d8 " ", econtext=0x5b00c00, fcache=0x5b00e70) at execQual.c:2161
#47 ExecEvalFunc (fcache=0x5b00e70, econtext=0x5b00c00, isNull=0x5b019d8 " ", isDone=<optimized out>) at execQual.c:2584
#48 0x00000000008ca06d in ExecTargetList (isDone=0x7ffc12563a88, itemIsDone=0x5b01af0, isnull=0x5b019d8 " ", values=0x5b019c0, econtext=0x5b00c00, targetlist=0x5b01ac0) at execQual.c:6337
#49 ExecProject (projInfo=<optimized out>, isDone=isDone@entry=0x7ffc12563a88) at execQual.c:6552
#50 0x00000000008ea63d in ExecResult (node=node@entry=0x5b007a0) at nodeResult.c:218
#51 0x00000000008bcbf8 in ExecProcNode (node=node@entry=0x5b007a0) at execProcnode.c:970
#52 0x00000000008b40c9 in ExecutePlan (estate=estate@entry=0x5b004d8, planstate=0x5b007a0, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=1 '\001', numberTuples=numberTuples@entry=200, direction=direction@entry=ForwardScanDirection, dest=0x3af4588) at execMain.c:2990
#53 0x00000000008b4dbc in ExecutePlan (dest=0x3af4588, direction=ForwardScanDirection, numberTuples=200, sendTuples=1 '\001', operation=CMD_SELECT, planstate=<optimized out>, estate=0x5b004d8) at execMain.c:2961
#54 standard_ExecutorRun (queryDesc=0x3e74398, direction=ForwardScanDirection, count=200) at execMain.c:1003
#55 0x0000000000a8a0a7 in PortalRunSelect (portal=0x3c26e48, forward=<optimized out>, count=200, dest=<optimized out>) at pquery.c:1151
#56 0x0000000000a8c091 in PortalRun (portal=portal@entry=0x3c26e48, count=count@entry=200, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x3af4588, altdest=altdest@entry=0x3af4588, completionTag=completionTag@entry=0x7ffc12564020 "") at pquery.c:992
#57 0x0000000000a8895c in exec_execute_message (max_rows=200, portal_name=0x3af4128 "") at postgres.c:2766
#58 PostgresMain (argc=<optimized out>, argv=argv@entry=0x3ad3bd8, dbname=<optimized out>, username=<optimized out>) at postgres.c:5525
#59 0x00000000006ade52 in BackendRun (port=0x3b054c0) at postmaster.c:4815
#60 BackendStartup (port=0x3b054c0) at postmaster.c:4472
#61 ServerLoop () at postmaster.c:1948
#62 0x0000000000a0e3a2 in PostmasterMain (argc=argc@entry=6, argv=argv@entry=0x3ad1af0) at postmaster.c:1518
#63 0x00000000006b1e31 in main (argc=6, argv=0x3ad1af0) at main.c:245
You'll see the following entries:
#2  0x0000000000a74cbd in LWLockAcquireCommon (val=0, valptr=0x0, mode=mode@entry=LW_EXCLUSIVE, l=0x7fa3e4afd1a0) at lwlock.c:679
#3  LWLockAcquire (l=0x7fa3e4afd1a0, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:510
#4  0x0000000000733486 in SubTransSetParent (xid=52658969, parent=52657760, overwriteOK=overwriteOK@entry=0 '\000') at subtrans.c:137
A pstack will also show a similar entry for LWLockAcquire.
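Because pstack does not stop the process for long, the same check can be scripted to scan several suspect backends at once. This is a hedged sketch assuming pstack is installed (on RHEL it ships with gdb); the script path and example PID are illustrative:

```shell
# Write a small helper that flags PIDs whose stack contains LWLockAcquire.
# (Illustrative path; adjust to taste.)
cat > /tmp/check_lwlock_waiters.sh <<'EOF'
#!/bin/sh
# Usage: check_lwlock_waiters.sh <pid> [<pid> ...]
for pid in "$@"; do
    if pstack "$pid" 2>/dev/null | grep -q 'LWLockAcquire'; then
        echo "PID $pid is waiting on an LWLock"
    fi
done
EOF
chmod +x /tmp/check_lwlock_waiters.sh
# /tmp/check_lwlock_waiters.sh 273321   # example PID from above
```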
Greenplum Database 6.x
To find the query that is holding these LWLocks, use the following procedure:
NOTE: Core Files must be enabled for this to work.
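A quick pre-check, sketched for a typical Linux host: gcore is part of gdb and needs ptrace permission on the target PID, while the ulimit and core_pattern settings govern whether cores can land on disk.

```shell
# Core-file size limit for the current shell; "unlimited" (or a
# nonzero value) is what you want to see.
ulimit -c
# Where the kernel writes core files, if configured.
cat /proc/sys/kernel/core_pattern 2>/dev/null || true
# gcore ships with gdb, so gdb must be installed.
command -v gdb >/dev/null && echo "gdb present" || echo "gdb not found"
```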
6.x:
1.) gcore <PID>
2.) gdb `which postgres` core.<PID>
3.) Find the right thread using the thread <number> command.
[Current thread is 1 (LWP 291022)]
(gdb) thread 2
[Switching to thread 2 (LWP 291019)]
#0  0x00007fa41fe93307 in semop () from /data/logs/308196/vmware_04042022/packcore-core.291019/lib64/libc.so.6
4.) Find the pid using this gdb function:
(gdb) set $ipx = 0
(gdb) while ($ipx < ProcGlobal->allProcCount)
 >  if (ProcGlobal->allPgXact[$ipx]->overflowed)
 >    print ProcGlobal->allProcs[$ipx]->subxids
 >    print ProcGlobal->allProcs[$ipx]->pid
 >    print $ipx
 >  end
 >  set $ipx = $ipx + 1
 >end
$1 = {xids = {52653923, 52653946, 52653962, 52653984, 52654060, 52654115, 52654157, 52654300, 52654345, 52654356, 52654419, 52654491, 52654529, 52654541, 52654618, 52654652, 52654733, 52654762, 52654803, 52654970, 52655084, 52655129, 52655276, 52655413, 52655733, 52655784, 52655833, 52655839, 52655862, 52655960, 52655961, 52655985, 52656054, 52656091, 52656115, 52656141, 52656233, 52656280, 52656287, 52656368, 52656426, 52656502, 52656517, 52656542, 52656614, 52656667, 52656688, 52656760, 52656777, 52656887, 52657015, 52657029, 52657059, 52657197, 52657213, 52657340, 52657423, 52657432, 52657537, 52657960, 52657972, 52658036, 52658167, 52658231}}
$2 = 249963
$3 = 896
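The loop above can also be saved as a gdb command file and replayed non-interactively against the core; the file name and the -batch invocation shown in the comment are illustrative:

```shell
# Save the 6.x sub-transaction-overflow scan as a gdb command file.
# (Illustrative path; adjust to taste.)
cat > /tmp/find_overflowed_6x.gdb <<'EOF'
# 6.x: walk the shared PGPROC array and report backends whose
# sub-transaction cache has overflowed.
set pagination off
set $ipx = 0
while ($ipx < ProcGlobal->allProcCount)
  if (ProcGlobal->allPgXact[$ipx]->overflowed)
    print ProcGlobal->allProcs[$ipx]->subxids
    print ProcGlobal->allProcs[$ipx]->pid
    print $ipx
  end
  set $ipx = $ipx + 1
end
EOF
# gdb -batch -x /tmp/find_overflowed_6x.gdb `which postgres` core.<PID>
```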
In this case the PID is 249963. Check it against pg_stat_activity; cancelling that query will usually alleviate the performance issue.
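Once the offending PID has been confirmed in pg_stat_activity, it can be cancelled with the standard pg_cancel_backend() function, escalating to pg_terminate_backend() only if the cancel is ignored. The PID below is the one from the example and the file path is arbitrary:

```shell
# Save a cancel statement for the offending backend.
cat > /tmp/cancel_offender.sql <<'EOF'
-- Try a polite cancel first.
SELECT pg_cancel_backend(249963);
-- Escalate only if the cancel is ignored:
-- SELECT pg_terminate_backend(249963);
EOF
# psql -d db01 -f /tmp/cancel_offender.sql
```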
5.x
Everything is the same as in 6.x, except step 4 above, which walks procArray instead of ProcGlobal:
set $ipx = 0
while ($ipx < procArray->numProcs)
  if (procArray->procs[$ipx]->subxids.overflowed)
    print procArray->procs[$ipx]->subxids
    print procArray->procs[$ipx]->pid
    print $ipx
  end
  set $ipx = $ipx + 1
end
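This 5.x loop can likewise be kept as a gdb command file for non-interactive use against a core; the file name is illustrative:

```shell
# Save the 5.x variant of the scan; only the structure names differ
# (procArray instead of ProcGlobal). Illustrative path.
cat > /tmp/find_overflowed_5x.gdb <<'EOF'
# 5.x: walk procArray and report backends whose sub-transaction
# cache has overflowed.
set pagination off
set $ipx = 0
while ($ipx < procArray->numProcs)
  if (procArray->procs[$ipx]->subxids.overflowed)
    print procArray->procs[$ipx]->subxids
    print procArray->procs[$ipx]->pid
    print $ipx
  end
  set $ipx = $ipx + 1
end
EOF
# gdb -batch -x /tmp/find_overflowed_5x.gdb `which postgres` core.<PID>
```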