The article describes the code path change between GPDDB5 and GPDB6 when doing
sort, in short, the change was to make the query result more reliable hence resulting in some slowness.
Symptoms
Sometimes we noticed that in the
EXPLAIN ANALYZE o/p the
sort operator tends to spend more time on GPDB6 compared with GPDB5, below is an example:
-----GP5
Gather Motion 72:1 (slice2; segments: 72) (cost=0.00..4764.37 rows=29720700 width=24)
Rows out: 42824779 rows at destination with 207479 ms to first row, 298392 ms to end.
-> Window (cost=0.00..3149.75 rows=412788 width=24)
Partition By: c_clm_no
Order By: transactionno
Rows out: Avg 594788.6 rows x 72 workers. Max 42697067 rows (seg58) with 219204 ms to first row, 292433 ms to end.
-> Sort (cost=0.00..3130.35 rows=412788 width=47)
Sort Key: c_clm_no, transactionno
Sort Method: quicksort Max Memory: 553KB Avg Memory: 553KB (71 segments)
Sort Method: external merge Disk: 1669088KB
Rows out: Avg 594788.6 rows x 72 workers. Max 42697067 rows (seg58) with 219204 ms to first row, 277895 ms to end.
Executor memory: 9330K bytes avg, 632510K bytes max (seg58).
Work_mem used: 9330K bytes avg, 632510K bytes max (seg58). Workfile: (1 spilling)
Work_mem wanted: 9266844K bytes avg, 9266844K bytes max (seg58) to lessen workfile I/O affecting 1 workers.
-> Redistribute Motion 72:72 (slice1; segments: 72) (cost=0.00..1078.22 rows=412788 width=47)
Hash Key: c_clm_no
Rows out: Avg 594788.6 rows x 72 workers at destination. Max 42697067 rows (seg58) with 58 ms to first row, 9905 ms to end.
-> Table Scan on in_p_inc_cache (cost=0.00..1017.50 rows=412788 width=47)
Filter: substr(transactionno::text, 15, 1) = '1'::text AND NOT c_app_no IS NULL
Rows out: Avg 594788.6 rows x 72 workers. Max 595967 rows (seg68) with 74 ms to first row, 729 ms to end.
Slice statistics:
(slice0) Executor memory: 516K bytes.
(slice1) Executor memory: 1887K bytes avg x 72 workers, 1902K bytes max (seg12).
(slice2) * Executor memory: 9646K bytes avg x 72 workers, 632881K bytes max (seg58). Work_mem: 632510K bytes max, 9266844K bytes wanted.
Statement statistics:
Memory used: 524288K bytes
Memory wanted: 18533986K bytes
Settings: enable_nestloop=on; join_collapse_limit=5
Optimizer status: PQO version 3.114.0
Total runtime: 303536.847 ms
-------GP6
Gather Motion 60:1 (slice2; segments: 60) (cost=0.00..5444.86 rows=30435602 width=24) (actual time=470154.366..790490.065 rows=42824779 loops=1)
-> WindowAgg (cost=0.00..3782.11 rows=507261 width=24) (actual time=713927.577..786483.115 rows=42697447 loops=1)
Partition By: c_clm_no
Order By: transactionno
-> Sort (cost=0.00..3758.27 rows=507261 width=47) (actual time=713927.560..777588.258 rows=42697447 loops=1)
Sort Key: c_clm_no, transactionno
Sort Method: quicksort Memory: 20355kB
Sort Method: external merge Disk: 1668960kB
-> Redistribute Motion 60:60 (slice1; segments: 60) (cost=0.00..1196.29 rows=507261 width=47) (actual time=4.421..5294.459 rows=42697447 loops=1)
Hash Key: c_clm_no
-> Seq Scan on in_p_inc_cache (cost=0.00..1121.67 rows=507261 width=47) (actual time=1.346..1741.309 rows=716100 loops=1)
Filter: ((substr((transactionno)::text, 15, 1) = '1'::text) AND (NOT (c_app_no IS NULL)))
Planning time: 9.651 ms
(slice0) Executor memory: 616K bytes.
(slice1) Executor memory: 700K bytes avg x 60 workers, 700K bytes max (seg0).
* (slice2) Executor memory: 29080K bytes avg x 60 workers, 1720296K bytes max (seg0). Work_mem: 1720193K bytes max, 4357287K bytes wanted.
Memory used: 3145728kB
Memory wanted: 8714873kB
Optimizer: Pivotal Optimizer (GPORCA)
Execution time: 794789.473 ms
where we can see sort spend 200000+ ms on GP5 however it can go up to 700000+ ms(more than 3 times) on GP6.
RCA
This is because of some code path changes in GP6,
- the code path of sort in GPDB5 is as follows (when lc_collate = en_US.utf8):
set gp_enable_mk_sort to on ;
code path: strxfrm tupsort_compare_datum-->tupsort_compare_char-->strcmp
set gp_enable_mk_sort to off;
code path: bttextcmp-text_cmp-->varstr_cmp->strcoll
Most of the time, using
strcoll is much slower than using
strcmp. and we
set gp_enable_mk_sort to on(which is the default setting); we will use strxfrm to transform string according to
LC_COLLATE specified. However,
strxfrm is not reliable in some cases, please refer to: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/MUplsaJHLg0/m/FpnxKC8rBgAJ
So in GPDB6 we use a macro
TRUST_STRXFRM_MK_SORT to disable
strxfrm, PR: https://github.com/greenplum-db/gpdb/pull/10502/files
- the code path of sort in GPDB6 now becomes like (when lc_collate = en_US.utf8):
set gp_enable_mk_sort to on;
code path: bttextcmp->text_cmp-->varstr_cmp-->strcoll
set gp_enable_mk_sort to off;
code path: bttextcmp->text_cmp->varstr_cmp-->strcoll