sort operator behaves more slower on GPDB6 than GPDB5

Products

VMware Tanzu Greenplum

Issue/Introduction

The article describes the code path change between GPDDB5 and GPDB6 when doing sort, in short, the change was to make the query result more reliable hence resulting in some slowness.

Symptoms

Sometimes we noticed that in the EXPLAIN ANALYZE o/p the sort operator tends to spend more time on GPDB6 compared with GPDB5, below is an example:

-----GP5
Gather Motion 72:1  (slice2; segments: 72)  (cost=0.00..4764.37 rows=29720700 width=24)
  Rows out:  42824779 rows at destination with 207479 ms to first row, 298392 ms to end.
  ->  Window  (cost=0.00..3149.75 rows=412788 width=24)
        Partition By: c_clm_no
        Order By: transactionno
        Rows out:  Avg 594788.6 rows x 72 workers.  Max 42697067 rows (seg58) with 219204 ms to first row, 292433 ms to end.
        ->  Sort  (cost=0.00..3130.35 rows=412788 width=47)
              Sort Key: c_clm_no, transactionno
              Sort Method:  quicksort  Max Memory: 553KB  Avg Memory: 553KB (71 segments)
              Sort Method:  external merge  Disk: 1669088KB
              Rows out:  Avg 594788.6 rows x 72 workers.  Max 42697067 rows (seg58) with 219204 ms to first row, 277895 ms to end.
              Executor memory:  9330K bytes avg, 632510K bytes max (seg58).
              Work_mem used:  9330K bytes avg, 632510K bytes max (seg58). Workfile: (1 spilling)
              Work_mem wanted: 9266844K bytes avg, 9266844K bytes max (seg58) to lessen workfile I/O affecting 1 workers.
              ->  Redistribute Motion 72:72  (slice1; segments: 72)  (cost=0.00..1078.22 rows=412788 width=47)
                    Hash Key: c_clm_no
                    Rows out:  Avg 594788.6 rows x 72 workers at destination.  Max 42697067 rows (seg58) with 58 ms to first row, 9905 ms to end.
                    ->  Table Scan on in_p_inc_cache  (cost=0.00..1017.50 rows=412788 width=47)
                          Filter: substr(transactionno::text, 15, 1) = '1'::text AND NOT c_app_no IS NULL
                          Rows out:  Avg 594788.6 rows x 72 workers.  Max 595967 rows (seg68) with 74 ms to first row, 729 ms to end.
Slice statistics:
  (slice0)    Executor memory: 516K bytes.
  (slice1)    Executor memory: 1887K bytes avg x 72 workers, 1902K bytes max (seg12).
  (slice2)  * Executor memory: 9646K bytes avg x 72 workers, 632881K bytes max (seg58).  Work_mem: 632510K bytes max, 9266844K bytes wanted.
Statement statistics:
  Memory used: 524288K bytes
  Memory wanted: 18533986K bytes
Settings:  enable_nestloop=on; join_collapse_limit=5
Optimizer status: PQO version 3.114.0
Total runtime: 303536.847 ms

-------GP6 
Gather Motion 60:1  (slice2; segments: 60)  (cost=0.00..5444.86 rows=30435602 width=24) (actual time=470154.366..790490.065 rows=42824779 loops=1)
  ->  WindowAgg  (cost=0.00..3782.11 rows=507261 width=24) (actual time=713927.577..786483.115 rows=42697447 loops=1)
        Partition By: c_clm_no
        Order By: transactionno
        ->  Sort  (cost=0.00..3758.27 rows=507261 width=47) (actual time=713927.560..777588.258 rows=42697447 loops=1)
              Sort Key: c_clm_no, transactionno
              Sort Method:  quicksort  Memory: 20355kB
              Sort Method:  external merge  Disk: 1668960kB
              ->  Redistribute Motion 60:60  (slice1; segments: 60)  (cost=0.00..1196.29 rows=507261 width=47) (actual time=4.421..5294.459 rows=42697447 loops=1)
                    Hash Key: c_clm_no
                    ->  Seq Scan on in_p_inc_cache  (cost=0.00..1121.67 rows=507261 width=47) (actual time=1.346..1741.309 rows=716100 loops=1)
                          Filter: ((substr((transactionno)::text, 15, 1) = '1'::text) AND (NOT (c_app_no IS NULL)))
Planning time: 9.651 ms
  (slice0)    Executor memory: 616K bytes.
  (slice1)    Executor memory: 700K bytes avg x 60 workers, 700K bytes max (seg0).
* (slice2)    Executor memory: 29080K bytes avg x 60 workers, 1720296K bytes max (seg0).  Work_mem: 1720193K bytes max, 4357287K bytes wanted.
Memory used:  3145728kB
Memory wanted:  8714873kB
Optimizer: Pivotal Optimizer (GPORCA)
Execution time: 794789.473 ms

where we can see sort spend 200000+ ms on GP5 however it can go up to 700000+ ms(more than 3 times) on GP6.

RCA

This is because of some code path changes in GP6,

the code path of sort in GPDB5 is as follows (when lc_collate = en_US.utf8):

set gp_enable_mk_sort to on ;
code path: strxfrm tupsort_compare_datum-->tupsort_compare_char-->strcmp

set gp_enable_mk_sort to off;
code path: bttextcmp-text_cmp-->varstr_cmp->strcoll

Most of the time, using strcoll is much slower than using strcmp. and we set gp_enable_mk_sort to on(which is the default setting); we will use strxfrm to transform string according to LC_COLLATE specified. However, strxfrm is not reliable in some cases, please refer to: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/MUplsaJHLg0/m/FpnxKC8rBgAJ

So in GPDB6 we use a macro TRUST_STRXFRM_MK_SORT to disable strxfrm, PR: https://github.com/greenplum-db/gpdb/pull/10502/files

the code path of sort in GPDB6 now becomes like (when lc_collate = en_US.utf8):

set gp_enable_mk_sort to on;
code path: bttextcmp->text_cmp-->varstr_cmp-->strcoll

set gp_enable_mk_sort to off;
code path: bttextcmp->text_cmp->varstr_cmp-->strcoll

Environment

Product Version: 6.23

Resolution

Workaround

The slowness is expected to some aspects, however, note that sometimes you can expedite the sort if you are certain that your sort order should behave according to some other locale, for example 'C', that way you can set lc_collate to C, and the code path on GP6 would then change as below:

set gp_enable_mk_sort to on;
code path: bttextcmp->text_cmp-->varstr_cmp-->memcmp

set gp_enable_mk_sort to off;
code path: bttextcmp->text_cmp->varstr_cmp-->memcmp