When executing a query that joins two tables on columns of different data types, there is a noticeable performance lag compared to joining columns of the same data type. The following example demonstrates this with a join between a bigint column and a regular integer column:
create table sales (id integer, lead_id integer, rand_num integer) distributed by (id);
create table leads (lead_id integer, rand_num integer) distributed by (lead_id);
testdb=# \d sales
Table "public.sales"
Column | Type | Modifiers
----------+---------+-----------
id | integer |
lead_id | integer |
rand_num | integer |
Distributed by: (id)
testdb=# \d leads
Table "public.leads"
Column | Type | Modifiers
----------+---------+-----------
lead_id | integer |
rand_num | integer |
Distributed by: (lead_id)
Insert some test data into the tables:
insert into sales
select generate_series(1,10000), (random()*2000)::int, (random()*200)::int
from generate_series(1,1) as x(n);
insert into leads
select generate_series(1,10000), (random()*2000)::int
from generate_series(1,1) as x(n);
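Note that the tables are not analyzed in this example, which is why the missing-statistics notice appears in the next query's output. If you want the planner to work from collected statistics, run analyze first:
analyze sales;
analyze leads;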
With the tables joined on lead_id and no datatype mismatch, the execution time is about the same whether the optimizer is on or off:
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
NOTICE: One or more columns in the following table(s) do not have statistics: sales
HINT: For non-partitioned tables, run analyze <table_name>(<column_list>). For partitioned tables, run analyze rootpartition <table_name>(<column_list>). See log for columns missing statistics.
id | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
31 | 666 | 141 | 666 | 1976
942 | 1052 | 52 | 1052 | 1941
974 | 933 | 43 | 933 | 128
1027 | 1654 | 27 | 1654 | 225
Time: 542.267 ms
testdb=# set optimizer=off;
SET
Time: 142.155 ms
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
id | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
324 | 1213 | 82 | 1213 | 403
569 | 980 | 168 | 980 | 357
681 | 687 | 6 | 687 | 1699
926 | 804 | 34 | 804 | 312
1091 | 1635 | 143 | 1635 | 1984
Time: 178.780 ms
However, when you change lead_id in the sales table to a bigint, the execution time for the same query increases dramatically with the optimizer off:
testdb=# alter table sales alter lead_id type bigint;
ALTER TABLE
testdb=# set optimizer=on;
SET
Time: 5.063 ms
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
id | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
3591 | 16 | 179 | 16 | 1590
3024 | 32 | 4 | 32 | 514
3639 | 32 | 91 | 32 | 514
7217 | 32 | 146 | 32 | 514
4578 | 64 | 198 | 64 | 1680
Time: 201.281 ms
testdb=# set optimizer=off;
SET
Time: 5.063 ms
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
id | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
2907 | 2 | 193 | 2 | 1830
1182 | 34 | 38 | 34 | 1781
8477 | 34 | 158 | 34 | 1781
7238 | 50 | 103 | 50 | 463
5764 | 130 | 40 | 130 | 1057
Time: 2106.507 ms
As you can see, with the optimizer off a simple join between columns of different data types is about 10x slower, even in this test case of only 10,000 rows.
When you run an explain on the select query, you can see that the legacy query optimizer chooses a nested loop to perform the join on columns of different data types instead of a hash join. A nested loop join is a more resource-intensive operation, and it should be avoided when working with larger datasets.
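For reference, the plans below can be reproduced with a plain explain on the join query (cost figures and slice counts will vary with the number of segments in your cluster):
testdb=# explain select * from sales a, leads b where a.lead_id = b.lead_id;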
Explain plan for no datatype mismatch:
Gather Motion 16:1 (slice2; segments: 16) (cost=241.00..707.00 rows=10000 width=20)
-> Hash Join (cost=241.00..707.00 rows=625 width=20)
Hash Cond: a.lead_id = b.lead_id
-> Redistribute Motion 16:16 (slice1; segments: 16) (cost=0.00..316.00 rows=625 width=12)
Hash Key: a.lead_id
-> Seq Scan on sales a (cost=0.00..116.00 rows=625 width=12)
-> Hash (cost=116.00..116.00 rows=625 width=8)
-> Seq Scan on leads b (cost=0.00..116.00 rows=625 width=8)
Settings: optimizer=off
Optimizer status: legacy query optimizer
(10 rows)
Explain plan for data type mismatch:
Gather Motion 16:1 (slice2; segments: 16) (cost=126.00..36001942.00 rows=10000 width=24)
-> Nested Loop (cost=126.00..36001942.00 rows=625 width=24)
Join Filter: a.lead_id = b.lead_id
-> Broadcast Motion 16:16 (slice1; segments: 16) (cost=0.00..1816.00 rows=10000 width=8)
-> Seq Scan on leads b (cost=0.00..116.00 rows=625 width=8)
-> Materialize (cost=126.00..226.00 rows=625 width=16)
-> Seq Scan on sales a (cost=0.00..116.00 rows=625 width=16)
Settings: optimizer=off
Optimizer status: legacy query optimizer
(9 rows)
1.) The first workaround is to turn on the Pivotal Query Optimizer, which generates the following explain plan using a hash join:
Gather Motion 16:1 (slice3; segments: 16) (cost=0.00..863.04 rows=10000 width=24)
-> Hash Join (cost=0.00..862.45 rows=625 width=24)
Hash Cond: sales.lead_id = leads.lead_id::bigint
-> Redistribute Motion 16:16 (slice1; segments: 16) (cost=0.00..431.07 rows=625 width=16)
Hash Key: sales.lead_id
-> Table Scan on sales (cost=0.00..431.02 rows=625 width=16)
-> Hash (cost=431.04..431.04 rows=625 width=8)
-> Redistribute Motion 16:16 (slice2; segments: 16) (cost=0.00..431.04 rows=625 width=8)
Hash Key: leads.lead_id::bigint
-> Table Scan on leads (cost=0.00..431.01 rows=625 width=8)
Settings: optimizer=on
Optimizer status: PQO version 2.6.0
(12 rows)
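Note that set optimizer=on only affects the current session. One way to keep the Pivotal Query Optimizer enabled for every session against the database is to set it at the database level (adjust the database name for your environment):
testdb=# alter database testdb set optimizer = on;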
2.) If you need to keep the optimizer off, the other workaround is to explicitly cast the join column so the data types match:
testdb=# explain select * from sales a, leads b where a.lead_id = b.lead_id::bigint;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Gather Motion 16:1 (slice3; segments: 16) (cost=441.00..2282.00 rows=100000 width=24)
-> Hash Join (cost=441.00..2282.00 rows=6250 width=24)
Hash Cond: a.lead_id = b.lead_id::bigint
-> Redistribute Motion 16:16 (slice1; segments: 16) (cost=0.00..316.00 rows=625 width=16)
Hash Key: a.lead_id
-> Seq Scan on sales a (cost=0.00..116.00 rows=625 width=16)
-> Hash (cost=316.00..316.00 rows=625 width=8)
-> Redistribute Motion 16:16 (slice2; segments: 16) (cost=0.00..316.00 rows=625 width=8)
Hash Key: b.lead_id::bigint
-> Seq Scan on leads b (cost=0.00..116.00 rows=625 width=8)
Settings: optimizer=off
Optimizer status: legacy query optimizer
(12 rows)
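With the explicit cast in place the legacy optimizer picks the hash join shown above, so the query itself would simply be written as:
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id::bigint;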