When executing a query that joins two tables on columns of different data types, there is a noticeable performance lag compared to joining columns of the same data type. The following example demonstrates this with a join between a bigint column and a regular integer column:
create table sales (id integer, lead_id integer, rand_num integer) distributed by (id);
create table leads (lead_id integer, rand_num integer) distributed by (lead_id);
testdb=# \d sales
Table "public.sales"
Column | Type | Modifiers
----------+---------+-----------
id | integer |
lead_id | integer |
rand_num | integer |
Distributed by: (id)
testdb=# \d leads
Table "public.leads"
Column | Type | Modifiers
----------+---------+-----------
lead_id | integer |
rand_num | integer |
Distributed by: (lead_id)
Insert some test data into the tables:
insert into sales
select generate_series(1,10000), (random()*2000)::int, (random()*200)::int
from generate_series(1,1) as x(n);
insert into leads
select generate_series(1,10000), (random()*2000)::int
from generate_series(1,1) as x(n);
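Note that the tables are not analyzed in this example, which is why the missing-statistics notice appears in the next query's output. If you want the planner to work from collected statistics, run analyze first:
analyze sales;
analyze leads;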
With the tables joined on lead_id and no datatype mismatch, the execution time is about the same whether the optimizer is on or off:
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
NOTICE: One or more columns in the following table(s) do not have statistics: sales
HINT: For non-partitioned tables, run analyze <table_name>(<column_list>). For partitioned tables, run analyze rootpartition <table_name>(<column_list>). See log for columns missing statistics.
id | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
31 | 666 | 141 | 666 | 1976
942 | 1052 | 52 | 1052 | 1941
974 | 933 | 43 | 933 | 128
1027 | 1654 | 27 | 1654 | 225
Time: 542.267 ms
testdb=# set optimizer=off;
SET
Time: 142.155 ms
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
id | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
324 | 1213 | 82 | 1213 | 403
569 | 980 | 168 | 980 | 357
681 | 687 | 6 | 687 | 1699
926 | 804 | 34 | 804 | 312
1091 | 1635 | 143 | 1635 | 1984
Time: 178.780 ms
However, when you change lead_id in the sales table to a bigint, the execution time for the same query increases dramatically with the optimizer off:
testdb=# alter table sales alter lead_id type bigint;
ALTER TABLE
testdb=# set optimizer=on;
SET
Time: 5.063 ms
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
id | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
3591 | 16 | 179 | 16 | 1590
3024 | 32 | 4 | 32 | 514
3639 | 32 | 91 | 32 | 514
7217 | 32 | 146 | 32 | 514
4578 | 64 | 198 | 64 | 1680
Time: 201.281 ms
testdb=# set optimizer=off;
SET
Time: 5.063 ms
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
id | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
2907 | 2 | 193 | 2 | 1830
1182 | 34 | 38 | 34 | 1781
8477 | 34 | 158 | 34 | 1781
7238 | 50 | 103 | 50 | 463
5764 | 130 | 40 | 130 | 1057
Time: 2106.507 ms
As you can see, with the optimizer off a simple join between columns of different data types is about 10x slower, even in this test case of only 10,000 rows.
When you run an explain on the select query, you can see that the legacy query optimizer chooses a nested loop to perform the join on columns of different data types instead of a hash join. A nested loop join is a more resource-intensive operation, and it should be avoided when working with larger datasets.
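For reference, the plans below can be reproduced with a plain explain on the join query (cost figures and slice counts will vary with the number of segments in your cluster):
testdb=# explain select * from sales a, leads b where a.lead_id = b.lead_id;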
Explain plan for no datatype mismatch:
Gather Motion 16:1 (slice2; segments: 16) (cost=241.00..707.00 rows=10000 width=20)
-> Hash Join (cost=241.00..707.00 rows=625 width=20)
Hash Cond: a.lead_id = b.lead_id
-> Redistribute Motion 16:16 (slice1; segments: 16) (cost=0.00..316.00 rows=625 width=12)
Hash Key: a.lead_id
-> Seq Scan on sales a (cost=0.00..116.00 rows=625 width=12)
-> Hash (cost=116.00..116.00 rows=625 width=8)
-> Seq Scan on leads b (cost=0.00..116.00 rows=625 width=8)
Settings: optimizer=off
Optimizer status: legacy query optimizer
(10 rows)
Explain plan for data type mismatch:
Gather Motion 16:1 (slice2; segments: 16) (cost=126.00..36001942.00 rows=10000 width=24)
-> Nested Loop (cost=126.00..36001942.00 rows=625 width=24)
Join Filter: a.lead_id = b.lead_id
-> Broadcast Motion 16:16 (slice1; segments: 16) (cost=0.00..1816.00 rows=10000 width=8)
-> Seq Scan on leads b (cost=0.00..116.00 rows=625 width=8)
-> Materialize (cost=126.00..226.00 rows=625 width=16)
-> Seq Scan on sales a (cost=0.00..116.00 rows=625 width=16)
Settings: optimizer=off
Optimizer status: legacy query optimizer
(9 rows)
1.) The first workaround is to turn on the Pivotal Query Optimizer, which generates the following explain plan using a hash join:
Gather Motion 16:1 (slice3; segments: 16) (cost=0.00..863.04 rows=10000 width=24)
-> Hash Join (cost=0.00..862.45 rows=625 width=24)
Hash Cond: sales.lead_id = leads.lead_id::bigint
-> Redistribute Motion 16:16 (slice1; segments: 16) (cost=0.00..431.07 rows=625 width=16)
Hash Key: sales.lead_id
-> Table Scan on sales (cost=0.00..431.02 rows=625 width=16)
-> Hash (cost=431.04..431.04 rows=625 width=8)
-> Redistribute Motion 16:16 (slice2; segments: 16) (cost=0.00..431.04 rows=625 width=8)
Hash Key: leads.lead_id::bigint
-> Table Scan on leads (cost=0.00..431.01 rows=625 width=8)
Settings: optimizer=on
Optimizer status: PQO version 2.6.0
(12 rows)
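Note that set optimizer=on only affects the current session. One way to keep the Pivotal Query Optimizer enabled for every session against the database is to set it at the database level (adjust the database name for your environment):
testdb=# alter database testdb set optimizer = on;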
2.) If you need to keep the optimizer off, the other workaround is to explicitly cast the join column so the data types match:
testdb=# explain select * from sales a, leads b where a.lead_id = b.lead_id::bigint;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Gather Motion 16:1 (slice3; segments: 16) (cost=441.00..2282.00 rows=100000 width=24)
-> Hash Join (cost=441.00..2282.00 rows=6250 width=24)
Hash Cond: a.lead_id = b.lead_id::bigint
-> Redistribute Motion 16:16 (slice1; segments: 16) (cost=0.00..316.00 rows=625 width=16)
Hash Key: a.lead_id
-> Seq Scan on sales a (cost=0.00..116.00 rows=625 width=16)
-> Hash (cost=316.00..316.00 rows=625 width=8)
-> Redistribute Motion 16:16 (slice2; segments: 16) (cost=0.00..316.00 rows=625 width=8)
Hash Key: b.lead_id::bigint
-> Seq Scan on leads b (cost=0.00..116.00 rows=625 width=8)
Settings: optimizer=off
Optimizer status: legacy query optimizer
(12 rows)
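With the explicit cast in place the legacy optimizer picks the hash join shown above, so the query itself would simply be written as:
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id::bigint;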