When executing a query that joins two tables on columns of different data types, there is a noticeable performance lag compared to joining columns of the same data type. The following example uses a join between a bigint column and a regular integer column:
create table sales (id integer, lead_id integer, rand_num integer) distributed by (id);
create table leads (lead_id integer, rand_num integer) distributed by (lead_id);

testdb=# \d sales;
    Table "public.sales"
  Column  |  Type   | Modifiers
----------+---------+-----------
 id       | integer |
 lead_id  | integer |
 rand_num | integer |
Distributed by: (id)

testdb=# \d leads;
    Table "public.leads"
  Column  |  Type   | Modifiers
----------+---------+-----------
 lead_id  | integer |
 rand_num | integer |
Distributed by: (lead_id)
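The timing figures shown in the sessions below come from psql's \timing toggle; to reproduce them, enable it first (a minimal sketch, assuming you are connected with psql):

testdb=# \timing
Timing is on.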
Insert some test data into the tables:
insert into sales SELECT generate_series (1,10000), (random()*2000::int), (random()*200::int) FROM generate_series (1,1) AS x(n);
insert into leads SELECT generate_series (1,10000), (random()*2000::int) FROM generate_series (1,1) AS x(n);
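The NOTICE about missing statistics in the first session below can be avoided by collecting statistics after the load, as the hint itself suggests (a minimal sketch):

analyze sales;
analyze leads;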
With the tables joined on lead_id and no data type mismatch, the execution time is roughly the same whether the optimizer is on or off:
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
NOTICE:  One or more columns in the following table(s) do not have statistics: sales
HINT:  For non-partitioned tables, run analyze <table_name>(<column_list>). For partitioned tables, run analyze rootpartition <table_name>(<column_list>). See log for columns missing statistics.
  id   | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
    31 |     666 |      141 |     666 |     1976
   942 |    1052 |       52 |    1052 |     1941
   974 |     933 |       43 |     933 |      128
  1027 |    1654 |       27 |    1654 |      225
Time: 542.267 ms

testdb=# set optimizer=off;
SET
Time: 142.155 ms
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
  id   | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
   324 |    1213 |       82 |    1213 |      403
   569 |     980 |      168 |     980 |      357
   681 |     687 |        6 |     687 |     1699
   926 |     804 |       34 |     804 |      312
  1091 |    1635 |      143 |    1635 |     1984
Time: 178.780 ms

However, when you change the lead_id in the sales table to a bigint, the execution time changes for the query with the optimizer off:

testdb=# alter table sales alter lead_id type bigint;
ALTER TABLE
testdb=# set optimizer=on;
SET
Time: 5.063 ms
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
  id   | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
  3591 |      16 |      179 |      16 |     1590
  3024 |      32 |        4 |      32 |      514
  3639 |      32 |       91 |      32 |      514
  7217 |      32 |      146 |      32 |      514
  4578 |      64 |      198 |      64 |     1680
Time: 201.281 ms

testdb=# set optimizer=off;
SET
Time: 5.063 ms
testdb=# select * from sales a, leads b where a.lead_id = b.lead_id;
  id   | lead_id | rand_num | lead_id | rand_num
-------+---------+----------+---------+----------
  2907 |       2 |      193 |       2 |     1830
  1182 |      34 |       38 |      34 |     1781
  8477 |      34 |      158 |      34 |     1781
  7238 |      50 |      103 |      50 |      463
  5764 |     130 |       40 |     130 |     1057
Time: 2106.507 ms
As you can see, with the legacy optimizer (optimizer=off) a simple join between columns of different data types is about 10x slower in this test case of 10,000 rows.
When you run EXPLAIN (or EXPLAIN ANALYZE) on the select query, you will see that the legacy optimizer chooses a nested loop instead of a hash join to perform the join on columns of different data types. Nested loops are a more resource-intensive operation, and it is best to avoid them when working with larger data sets.
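For example, a sketch of the commands against the test tables above (explain shows the chosen plan; explain analyze also executes the query and reports actual run times):

testdb=# explain select * from sales a, leads b where a.lead_id = b.lead_id;
testdb=# explain analyze select * from sales a, leads b where a.lead_id = b.lead_id;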
Explain plan with no data type mismatch:
 Gather Motion 16:1  (slice2; segments: 16)  (cost=241.00..707.00 rows=10000 width=20)
   ->  Hash Join  (cost=241.00..707.00 rows=625 width=20)
         Hash Cond: a.lead_id = b.lead_id
         ->  Redistribute Motion 16:16  (slice1; segments: 16)  (cost=0.00..316.00 rows=625 width=12)
               Hash Key: a.lead_id
               ->  Seq Scan on sales a  (cost=0.00..116.00 rows=625 width=12)
         ->  Hash  (cost=116.00..116.00 rows=625 width=8)
               ->  Seq Scan on leads b  (cost=0.00..116.00 rows=625 width=8)
 Settings:  optimizer=off
 Optimizer status: legacy query optimizer
(10 rows)
Explain plan with the data type mismatch:
 Gather Motion 16:1  (slice2; segments: 16)  (cost=126.00..36001942.00 rows=10000 width=24)
   ->  Nested Loop  (cost=126.00..36001942.00 rows=625 width=24)
         Join Filter: a.lead_id = b.lead_id
         ->  Broadcast Motion 16:16  (slice1; segments: 16)  (cost=0.00..1816.00 rows=10000 width=8)
               ->  Seq Scan on leads b  (cost=0.00..116.00 rows=625 width=8)
         ->  Materialize  (cost=126.00..226.00 rows=625 width=16)
               ->  Seq Scan on sales a  (cost=0.00..116.00 rows=625 width=16)
 Settings:  optimizer=off
 Optimizer status: legacy query optimizer
(9 rows)
There are two workarounds. 1.) Turning on the Pivotal Query Optimizer generates the following explain plan, which uses a hash join:
 Gather Motion 16:1  (slice3; segments: 16)  (cost=0.00..863.04 rows=10000 width=24)
   ->  Hash Join  (cost=0.00..862.45 rows=625 width=24)
         Hash Cond: sales.lead_id = leads.lead_id::bigint
         ->  Redistribute Motion 16:16  (slice1; segments: 16)  (cost=0.00..431.07 rows=625 width=16)
               Hash Key: sales.lead_id
               ->  Table Scan on sales  (cost=0.00..431.02 rows=625 width=16)
         ->  Hash  (cost=431.04..431.04 rows=625 width=8)
               ->  Redistribute Motion 16:16  (slice2; segments: 16)  (cost=0.00..431.04 rows=625 width=8)
                     Hash Key: leads.lead_id::bigint
                     ->  Table Scan on leads  (cost=0.00..431.01 rows=625 width=8)
 Settings:  optimizer=on
 Optimizer status: PQO version 2.6.0
(12 rows)
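Note that set optimizer=on only affects the current session. If you want the Pivotal Query Optimizer enabled by default for the database, a sketch of the database-level setting (assuming superuser or database-owner privileges) would be:

alter database testdb set optimizer = on;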
2.) If you need to keep the Pivotal Query Optimizer off, the other workaround is to explicitly cast the join column so the data types match:
testdb=# explain select * from sales a, leads b where a.lead_id = b.lead_id::bigint;
                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Gather Motion 16:1  (slice3; segments: 16)  (cost=441.00..2282.00 rows=100000 width=24)
   ->  Hash Join  (cost=441.00..2282.00 rows=6250 width=24)
         Hash Cond: a.lead_id = b.lead_id::bigint
         ->  Redistribute Motion 16:16  (slice1; segments: 16)  (cost=0.00..316.00 rows=625 width=16)
               Hash Key: a.lead_id
               ->  Seq Scan on sales a  (cost=0.00..116.00 rows=625 width=16)
         ->  Hash  (cost=316.00..316.00 rows=625 width=8)
               ->  Redistribute Motion 16:16  (slice2; segments: 16)  (cost=0.00..316.00 rows=625 width=8)
                     Hash Key: b.lead_id::bigint
                     ->  Seq Scan on leads b  (cost=0.00..116.00 rows=625 width=8)
 Settings:  optimizer=off
 Optimizer status: legacy query optimizer
(12 rows)
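If the cast would have to be added to many queries, another option (a sketch, assuming the column can be altered and no dependent objects block the change) is to align the column types permanently, mirroring the earlier alter on the sales table:

testdb=# alter table leads alter lead_id type bigint;
ALTER TABLE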