How GPDB query optimizer chooses broadcasting or redistribution

Products

VMware Tanzu Greenplum

Issue/Introduction

BROADCASTING and REDISTRIBUTION are two main ways of handling data among segments for join of tables.

BROADCASTING:

All the table data is copied to all the segments. The data volume to be processed is V * n (V is the data volume of the table involved, and n is the number of segments).

REDISTRIBUTION:

Redistribute the data according to the join column. The data volume to be processed is V (V is the data volume of the table involved).

Resolution

For example, we have two tables:

create table tab1(id int, id2 int) distributed by(id);        --10000rows

create table tab2(id int, id2 int) distributed by(id);        --50000rows

insert into tab1 select generate_series(1,10000,1), generate_series(1,5000,2);

insert into tab2 select generate_series(1,50000,1), generate_series(1,5000,2);

analyze tab1;

analyze tab2;

Here, we will take inner join and left outer join as examples, for right/full outer join, the situation is similar. Let's assume the size of tab1 is M = 10000 rows and the size of tab2 is N = 50000 rows, the number of segments is n, in my environment, I have two acting primary segments, so n = 2.

Innerjoin

1. explain analyze select * from tab1 a, tab2 b where a.id = b.id;
In such case, there is no need for broadcasting or redistribution as each segment contains all of its data needed for join.

2. explain analyze select * from tab1 a, tab2 b where a.id = b.id2;
In such case, there are two ways to handle this:

Redistributing tab2 based on column id2, the cost will be N = 50000
Broadcasting tab1, the cost will be M * n = 10000 * 2 = 20000

As the second way is less costly, so the query optimizer will choose broadcasting tab1 as shown in the execution plan:

-> Broadcast Motion 2:2 (slice1; segments: 2) (cost=0.00..412.

00 rows=10000 width=8)

 Rows out: Avg 10000.0 rows x 2 workers at destination. Ma

x 10000 rows (seg0) with 1.176 ms to first row, 3.471 ms to end, start offset by

 149 ms.

 -> Seq Scan on tab1 a (cost=0.00..112.00 rows=5000 width=8)

If we change tab2 to 15000 rows, the respective cost will be:

Redistributing tab2 = N = 15000
Broadcasting tab1 = M * n = 10000 * 2 = 20000

Therefore, the query optimizer will choose redistributing tab2 as shown in the execution plan:

-> Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..468.00

rows=7500 width=8)

 Hash Key: b.id2

 Rows out: Avg 7500.0 rows x 2 workers at destination. Max 7680

rows (seg0) with 0.068 ms to first row, 7.304 ms to end, start offset by 25 ms.

 -> Seq Scan on tab2 b (cost=0.00..168.00 rows=7500 width=8)

 Rows out: Avg 7500.0 rows x 2 workers. Max 7500 rows (seg

0) with 0.113 ms to first row, 1.641 ms to end, start offset by 23 ms

3. explain analyze select * from tab1 a, tab2 b where a.id2 = b.id2;
In such case, there are two ways to handle this:

Redistributing both the two tables based on column id2, the cost will be M + N = 10000 + 50000 = 60000
Broadcasting the smaller table, that is tab1, the cost will be min(10000,50000) * n = 10000 * 2 = 20000

As the second way is less costly, so the query optimizer will choose broadcasting tab1 as shown in the execution plan:

-> Broadcast Motion 2:2 (slice1; segments: 2) (cost=0.00..412.

00 rows=10000 width=8)

 Rows out: Avg 10000.0 rows x 2 workers at destination. Ma

x 10000 rows (seg0) with 0.084 ms to first row, 2.347 ms to end, start offset by

 1.894 ms.

 -> Seq Scan on tab1 a (cost=0.00..112.00 rows=5000 width=8)

Therefore, the query optimizer will choose redistributing both tab1 and tab2.

So based on the tests above, we can see that on prerequisite for data correctness, query optimizer makes its choose on broadcasting or redistribution based on the number of rows that it has to process.

Limitation

This concept is working on legacy planner only so far, for orca, it may have different conclusion.