This article addresses an intermittent performance degradation (approximately 25% of the time) observed when executing a CREATE TEMPORARY TABLE AS SELECT (CTAS) statement where the DISTRIBUTED BY clause is specified after the AS SELECT statement.
The following query exhibits inconsistent performance, with a noticeable slowdown in approximately one-quarter of its executions:
Example Query (Suboptimal Plan Possible)
SQL
CREATE TEMPORARY TABLE tmp_transactionid AS
SELECT DISTINCT transactionid
FROM source_schema.source_table
DISTRIBUTED BY (transactionid);
The performance issue is linked to how the legacy Greenplum Planner optimizes the statement. When the DISTRIBUTED BY clause is placed after the AS SELECT statement, the Planner may prioritize generating a plan for the inner SELECT query first, often based on the distribution key of the source table (source_schema.source_table) to minimize intermediate redistribution.
If this initial plan's intermediate data distribution doesn't match the final required distribution (DISTRIBUTED BY (transactionid)), the Planner is forced to insert a second, costly Redistribution Motion operator at the very end of the plan to enforce the desired distribution on the new temporary table. This double-redistribution or unoptimized final redistribution causes the performance bottleneck.
The performance inconsistency is resolved by reordering the CTAS statement to place the DISTRIBUTED BY clause before the AS SELECT statement.
Updated Query (Optimal Plan Generated)
SQL
CREATE TEMPORARY TABLE tmp_transactionid
DISTRIBUTED BY (transactionid)
AS
SELECT DISTINCT transactionid
FROM source_schema.source_table;
When the DISTRIBUTED BY clause is specified first, the legacy Planner is signaled early to make the final table's distribution key a primary constraint during the entire plan generation. This allows the optimizer to generate a single, more efficient plan that redistributes the data only once to satisfy the required DISTRIBUTED BY (transactionid) clause, thus eliminating the unnecessary second redistribution and resolving the performance issue.