Greenplum intermittent performance degradation in CTAS: Avoiding Suboptimal Plans
search cancel

Greenplum intermittent performance degradation in CTAS: Avoiding Suboptimal Plans

book

Article ID: 418154

calendar_today

Updated On:

Products

VMware Tanzu Data Suite VMware Tanzu Greenplum

Issue/Introduction

This article addresses an intermittent performance degradation (approximately 25% of the time) observed when executing a CREATE TEMPORARY TABLE AS SELECT (CTAS) statement where the DISTRIBUTED BY clause is specified after the AS SELECT statement.

The following query exhibits inconsistent performance, with a noticeable slowdown in approximately one-quarter of its executions:

Example Query (Suboptimal Plan Possible)

SQL

CREATE TEMPORARY TABLE tmp_transactionid AS
  SELECT DISTINCT transactionid
  FROM source_schema.source_table
  DISTRIBUTED BY (transactionid);

Environment

  • Product: Greenplum Database
  • Version: 6.28.2 (and earlier versions utilizing the legacy Planner optimizer)

Cause

The performance issue is linked to how the legacy Greenplum Planner optimizes the statement. When the DISTRIBUTED BY clause is placed after the AS SELECT statement, the Planner may prioritize generating a plan for the inner SELECT query first, often based on the distribution key of the source table (source_schema.source_table) to minimize intermediate redistribution.

If this initial plan's intermediate data distribution doesn't match the final required distribution (DISTRIBUTED BY (transactionid)), the Planner is forced to insert a second, costly Redistribution Motion operator at the very end of the plan to enforce the desired distribution on the new temporary table. This double-redistribution or unoptimized final redistribution causes the performance bottleneck.

Resolution

The performance inconsistency is resolved by reordering the CTAS statement to place the DISTRIBUTED BY clause before the AS SELECT statement.

Updated Query (Optimal Plan Generated)

SQL

CREATE TEMPORARY TABLE tmp_transactionid
  DISTRIBUTED BY (transactionid)
AS
  SELECT DISTINCT transactionid
  FROM source_schema.source_table;

When the DISTRIBUTED BY clause is specified first, the legacy Planner is signaled early to make the final table's distribution key a primary constraint during the entire plan generation. This allows the optimizer to generate a single, more efficient plan that redistributes the data only once to satisfy the required DISTRIBUTED BY (transactionid) clause, thus eliminating the unnecessary second redistribution and resolving the performance issue.

Additional Information

References:

Tanzu Greenplum GPORCA

Tanzu Greenplum Architecture

Tanzu Greenplum CREATE TABLE AS Syntax

Tanzu Greenplum SQL Commands

Tanzu Greenplum: Determining the Query Optimizer that is Used