How to read a view in a Greenplum Database using the Spark-connector


Article ID: 296764


Products

VMware Tanzu Greenplum

Issue/Introduction

When accessing a view using the Spark connector, the following error is reported:
ERROR: column "gp_segment_id" does not exist

If the column "gp_segment_id" is added to the view, the following error will be reported:
ERROR: column name "gp_segment_id" conflicts with a system column name


Environment

Product Version: 6.18

Resolution

NOTE: This only applies to standard views; it does NOT apply to materialized views.

When reading data from Greenplum into Spark, the data has to be split into chunks that can be operated on in parallel. When reading from a table, an obvious choice for splitting the data is by the value of gp_segment_id. The connector will by default use gp_segment_id to do this.

However, views do not have a gp_segment_id column, so some extra configuration is required to read them. Either:

  • Specify a different integer-typed column for the connector to split the data on using the partitionColumn read option,

OR

  • Define the view to include gp_segment_id (e.g., CREATE VIEW ... AS SELECT gp_segment_id, ... FROM ...). With this second approach, any query that requests all columns in the dataframe/view will fail (e.g., df.show()), but any query that includes column projection (e.g., df.select("col1", "col2").where("col3 > 123.45").show()) will succeed.
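As a rough sketch, the two approaches could look like the following in PySpark. All connection details, view names, and column names here are hypothetical placeholders; adjust them for your environment.

```python
# Sketch only: host, database, credentials, view, and column names are
# hypothetical and must be replaced with your own values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gp-view-read").getOrCreate()

gp_options = {
    "url": "jdbc:postgresql://gpmaster:5432/mydb",  # hypothetical master host/db
    "user": "gpadmin",
    "password": "changeme",
    "dbschema": "public",
}

# Approach 1: tell the connector to split on an integer-typed column of
# the view instead of the default gp_segment_id.
df1 = (spark.read.format("greenplum")
       .options(**gp_options)
       .option("dbtable", "my_view")        # hypothetical view name
       .option("partitionColumn", "id")     # any integer-typed view column
       .load())
df1.show()

# Approach 2: the view itself exposes gp_segment_id, e.g. created with
#   CREATE VIEW my_view2 AS SELECT gp_segment_id, col1, col2, col3 FROM my_table;
df2 = (spark.read.format("greenplum")
       .options(**gp_options)
       .option("dbtable", "my_view2")
       .load())

# With approach 2, always project specific columns; df2.show() would fail.
df2.select("col1", "col2").where("col3 > 123.45").show()
```

Note that with approach 1 no changes to the view definition are needed, which is usually the simpler choice when the view already contains a suitable integer column.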