Can I use a pipe and backslash symbol in a text Column when reading into Spark from Greenplum

Article ID: 296476


Products

VMware Tanzu Greenplum

Issue/Introduction

A customer encountered a problem with the Spark connector when trying to write a table from Greenplum into HDFS. If a text column has a value that contains a pipe character AND ends with a backslash, it is quoted incorrectly when written to the intermediate CSV format and fails to import into HDFS correctly. A reproduction follows, along with the result of calling .show() on the dataset after reading it back from HDFS in Spark.
 

CREATE TABLE escape_bug_test (
    id serial NOT NULL,
    name text NOT NULL,
    created timestamp with time zone,
    value text,
    CONSTRAINT escape_bug_test_pkey PRIMARY KEY (id)
) WITH (OIDS=FALSE) DISTRIBUTED BY (id);
INSERT INTO escape_bug_test VALUES (4, 'Foo\\', '2020-03-31 08:57:10.501594+11', 'bar'); 

INSERT INTO escape_bug_test VALUES (5, 'Must contain a | and have a backslash at the end\\', '2020-03-31 09:01:56.363635+11', 'Yes!'); 

INSERT INTO escape_bug_test VALUES (3, 'Test', '2020-03-31 08:55:14.986533+11', 'again'); 

INSERT INTO escape_bug_test VALUES (2, 'Test\\', '2020-03-31 08:45:12.542113+11', 'Not a date!'); 
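
The customer's exact export job is not shown, but the round trip can be exercised in Spark alone. The following is a hedged, self-contained sketch (the HDFS path and application name are placeholders, and the exact behavior varies by Spark version): it builds a DataFrame with the same rows, writes it to HDFS as pipe-delimited CSV with Spark's default quote and escape settings, and reads it back.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("escape-bug-repro").getOrCreate()
import spark.implicits._

// Same rows as the table above; "Foo\\" in Scala source is Foo followed by a
// single backslash.
val df = Seq(
  (4, "Foo\\", "2020-03-31 08:57:10.501594+11", "bar"),
  (5, "Must contain a | and have a backslash at the end\\", "2020-03-31 09:01:56.363635+11", "Yes!"),
  (3, "Test", "2020-03-31 08:55:14.986533+11", "again"),
  (2, "Test\\", "2020-03-31 08:45:12.542113+11", "Not a date!")
).toDF("id", "name", "created", "value")

val path = "hdfs://namenode:8020/tmp/escape_bug_test" // placeholder path

// Fields containing the delimiter are quoted on write; a value that also ends
// in a backslash is then followed immediately by the closing quote, leaving
// the ambiguous sequence \" in the file.
df.write.option("delimiter", "|").csv(path)

spark.read.option("delimiter", "|").csv(path).show(truncate = false)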


After importing into HDFS and reading with Spark:

+---+------------------------------------------------------------------------------------------------------------------------------+--------------------------+-----------+
|id |name                                                                                                                          |created                   |value      |
+---+------------------------------------------------------------------------------------------------------------------------------+--------------------------+-----------+
|2  |Test\                                                                                                                         |2020-03-31 08:45:12.542113|Not a date!|
|3  |Test                                                                                                                          |2020-03-31 08:55:14.986533|again      |
|5  |Must contain a | and have a backslash at the end"|2020-03-31 09:01:56.363635+11|Yes!
4|Foo\|2020-03-31 08:57:10.501594+11|bar
|null                      |null       |
+---+------------------------------------------------------------------------------------------------------------------------------+--------------------------+-----------+


You'll notice that although the table contains four rows, the Spark dataset has only three, and the last row's 'name' field contains the contents of the bottom two source rows.
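
The mechanism can be demonstrated on the corrupted line in isolation. A minimal sketch, assuming Spark's default CSV quote (") and escape (\) characters: because the 'name' value ends in a backslash immediately before the closing quote, the parser reads \" as an escaped literal quote rather than the end of the field, and the rest of the record (and, in a file, the following record) is pulled into the 'name' column.

import spark.implicits._ // reuses the SparkSession from the sketch above

// Line 5 exactly as it would appear in the pipe-delimited CSV file on HDFS.
val line = """5|"Must contain a | and have a backslash at the end\"|2020-03-31 09:01:56.363635+11|Yes!"""

spark.read
  .option("delimiter", "|")
  .option("quote", "\"")
  .option("escape", "\\") // Spark's default: \" inside a quoted field is a literal "
  .csv(Seq(line).toDS())
  .show(truncate = false)
// The 'name' column swallows the closing quote and everything after it,
// matching the corrupted row shown above.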

Environment

Product Version: 5.21

Resolution

In this case the Spark connector was used incorrectly. The connector provides high-speed, parallel data transfer between Greenplum Database and Apache Spark clusters; it is not intended for transferring data to HDFS. PXF can be used for that instead. Loading the same table from Greenplum directly into Spark with the connector worked as designed:
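
For reference, a hedged sketch of reading the table directly with the connector (the JDBC URL, credentials, and host names are placeholders; verify the option names against the connector documentation for your version):

val gpdf = spark.read
  .format("greenplum") // the Greenplum-Spark connector data source
  .option("url", "jdbc:postgresql://gpmaster:5432/testdb") // placeholder master host and database
  .option("user", "gpadmin")       // placeholder credentials
  .option("password", "changeme")
  .option("dbschema", "public")
  .option("dbtable", "escape_bug_test")
  .option("partitionColumn", "id") // column used to parallelize the read
  .load()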
 

scala> gpdf.show()

+---+--------------------+--------------------+-----------+
| id|                name|             created|      value|
+---+--------------------+--------------------+-----------+
|  2|              Test\\|2020-03-30 22:45:...|Not a date!|
|  3|                Test|2020-03-30 22:55:...|      again|
|  4|               Foo\\|2020-03-30 22:57:...|        bar|
|  5|Must contain a | ...|2020-03-30 23:01:...|       Yes!|
+---+--------------------+--------------------+-----------+