A customer encountered a problem with the Spark connector when they tried to write a table from Greenplum into HDFS. If a text column has a value that contains a pipe character AND ends with a backslash, it is incorrectly quoted when written to the intermediate CSV format and fails to import into HDFS correctly. Please find the reproduction below, along with the results of calling the .show() function on the dataset after reading it back from HDFS in Spark.
CREATE TABLE escape_bug_test (
    id serial NOT NULL,
    name text NOT NULL,
    created timestamp with time zone,
    value text,
    CONSTRAINT escape_bug_test_pkey PRIMARY KEY (id)
)
WITH (OIDS=FALSE)
DISTRIBUTED BY (id);
INSERT INTO escape_bug_test VALUES (4, 'Foo\\', '2020-03-31 08:57:10.501594+11', 'bar');
INSERT INTO escape_bug_test VALUES (5, 'Must contain a | and have a backslash at the end\\', '2020-03-31 09:01:56.363635+11', 'Yes!');
INSERT INTO escape_bug_test VALUES (3, 'Test', '2020-03-31 08:55:14.986533+11', 'again');
INSERT INTO escape_bug_test VALUES (2, 'Test\\', '2020-03-31 08:45:12.542113+11', 'Not a date!');
After importing into HDFS and reading with Spark:
+---+------------------------------------------------------------------------------------------------------------------------------+--------------------------+-----------+
|id |name                                                                                                                          |created                   |value      |
+---+------------------------------------------------------------------------------------------------------------------------------+--------------------------+-----------+
|2  |Test\                                                                                                                         |2020-03-31 08:45:12.542113|Not a date!|
|3  |Test                                                                                                                          |2020-03-31 08:55:14.986533|again      |
|5  |Must contain a | and have a backslash at the end"|2020-03-31 09:01:56.363635+11|Yes!
4|Foo\|2020-03-31 08:57:10.501594+11|bar                                                                                           |null                      |null       |
+---+------------------------------------------------------------------------------------------------------------------------------+--------------------------+-----------+
You'll notice that although the table contains 4 rows, the Spark dataset only has 3: the 'name' field of the last row has absorbed the remainder of its own row (id 5) plus the entire row with id 4.
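For reference, the merged row is what you would expect if the intermediate pipe-delimited files were read back with a plain CSV read along the following lines; the path, delimiter and column names here are assumptions, not taken from the customer's job. With Spark's default CSV escape character (backslash), a quoted field that ends in a backslash escapes its own closing quote, so the parser keeps consuming the following records as part of the same field.

// Sketch only: hypothetical path and options for reading the intermediate
// files back from HDFS. The backslash escape (Spark's default) is what turns
// a field ending in '\' into an unterminated quoted field.
val hdfsDf = spark.read
  .option("delimiter", "|")
  .option("quote", "\"")
  .option("escape", "\\")
  .option("inferSchema", "true")
  .csv("hdfs:///tmp/escape_bug_test")
  .toDF("id", "name", "created", "value")

hdfsDf.show(false)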
Product Version: 5.21
In this case the Spark connector was used incorrectly. The connector provides high-speed, parallel data transfer between Greenplum Database and Apache Spark clusters; it is not intended for transferring data to HDFS, which is what PXF is for. Loading the same table from Greenplum into Spark directly with the connector worked as designed, as shown in the sketch and output below.
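A minimal sketch of that direct load, assuming the connector's documented 'greenplum' data source name; the JDBC URL, schema and credentials below are placeholders rather than the customer's values.

// Sketch only: load the Greenplum table straight into a Spark DataFrame
// using the Greenplum-Spark connector (no intermediate files on HDFS).
val gpdf = spark.read
  .format("greenplum")
  .option("url", "jdbc:postgresql://gpmaster:5432/testdb")   // hypothetical master host/database
  .option("dbschema", "public")
  .option("dbtable", "escape_bug_test")
  .option("user", "gpadmin")
  .option("password", "changeme")
  .option("partitionColumn", "id")
  .load()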
scala> gpdf.show()
+---+--------------------+--------------------+-----------+
| id|                name|             created|      value|
+---+--------------------+--------------------+-----------+
|  2|              Test\\|2020-03-30 22:45:...|Not a date!|
|  3|                Test|2020-03-30 22:55:...|      again|
|  4|               Foo\\|2020-03-30 22:57:...|        bar|
|  5|Must contain a | ...|2020-03-30 23:01:...|       Yes!|
+---+--------------------+--------------------+-----------+