We will get unrecognizable code when querying the external HDFS table via PXF, with the below configuration:
- The source file on HDFS encodes with some encoding other than UTF-8 (so far we observed LATIN1 will have this issue but there could be more.)
- the source file has some special character like x²
- The PXF is running with code version at 5.15 and 5.16
Below is an example of this issue:
- upload the CSV file to HDFS, noted that the CSV in problem was encoded with LATIN1 (ISO-8859)
# file test_file.csv
cpsconfig.csv: ISO-8859 text, with CRLF line terminators
// some content of csv
aaaaa,xxxxxx,,xxxxxx
bbbbb,xxxxxx,in/s²,xxxxxx
ccccc,xxxxxx,ft/s²,xxxxxx
- On GP, create the external table
CREATE EXTERNAL TABLE cpsconfig (
property text,
context text,
property_value text,
scope text
)
LOCATION ('pxf://greenplum/pxf_test/test_csv/test_file.csv?PROFILE=hdfs:text&SERVER=hdfs')
FORMAT 'CSV' ( delimiter ',' null '\N' escape '' quote '"' ) ENCODING 'LATIN1';
- select the data via PXF, result: failed to display x²
# psql -c "select * from cpsconfig limit 3;"
property | context | property_value | scope
-----------+---------+----------------+---------
aaaaa | xxxxxx | | xxxxxx
bbbbb | xxxxxx | in/s� | xxxxxx
ccccc | xxxxxx | ft/s� | xxxxxx
# pxf version
PXF version 5.16.0