PXF cannot display some special characters (such as x²) with encodings other than UTF-8

Article ID: 296584

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Queries against an external HDFS table through PXF return unreadable characters when all of the following conditions are met:
  • The source file on HDFS uses an encoding other than UTF-8 (so far the issue has been observed with LATIN1, but other encodings may also be affected).
  • The source file contains special characters such as x².
  • PXF is running version 5.15 or 5.16.
Below is an example of this issue:

- Upload the CSV file to HDFS. Note that the CSV file in this example is encoded in LATIN1 (ISO-8859):
# file test_file.csv
test_file.csv: ISO-8859 text, with CRLF line terminators

// sample content of the CSV file
aaaaa,xxxxxx,,xxxxxx
bbbbb,xxxxxx,in/s²,xxxxxx
ccccc,xxxxxx,ft/s²,xxxxxx
- On Greenplum, create the external table:
CREATE EXTERNAL TABLE  cpsconfig (
property text,
context text,
property_value text,
scope text
)
LOCATION ('pxf://greenplum/pxf_test/test_csv/test_file.csv?PROFILE=hdfs:text&SERVER=hdfs')
FORMAT 'CSV' ( delimiter ',' null '\N' escape '' quote '"' ) ENCODING 'LATIN1';
- Select the data via PXF. Result: the special characters fail to display:
# psql -c "select * from cpsconfig limit 3;"
 property  | context | property_value |  scope
-----------+---------+----------------+---------
 aaaaa     | xxxxxx  |                | xxxxxx
 bbbbb     | xxxxxx  | in/s�        | xxxxxx
 ccccc     | xxxxxx  | ft/s�        | xxxxxx

# pxf version 
PXF version 5.16.0
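
The following sketch is not part of the original report and does not show PXF internals. It only illustrates, using standard tools, how the character ² is stored differently in LATIN1 and UTF-8. In LATIN1 it is the single byte 0xB2, which is not a valid UTF-8 sequence on its own, so a reader that treats the data as UTF-8 may show a replacement character as in the output above. The variable names are illustrative.

```shell
# Build the LATIN1 form of "ft/s²": \262 is the octal escape for byte 0xB2.
sample=$(printf 'ft/s\262')

# Convert from LATIN1 (ISO-8859-1) to UTF-8; "²" becomes the two bytes 0xC2 0xB2.
converted=$(printf '%s' "$sample" | iconv -f ISO-8859-1 -t UTF-8)

# Dump the raw bytes of both forms for comparison.
printf '%s' "$sample"    | od -An -tx1
printf '%s' "$converted" | od -An -tx1
```

Comparing the two hex dumps shows that only the last character differs between the encodings. The plain ASCII part of the field is identical in both.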


Environment

Product Version: 5.28

Resolution

This issue is scheduled to be fixed in PXF 5.16.2.

Workaround:

1. Copy pxf-site.xml to the server configuration directory if it does not already exist there:
$ cp $PXF_CONF/templates/pxf-site.xml $PXF_CONF/servers/hdfs/
2. Add the following property to the new XML file:
    <property>
        <name>pxf.reader.chunk-record-reader.enabled</name>
        <value>true</value>
    </property>
3. Run the query again:
gpadmin=# SELECT * from cpsconfig limit 3;
 property  | context | property_value |  scope
-----------+---------+----------------+---------
 aaaaa     | xxxxxx  |                | xxxxxx
 bbbbb     | xxxxxx  | in/s²          | xxxxxx
 ccccc     | xxxxxx  | ft/s²          | xxxxxx
(3 rows)
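
Steps 1 and 2 of the workaround can be sketched as two small shell helpers. This is a minimal sketch, not an official PXF tool. The function names `ensure_pxf_site` and `has_chunk_reader_enabled` are hypothetical, and the check assumes the `<value>` line directly follows the `<name>` line, as in the snippet above.

```shell
# Hypothetical helper: copy the pxf-site.xml template into a server directory
# only if that server does not already have a pxf-site.xml.
#   $1 = PXF configuration root (the article's $PXF_CONF)
#   $2 = server name (the article uses "hdfs")
ensure_pxf_site() {
    target="$1/servers/$2/pxf-site.xml"
    if [ ! -f "$target" ]; then
        cp "$1/templates/pxf-site.xml" "$target"
    fi
}

# Hypothetical helper: succeed only if the workaround property is present and
# the line right after its <name> element sets the value to true.
#   $1 = path to a pxf-site.xml file
has_chunk_reader_enabled() {
    grep -A1 '<name>pxf.reader.chunk-record-reader.enabled</name>' "$1" \
        | grep -q '<value>true</value>'
}
```

For the setup in this article, the helpers would be called as `ensure_pxf_site "$PXF_CONF" hdfs` and `has_chunk_reader_enabled "$PXF_CONF/servers/hdfs/pxf-site.xml"`. The second call returns a non-zero status until the property from step 2 has been added.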