PXF 6.5.0 - Changes in Handling of JSON Multi-line Files

Article ID: 296849

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Before PXF 6.5.0, a fragment could parse a JSON object incorrectly, especially an object containing special characters, when its split started in the middle of a string. This produced wrong query results. In PXF 6.5.0, the JsonRecordReader was refactored to use the LineRecordReader internally when handling multi-line JSON (JavaScript Object Notation) files.

The refactoring also resolves an issue in which splittable compression codecs, such as BZip2, produced duplicate rows.

Environment

Product Version: 6.21

Resolution

PXF introduces a new CREATE EXTERNAL TABLE option for the "*:json" profiles, SPLIT_BY_FILE, which specifies how PXF splits the data it reads. The default value is false: PXF creates multiple splits for each file and processes them in parallel. When the option is set to true, PXF creates and processes a single split per file. This can slow the query down when there are only a few files to read.

If a query returns incorrect data or results, create an external table that sets SPLIT_BY_FILE=true in the LOCATION URI (the default is false) and check whether this resolves the issue.

For example:  

CREATE READABLE EXTERNAL TABLE foo (a int, b int)
LOCATION ('pxf://<data-location>?PROFILE=hdfs:json&SERVER=<server-name>&SPLIT_BY_FILE=true')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

If the setting resolves the issue, you can continue to use it when running the query. However, if it significantly degrades performance, open a new Support ticket and include the logs, the DDL, and, if possible, a sample of the data that reproduces the issue.
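One way to check whether the setting changes the results is to define two external tables over the same data, one using the default and one using SPLIT_BY_FILE=true, and compare what they return. The following is a minimal sketch, not an official procedure. The table names foo_default and foo_by_file are hypothetical, the column list is taken from the example above, and <data-location> and <server-name> remain placeholders that you must fill in.

```sql
-- Hypothetical table using the default behavior (SPLIT_BY_FILE omitted, i.e. false).
CREATE READABLE EXTERNAL TABLE foo_default (a int, b int)
LOCATION ('pxf://<data-location>?PROFILE=hdfs:json&SERVER=<server-name>')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Hypothetical table over the same data, one split per file.
CREATE READABLE EXTERNAL TABLE foo_by_file (a int, b int)
LOCATION ('pxf://<data-location>?PROFILE=hdfs:json&SERVER=<server-name>&SPLIT_BY_FILE=true')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Compare the row counts returned by the two tables.
SELECT (SELECT count(*) FROM foo_default) AS rows_default,
       (SELECT count(*) FROM foo_by_file) AS rows_by_file;

-- List rows (including duplicates) that the default table returns
-- but the SPLIT_BY_FILE=true table does not.
SELECT a, b FROM foo_default
EXCEPT ALL
SELECT a, b FROM foo_by_file;

-- Report execution time for each variant to gauge the performance impact.
EXPLAIN ANALYZE SELECT count(*) FROM foo_default;
EXPLAIN ANALYZE SELECT count(*) FROM foo_by_file;
```

A difference in the counts, or any rows returned by the EXCEPT ALL query, only shows that the two settings disagree. It does not by itself show which result is correct, so it is worth checking a few of the differing rows against the source files.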