Failed to Load Empty String from External Table into Numeric Field


Article ID: 295737


Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Symptoms:

When attempting to load empty strings (i.e. "") from a data file into numeric fields (e.g. integer) through a PXF-based external table, the load fails with "data formatting errors."

Error Message:

gpadmin=# insert into a1 select * from a1_ext1;
NOTICE: Found 2 data formatting errors (2 or more input rows). Rejected related input data.

Environment


Cause

The failure was caused by invalid input syntax for the target integer type when an empty string ("") is supplied as input.
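The underlying conversion error can be reproduced directly in psql, independent of the external table, by casting an empty string to integer:

gpadmin=# select ''::integer;
ERROR:  invalid input syntax for integer: ""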

RCA

The internal HEAP table and external table are defined as follows:

gpadmin=# \d a1
Append-Only Table "public.a1"
 Column | Type | Modifiers
--------+---------+-----------
 c1 | text |
 c2 | integer |
 c3 | text |
gpadmin=# \d a1_ext1
External table "public.a1_ext1"
 Column | Type | Modifiers
--------+---------+-----------
 c1 | text |
 c2 | integer |
 c3 | text |
Type: readable
Encoding: UTF8
Format type: csv
Format options: delimiter ',' null '' escape '"' quote '"' fill missing fields
External location: pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple
Segment reject limit: 10000 rows
Error table: err_a1_ext1
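The DDL corresponding to this definition (reconstructed here from the \d output above; the original statement may have differed in detail) would look roughly like:

CREATE READABLE EXTERNAL TABLE a1_ext1 (c1 text, c2 integer, c3 text)
LOCATION ('pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple')
FORMAT 'CSV' (DELIMITER ',' NULL '' ESCAPE '"' QUOTE '"' FILL MISSING FIELDS)
ENCODING 'UTF8'
LOG ERRORS INTO err_a1_ext1 SEGMENT REJECT LIMIT 10000 ROWS;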

The raw data in the source file (ea1.dat) looks like this:

[gpadmin@hdm2 50074]$ cat ea1.dat
"12354.0","","N/A"

Because the rows are rejected, they are logged into the error table, from which the following information can be queried:

gpadmin=# select * from err_a1_ext1;
           cmdtime            | relname |                              filename                              | linenum | bytenum |                             errmsg                             |      rawdata       | rawbytes
------------------------------+---------+--------------------------------------------------------------------+---------+---------+----------------------------------------------------------------+--------------------+----------
 2017-02-28 10:30:09.57121+08 | a1_ext1 | pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple | 1       |         | invalid input syntax for integer: "", column c2                | "12354.0","","N/A" |
 2017-02-28 10:30:09.57121+08 | a1_ext1 | pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple | 2       |         | missing data for column "c2", found empty data line, column c2 |                    |
(2 rows)

The rejection reason is clearly invalid input syntax for integer: "". In other words, HDB has no built-in rule for converting an empty string into a valid numeric value.

 

Resolution

Since HDB cannot perform this type conversion automatically, convert the empty string explicitly before it is passed to HDB's internal conversion:

1. Specify the "text" type for the empty-string field in the external table definition:

gpadmin=# \d a1_ext
External table "public.a1_ext"
 Column | Type | Modifiers
--------+------+-----------
 c1 | text |
 c2 | text |
 c3 | text |
Type: readable
Encoding: UTF8
Format type: csv
Format options: delimiter ',' null '' escape '"' quote '"' fill missing fields
External location: pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple
Segment reject limit: 10000 rows
Error table: err_a1_ext
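The DDL for this all-text definition (reconstructed from the \d output above; the original statement may have differed in detail) would look roughly like:

CREATE READABLE EXTERNAL TABLE a1_ext (c1 text, c2 text, c3 text)
LOCATION ('pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple')
FORMAT 'CSV' (DELIMITER ',' NULL '' ESCAPE '"' QUOTE '"' FILL MISSING FIELDS)
ENCODING 'UTF8'
LOG ERRORS INTO err_a1_ext SEGMENT REJECT LIMIT 10000 ROWS;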

2. Convert the empty string to NULL with the NULLIF() function before inserting the data into the internal heap table:

gpadmin=# insert into a1 select c1, nullif(c2,'')::integer, c3 from a1_ext;
INSERT 0 1
gpadmin=# select * from a1;
 c1 | c2 | c3
---------+----+-----
 12354.0 | | N/A
(1 row)

Instead of using NULLIF(), an alternative is to write a UDF that first converts the empty string to NULL or 0.
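Such a UDF might look like the following sketch (the function name empty_to_null_int is illustrative, not part of the original article; it maps '' to NULL and casts everything else to integer):

CREATE OR REPLACE FUNCTION empty_to_null_int(v text) RETURNS integer AS $$
  SELECT CASE WHEN v = '' THEN NULL ELSE v::integer END;
$$ LANGUAGE sql IMMUTABLE;

It can then be used in place of the NULLIF() expression:

gpadmin=# insert into a1 select c1, empty_to_null_int(c2), c3 from a1_ext;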