When attempting to load empty strings ("") from a data file into numeric fields (e.g., integer) via a PXF-based external table, the load fails with the error message "data formatting errors."
Error Message:
gpadmin=# insert into a1 select * from a1_ext1;
NOTICE:  Found 2 data formatting errors (2 or more input rows). Rejected related input data.
The failure occurs because an empty string ("") is invalid input syntax for the target integer type.
RCA
The internal HEAP table and external table are defined as follows:
gpadmin=# \d a1
Append-Only Table "public.a1"
 Column |  Type   | Modifiers
--------+---------+-----------
 c1     | text    |
 c2     | integer |
 c3     | text    |

gpadmin=# \d a1_ext1
External table "public.a1_ext1"
 Column |  Type   | Modifiers
--------+---------+-----------
 c1     | text    |
 c2     | integer |
 c3     | text    |
Type: readable
Encoding: UTF8
Format type: csv
Format options: delimiter ',' null '' escape '"' quote '"' fill missing fields
External location: pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple
Segment reject limit: 10000 rows
Error table: err_a1_ext1
The raw data in the source file (ea1.dat) looks like this:
[gpadmin@hdm2 50074]$ cat ea1.dat
"12354.0","","N/A"
Since the rows are rejected, they are logged into the error table, from which the following information can be queried:
gpadmin=# select * from err_a1_ext1;
           cmdtime            | relname |                              filename                              | linenum | bytenum |                             errmsg                             |      rawdata       | rawbytes
------------------------------+---------+--------------------------------------------------------------------+---------+---------+----------------------------------------------------------------+--------------------+----------
 2017-02-28 10:30:09.57121+08 | a1_ext1 | pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple |       1 |         | invalid input syntax for integer: "", column c2                | "12354.0","","N/A" |
 2017-02-28 10:30:09.57121+08 | a1_ext1 | pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple |       2 |         | missing data for column "c2", found empty data line, column c2 |                    |
(2 rows)
It's apparent that the rejection reason is invalid input syntax for integer: "". In other words, HDB cannot implicitly convert an empty string into a valid numeric value.
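The same error can be reproduced directly in a psql session, without involving the external table, by casting an empty string to integer:

```sql
gpadmin=# SELECT ''::integer;
ERROR:  invalid input syntax for integer: ""
```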
Since HDB cannot perform this type conversion automatically, an explicit conversion must be done before the empty string is handed to HDB for the internal conversion:
1. Specify the "text" type for the empty-string field in the external table definition:
gpadmin=# \d a1_ext
External table "public.a1_ext"
 Column | Type | Modifiers
--------+------+-----------
 c1     | text |
 c2     | text |
 c3     | text |
Type: readable
Encoding: UTF8
Format type: csv
Format options: delimiter ',' null '' escape '"' quote '"' fill missing fields
External location: pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple
Segment reject limit: 10000 rows
Error table: err_a1_ext
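For reference, a DDL sketch that would produce an external table matching the \d output above (the host, port, file path, and profile are taken from this example; adjust them for your environment, and note that the exact LOG ERRORS syntax varies by HDB/Greenplum version):

```sql
CREATE EXTERNAL TABLE a1_ext (
    c1 text,
    c2 text,   -- declared as text so empty strings load without error
    c3 text
)
LOCATION ('pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple')
FORMAT 'CSV' (DELIMITER ',' NULL '' ESCAPE '"' QUOTE '"' FILL MISSING FIELDS)
ENCODING 'UTF8'
LOG ERRORS INTO err_a1_ext
SEGMENT REJECT LIMIT 10000 ROWS;
```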
2. Convert the empty string with the NULLIF() function before inserting the data into the internal heap table:
gpadmin=# insert into a1 select c1, nullif(c2,'')::integer, c3 from a1_ext;
INSERT 0 1
gpadmin=# select * from a1;
   c1    | c2 | c3
---------+----+-----
 12354.0 |    | N/A
(1 row)
Instead of using NULLIF(), an alternative is to write a UDF that first converts the empty string to NULL or 0.
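A minimal sketch of such a UDF in PL/pgSQL (the function name empty_to_int is hypothetical, and this assumes the PL/pgSQL language is available in your HDB installation):

```sql
CREATE OR REPLACE FUNCTION empty_to_int(v text) RETURNS integer AS $$
BEGIN
    -- Map empty strings (and NULLs) to NULL; use RETURN 0 here instead
    -- if 0 is the desired default value.
    IF v IS NULL OR v = '' THEN
        RETURN NULL;
    END IF;
    RETURN v::integer;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Usage, equivalent to the NULLIF() approach above:
-- INSERT INTO a1 SELECT c1, empty_to_int(c2), c3 FROM a1_ext;
```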