For example, we have a table:
gpadmin=# \d test_json_gpss
Table "public.test_json_gpss"
Column | Type | Modifiers
--------+-------------------+-----------
a | character varying |
b | character varying |
Distributed by: (a)
A yaml file:
$ cat test.yml
DATABASE: gpadmin
USER: gpadmin
PASSWORD: "SHADOW:Z22OWXJNC#######################3QUCPWX7VJA"
HOST: localhost
PORT: 3018
VERSION: 2
FILE:
INPUT:
SOURCE:
URL: file:///tmp/kafka_data.txt
VALUE:
COLUMNS:
- NAME: meta
TYPE: json
FORMAT: json
ERROR_LIMIT: 25
OUTPUT:
SCHEMA: public
TABLE: test_json_gpss
MODE: insert
MAPPING:
- NAME: a
EXPRESSION: (meta->>'a')::varchar(250)
- NAME: b
EXPRESSION: (meta->>'b')::varchar(250)
A data file:
$ cat kafka_data.txt
{"a": 1234433, "b":9898668}
{"a": 12343433, "b":9898668}
{"a": 12344433433, "b":9898668}
{"a": 1234412343233, "b":9898668}
{"a": 1234411233, "b":9898668}
{"a": 123445433, "b":9898668}
{"a": 123445321133, "b":9898668}
{"a": 123443333, "b":9898668}
{"a": 1234442233, "b":9898668}
Then we can find gpsscli command fail to load anything:
$ gpsscli remove test-load --gpss-port 5019; gpsscli submit --name test-load test.yml --gpss-port 5019 ; gpsscli start test-load --gpss-port 5019; gpsscli progress test-load --gpss-port 5019
20220405 13:01:40 [INFO] Remove job test-load successfully
JobName JobID Result Reason
test-load bb47e40b50a9a1242a96c45cf02f8e30 success
20220405 13:01:40 [INFO] JobID: bb47e40b50a9a1242a96c45cf02f8e30,JobName: test-load
20220405 13:01:40 [INFO] Job test-load is started
StartTime EndTime MsgNum MsgSize InsertedRecords RejectedRecords Speed
Job test-load stopped with error at 2022-04-05 13:01:40.779672623 +0000 UTC, reason: job finished successfully
-- No data is loaded
gpadmin=# select * from test_json_gpss ;
a | b
---------+---------
1234433 | 9898668
From the error table, we can see that the gpss is not reading the data line by line, but rather all in one block, leading to "invalid input syntax for type json" error:
$ tail -50 gpsslogs/gpss_20220405.log | grep gp_read_persistent_error_log
20220405 12:55:51.80557,18660,debug,"SELECT COUNT(*) FROM gp_read_persistent_error_log('""public"".""gpfileloadext_9faee9641015e5a07ab32906e7b7b643""') WHERE cmdtime >= '2022-04-05 12:55:51.789+00' AND cmdtime <= '2022-04-05 12:55:51.805+00'"
20220405 12:56:11.70120,18660,debug,"SELECT COUNT(*) FROM gp_read_persistent_error_log('""public"".""gpfileloadext_9dfb1e09bac2e1fbca35b2d3db4d319d""') WHERE cmdtime >= '2022-04-05 12:56:11.682+00' AND cmdtime <= '2022-04-05 12:56:11.697+00'"
20220405 13:01:40.77192,18660,debug,"SELECT COUNT(*) FROM gp_read_persistent_error_log('""public"".""gpfileloadext_985080d41e8a0600ed3b1f266fbd01fb""') WHERE cmdtime >= '2022-04-05 13:01:40.754+00' AND cmdtime <= '2022-04-05 13:01:40.771+00'"
gpadmin=# SELECT cmdtime, errmsg, COALESCE(rawdata, encode(rawbytes, 'escape')) FROM gp_read_persistent_error_log('public.gpfileloadext_985080d41e8a0600ed3b1f266fbd01fb') ;
cmdtime | errmsg | coalesce
-------------------------------+------------------------------------+-----------------------------------
2022-04-05 13:01:40.767395+00 | invalid input syntax for type json | {"a": 1234433, "b":9898668} +
| | {"a": 12343433, "b":9898668} +
| | {"a": 12344433433, "b":9898668} +
| | {"a": 1234412343233, "b":9898668}+
| | {"a": 1234411233, "b":9898668} +
| | {"a": 123445433, "b":9898668} +
| | {"a": 123445321133, "b":9898668} +
| | {"a": 123443333, "b":9898668} +
| | {"a": 1234442233, "b":9898668}
The cause of the issue is that gpss is not handling multiple lines of json data, It reads the multiple lines as a single chunk of json data.