For example, we have a table:
gpadmin=# \d test_json_gpss Table "public.test_json_gpss" Column | Type | Modifiers --------+-------------------+----------- a | character varying | b | character varying | Distributed by: (a)
A yaml file:
$ cat test.yml DATABASE: gpadmin USER: gpadmin PASSWORD: "SHADOW:Z22OWXJNC#######################3QUCPWX7VJA" HOST: localhost PORT: 3018 VERSION: 2 FILE: INPUT: SOURCE: URL: file:///tmp/kafka_data.txt VALUE: COLUMNS: - NAME: meta TYPE: json FORMAT: json ERROR_LIMIT: 25 OUTPUT: SCHEMA: public TABLE: test_json_gpss MODE: insert MAPPING: - NAME: a EXPRESSION: (meta->>'a')::varchar(250) - NAME: b EXPRESSION: (meta->>'b')::varchar(250)
A data file:
$ cat kafka_data.txt {"a": 1234433, "b":9898668} {"a": 12343433, "b":9898668} {"a": 12344433433, "b":9898668} {"a": 1234412343233, "b":9898668} {"a": 1234411233, "b":9898668} {"a": 123445433, "b":9898668} {"a": 123445321133, "b":9898668} {"a": 123443333, "b":9898668} {"a": 1234442233, "b":9898668}
Then we can find gpsscli command fail to load anything:
$ gpsscli remove test-load --gpss-port 5019; gpsscli submit --name test-load test.yml --gpss-port 5019 ; gpsscli start test-load --gpss-port 5019; gpsscli progress test-load --gpss-port 5019 20220405 13:01:40 [INFO] Remove job test-load successfully JobName JobID Result Reason test-load bb47e40b50a9a1242a96c45cf02f8e30 success 20220405 13:01:40 [INFO] JobID: bb47e40b50a9a1242a96c45cf02f8e30,JobName: test-load 20220405 13:01:40 [INFO] Job test-load is started StartTime EndTime MsgNum MsgSize InsertedRecords RejectedRecords Speed Job test-load stopped with error at 2022-04-05 13:01:40.779672623 +0000 UTC, reason: job finished successfully -- No data is loaded gpadmin=# select * from test_json_gpss ; a | b ---------+--------- 1234433 | 9898668
From the error table, we can see that the gpss is not reading the data line by line, but rather all in one block, leading to "invalid input syntax for type json" error:
$ tail -50 gpsslogs/gpss_20220405.log | grep gp_read_persistent_error_log 20220405 12:55:51.80557,18660,debug,"SELECT COUNT(*) FROM gp_read_persistent_error_log('""public"".""gpfileloadext_9faee9641015e5a07ab32906e7b7b643""') WHERE cmdtime >= '2022-04-05 12:55:51.789+00' AND cmdtime <= '2022-04-05 12:55:51.805+00'" 20220405 12:56:11.70120,18660,debug,"SELECT COUNT(*) FROM gp_read_persistent_error_log('""public"".""gpfileloadext_9dfb1e09bac2e1fbca35b2d3db4d319d""') WHERE cmdtime >= '2022-04-05 12:56:11.682+00' AND cmdtime <= '2022-04-05 12:56:11.697+00'" 20220405 13:01:40.77192,18660,debug,"SELECT COUNT(*) FROM gp_read_persistent_error_log('""public"".""gpfileloadext_985080d41e8a0600ed3b1f266fbd01fb""') WHERE cmdtime >= '2022-04-05 13:01:40.754+00' AND cmdtime <= '2022-04-05 13:01:40.771+00'" gpadmin=# SELECT cmdtime, errmsg, COALESCE(rawdata, encode(rawbytes, 'escape')) FROM gp_read_persistent_error_log('public.gpfileloadext_985080d41e8a0600ed3b1f266fbd01fb') ; cmdtime | errmsg | coalesce -------------------------------+------------------------------------+----------------------------------- 2022-04-05 13:01:40.767395+00 | invalid input syntax for type json | {"a": 1234433, "b":9898668} + | | {"a": 12343433, "b":9898668} + | | {"a": 12344433433, "b":9898668} + | | {"a": 1234412343233, "b":9898668}+ | | {"a": 1234411233, "b":9898668} + | | {"a": 123445433, "b":9898668} + | | {"a": 123445321133, "b":9898668} + | | {"a": 123443333, "b":9898668} + | | {"a": 1234442233, "b":9898668}
The cause of the issue is that gpss is not handling multiple lines of json data, It reads the multiple lines as a single chunk of json data.