gpss failed to handle source file with multiple lines of json data with error message "invalid input syntax for type json"
search cancel

gpss failed to handle source file with multiple lines of json data with error message "invalid input syntax for type json"

book

Article ID: 296753

calendar_today

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

For example, we have a table:

gpadmin=# \d test_json_gpss
     Table "public.test_json_gpss"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 a      | character varying |
 b      | character varying |
Distributed by: (a)

A yaml file:

$ cat test.yml
DATABASE: gpadmin
USER: gpadmin
PASSWORD: "SHADOW:Z22OWXJNC#######################3QUCPWX7VJA"
HOST: localhost
PORT: 3018
VERSION: 2
FILE:
    INPUT:
       SOURCE:
         URL: file:///tmp/kafka_data.txt
       VALUE:
         COLUMNS:
            - NAME: meta
              TYPE: json
         FORMAT: json
       ERROR_LIMIT: 25
    OUTPUT:
       SCHEMA: public
       TABLE: test_json_gpss
       MODE: insert
       MAPPING:
          - NAME: a
            EXPRESSION: (meta->>'a')::varchar(250)
          - NAME: b
            EXPRESSION: (meta->>'b')::varchar(250)

A data file:

$ cat kafka_data.txt
{"a": 1234433, "b":9898668}
{"a": 12343433, "b":9898668}
{"a": 12344433433, "b":9898668}
{"a": 1234412343233, "b":9898668}
{"a": 1234411233, "b":9898668}
{"a": 123445433, "b":9898668}
{"a": 123445321133, "b":9898668}
{"a": 123443333, "b":9898668}
{"a": 1234442233, "b":9898668}

Then we can find gpsscli command fail to load anything:

$ gpsscli remove test-load --gpss-port 5019; gpsscli submit --name test-load test.yml --gpss-port 5019 ; gpsscli start test-load --gpss-port 5019; gpsscli progress test-load --gpss-port 5019
20220405 13:01:40 [INFO] Remove job test-load successfully
JobName                             JobID                               Result  Reason
test-load                           bb47e40b50a9a1242a96c45cf02f8e30    success
20220405 13:01:40 [INFO] JobID: bb47e40b50a9a1242a96c45cf02f8e30,JobName: test-load
20220405 13:01:40 [INFO] Job test-load is started
StartTime			EndTime				MsgNum		MsgSize		InsertedRecords	RejectedRecords	Speed
Job test-load stopped with error at 2022-04-05 13:01:40.779672623 +0000 UTC, reason: job finished successfully

-- No data is loaded

gpadmin=# select * from test_json_gpss ;
    a    |    b
---------+---------
 1234433    | 9898668

From the error table, we can see that the gpss is not reading the data line by line, but rather all in one block, leading to "invalid input syntax for type json" error:

$ tail -50 gpsslogs/gpss_20220405.log  | grep gp_read_persistent_error_log
20220405 12:55:51.80557,18660,debug,"SELECT COUNT(*) FROM gp_read_persistent_error_log('""public"".""gpfileloadext_9faee9641015e5a07ab32906e7b7b643""') WHERE cmdtime >= '2022-04-05 12:55:51.789+00' AND cmdtime <= '2022-04-05 12:55:51.805+00'"
20220405 12:56:11.70120,18660,debug,"SELECT COUNT(*) FROM gp_read_persistent_error_log('""public"".""gpfileloadext_9dfb1e09bac2e1fbca35b2d3db4d319d""') WHERE cmdtime >= '2022-04-05 12:56:11.682+00' AND cmdtime <= '2022-04-05 12:56:11.697+00'"
20220405 13:01:40.77192,18660,debug,"SELECT COUNT(*) FROM gp_read_persistent_error_log('""public"".""gpfileloadext_985080d41e8a0600ed3b1f266fbd01fb""') WHERE cmdtime >= '2022-04-05 13:01:40.754+00' AND cmdtime <= '2022-04-05 13:01:40.771+00'"

gpadmin=# SELECT cmdtime, errmsg, COALESCE(rawdata, encode(rawbytes, 'escape')) FROM gp_read_persistent_error_log('public.gpfileloadext_985080d41e8a0600ed3b1f266fbd01fb') ;
            cmdtime            |               errmsg               |             coalesce
-------------------------------+------------------------------------+-----------------------------------
 2022-04-05 13:01:40.767395+00 | invalid input syntax for type json | {"a": 1234433, "b":9898668}      +
                               |                                    | {"a": 12343433, "b":9898668}     +
                               |                                    | {"a": 12344433433, "b":9898668}  +
                               |                                    | {"a": 1234412343233, "b":9898668}+
                               |                                    | {"a": 1234411233, "b":9898668}   +
                               |                                    | {"a": 123445433, "b":9898668}    +
                               |                                    | {"a": 123445321133, "b":9898668} +
                               |                                    | {"a": 123443333, "b":9898668}    +
                               |                                    | {"a": 1234442233, "b":9898668}

The cause of the issue is that gpss is not handling multiple lines of json data, It reads the multiple lines as a single chunk of json data.

Environment

Product Version: 6.18

Resolution

Fix

Upgrade to gpss 1.7.2 and use the FORMAT "jsonl" in yaml file to handle such a file.