GPSS job unexpectedly exits when there is a segment failure

Article ID: 295220

Products

VMware Tanzu Greenplum

Issue/Introduction

Issue
When a segment goes down, the GPSS job may also exit with the following messages:

20230921 07:20:32.89143,343,debug,All workers of kafka reader gpopdata_json are finished
20230921 07:20:32.89144,343,debug,Channel gpopdata_json is closed
20230921 07:20:32.90480,343,debug,Close kafka reader 0
20230921 07:20:32.91796,343,debug,Close kafka reader 1
20230921 07:20:32.92776,343,debug,Close kafka reader 2
20230921 07:20:32.92781,343,debug,"start job gpopdata_json failed, Failed to execute batch: pq: FTS detected connection lost during dispatch to seg16 10.101.19.15:6000 pid=24045:"
20230921 07:20:32.92783,343,warning,"retry job gpopdata_json is disabled, stop schedule"
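
To check whether a job has hit this condition, the GPSS log can be scanned for the warning shown above. The following is a minimal sketch, assuming the log lines follow the comma-separated layout seen in the excerpt (timestamp, pid, level, message); the function name and the sample data are illustrative and not part of GPSS.

```python
import csv
import re

# Pattern taken from the warning line in the excerpt above.
RETRY_DISABLED = re.compile(r"retry job (\S+) is disabled, stop schedule")


def jobs_with_disabled_retry(log_lines):
    """Return job names, in order of first appearance, reported with retry disabled."""
    jobs = []
    for row in csv.reader(log_lines):
        # Assumed layout: timestamp, pid, level, message
        if len(row) < 4:
            continue
        level = row[2]
        # Re-join in case an unquoted message itself contained commas.
        message = ",".join(row[3:])
        match = RETRY_DISABLED.search(message)
        if level == "warning" and match and match.group(1) not in jobs:
            jobs.append(match.group(1))
    return jobs


sample = [
    '20230921 07:20:32.92776,343,debug,Close kafka reader 2',
    '20230921 07:20:32.92781,343,debug,"start job gpopdata_json failed, '
    'Failed to execute batch: pq: FTS detected connection lost during '
    'dispatch to seg16 10.101.19.15:6000 pid=24045:"',
    '20230921 07:20:32.92783,343,warning,"retry job gpopdata_json is disabled, stop schedule"',
]
print(jobs_with_disabled_retry(sample))
```

A job name returned here suggests that the job stopped because its retry settings were not in effect, not because the configured retries were exhausted.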


This is caused by a known defect in GPSS versions earlier than 1.11: GPSS loses the retry information defined in the [SCHEDULE] section of the gpss.yaml file when operations are run in the following order:
1. Submit the GPSS job (the retry info is initialized at this point)
2. Start the GPSS job
3. Stop the GPSS job (the retry info is cleared in this step and never recovered)
4. Start the GPSS job again (the retry info is not correctly loaded)

Therefore, a likely scenario in a customer's environment is the following: the GPSS job was restarted at some point without being resubmitted, for example because the job had previously errored out and failed. After that, when a segment fails, the connection between the segment and GPSS is lost immediately, and because the retry settings are no longer in effect, the GPSS job unexpectedly exits with the above messages.


Resolution

Workaround
The temporary workaround is to resubmit the job before restarting it, e.g.:
1. Submit the GPSS job
2. Start the GPSS job
3. Stop the GPSS job
4. Resubmit the GPSS job yaml file
5. Start the GPSS job
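
The restart part of the steps above (stop, resubmit, start) can be sketched as a small helper that builds the corresponding `gpsscli` invocations. This is a sketch under assumptions: the job name and yaml path are placeholders, connection options for the GPSS server are omitted, and whether a stopped job can be resubmitted under the same name directly (or must first be removed) may depend on the GPSS version, so verify that against your installation.

```python
import subprocess


def workaround_commands(job_name, yaml_path):
    """Build the stop -> resubmit -> start sequence as argument lists."""
    return [
        ["gpsscli", "stop", job_name],
        # Resubmitting is intended to reload the retry settings from the yaml file.
        ["gpsscli", "submit", "--name", job_name, yaml_path],
        ["gpsscli", "start", job_name],
    ]


def run_workaround(job_name, yaml_path, dry_run=True):
    """Print the commands by default; execute them only when dry_run is False."""
    for cmd in workaround_commands(job_name, yaml_path):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sequence if any step fails.
            subprocess.run(cmd, check=True)


# Hypothetical job name and file path, used for illustration only.
run_workaround("gpopdata_json", "gpopdata_json.yaml")
```

The dry-run default is deliberate: it lets the sequence be reviewed before anything is executed against a running GPSS instance.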

Solution
The solution is to upgrade to GPSS 1.11 or later.
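
When checking many installations, the version boundary is easy to get wrong with plain string comparison (as text, "1.9" sorts after "1.11"). The following is a minimal sketch of a numeric check, assuming the installed version is already available as a dotted numeric string; how that string is obtained is not covered here, and the function names are illustrative.

```python
def parse_version(text):
    """Turn '1.10.4' into (1, 10, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(gpss_version):
    # Per this article, releases earlier than 1.11 carry the defect.
    return parse_version(gpss_version)[:2] < (1, 11)


print(is_affected("1.10.4"))  # True
print(is_affected("1.11.0"))  # False
```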