Memory consumption by the GPSS server process is significant and keeps growing gradually in a production environment, which could eventually lead to an out-of-memory (OOM) situation.
Memory growth statistics
Date | Max memory used (% of total)
20231109 | 32GB (12.5%)
20231110 | 46GB (17.9%)
20231111 | 63GB (24.6%)
20231112 | 79GB (30.8%)
20231113 | 104GB (40.6%)
20231114 | 114GB (44.5%)
20231115 | 148GB (57.8%)
In particular, both the RSS and VSZ of the GPSS server process keep growing gradually, as observed by collecting the process memory map every hour with the script below.
#!/bin/bash
# Collect GPSS process memory details (pmap, smaps, Vm* counters) every hour.
GPSS_PID=<pid of gpss>   # replace with the actual GPSS server PID
INTERVAL=3600
while true
do
    date >> pmap-info-gpss.out
    pmap -x ${GPSS_PID} >> pmap-info-gpss.out
    date >> smaps-info-gpss.out
    cat /proc/${GPSS_PID}/smaps >> smaps-info-gpss.out
    date >> vmem-info-gpss.out
    grep -E "Vm(Peak|Size|RSS|Swap|Data)" /proc/${GPSS_PID}/status >> vmem-info-gpss.out
    sleep ${INTERVAL}
done
This is now a serious situation: if the GPSS service stops unexpectedly because of OOM or this memory growth, the application could become unavailable and data migrated from Kafka could be lost or left inconsistent. What is the workaround to avoid this situation, and what are the cause and resolution for this issue?
As a workaround, you can avoid this issue by restarting GPSS, since a restart releases the previously held memory and allocates fresh virtual memory. We therefore recommend restarting it once a day, depending on your workload.
This issue is caused by an internal bug and will be fixed in GPSS 1.10.5 and later versions; the GPSS engineering team is addressing it at the source code level.
If you would like to confirm whether your issue matches this one, please collect the following artifacts and open a ticket with VMware Tanzu Support to provide them.
[ Investigation Steps ]
1) Add DebugPort to the "ListenAddress" section of the GPSS configuration JSON file:
"ListenAddress": {
"Host": "",
"Port": 51007,
"SSL": false,
"DebugPort": 9999
},
"Gpfdist": {
"Host": "",
"Port": 8419
}
}
2) Stop the current GPSS server process.
3) Start GPSS again with the JSON configuration above.
4) Run the following script to collect heap and goroutine pprof profiles from the GPSS process every hour.
#!/bin/bash
# Fetch heap and goroutine pprof profiles from the GPSS debug port every hour,
# writing each snapshot to a timestamped file.
INTERVAL=3600
while true
do
    curl -s http://127.0.0.1:9999/debug/pprof/heap > $(date +%Y-%m-%d-%H-%M)-heap.out
    curl -s http://127.0.0.1:9999/debug/pprof/goroutine > $(date +%Y-%m-%d-%H-%M)-goroutine.out
    sleep ${INTERVAL}
done
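For context, the /debug/pprof/heap and /debug/pprof/goroutine URLs above are the standard Go net/http/pprof endpoints that a Go process exposes when a debug listener is enabled. The sketch below is a generic illustration of that mechanism, not GPSS source code; the address 127.0.0.1:9999 is only chosen to match the DebugPort configured above.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on the debug port so heap and goroutine
	// profiles can be fetched with curl, e.g.:
	//   curl http://127.0.0.1:9999/debug/pprof/heap
	log.Fatal(http.ListenAndServe("127.0.0.1:9999", nil))
}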
[ Root Cause Analysis ]
GPSS supports monitoring job status, which is implemented by the MonitorJob() gRPC function. In the customer environment, however, GPSS memory keeps increasing after a gpsscli monitor client has stopped. The reason is that when the gpsscli client exits, the channel created in listen.notifier is not deleted. Each Kafka batch then emits a new event to the notifier and creates a goroutine that tries to send the event to the old channel, which causes a goroutine leak. The underlying cause is that sync.Map is strict about key types: the channel ch is stored with n.forkers.Store(ch, true), so the delete must use a key of exactly the same type, n.forkers.Delete(ch). Otherwise the channel is not removed from notifier.forkers by Unfork().
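To illustrate the sync.Map behavior described above, the following standalone sketch (not GPSS code; the channel element type and names are made up, and the receive-only conversion is just one example of a type mismatch) shows that a key stored under one channel type is not removed when Delete() is called with the same channel held under a different type, because interface key comparison requires identical dynamic types:

package main

import (
	"fmt"
	"sync"
)

func main() {
	var forkers sync.Map // illustrative stand-in for notifier.forkers

	ch := make(chan int)
	forkers.Store(ch, true) // key has dynamic type "chan int"

	// The same channel, but held under a different (receive-only) type.
	var ro <-chan int = ch
	forkers.Delete(ro) // no-op: key has dynamic type "<-chan int", which does not match

	if _, ok := forkers.Load(ch); ok {
		fmt.Println("entry still present after mismatched Delete") // this prints
	}

	forkers.Delete(ch) // key type matches, so the entry is actually removed
	if _, ok := forkers.Load(ch); !ok {
		fmt.Println("entry removed after matching Delete") // this prints
	}
}

If a stale channel stays in the map like this, every subsequent event spawns a goroutine that blocks forever trying to send to a channel nobody reads, which matches the goroutine and memory growth observed here.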