greenplum system freeze (SNMP inform/trap )

Article ID: 296462

Products

VMware Tanzu Greenplum

Issue/Introduction

The customer experienced cluster freezes lasting a few minutes at regular intervals. Everything hung, including SELECT pg_backend_pid(). Following the master log with tail -f showed that each time the system halted, the last message logged was the SNMP line below. The next entry appeared only after the system resumed:
2020-06-04 19:44:36.505576 WIB,,,p22629,th579106592,,,,0,,,seg-1,,,,,"DEBUG1","00000","SNMP inform/trap alerts are disabled",,,,,,,,"send_snmp_inform_or_trap","sendalert.c",231,
2020-06-04 19:46:50.729581 WIB,"prd_etl","CVMDM",p8822,th579106592,"10.24.162.17","50437",2020-06-04 17:03:55 WIB,0,con1146,cmd2,seg-1,,,,,"LOG","08P01","unexpected EOF on client connection",,,,,,,0,,"postgres.c",451,
As the example above shows, the system was processing queries normally, then froze for over 2 minutes (from 19:44:36 to 19:46:50) before resuming.
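Because the stalls are intermittent, it can help to scan the whole master log for them instead of watching it with tail -f. The function below is a minimal sketch (find_log_gaps is a hypothetical helper, not a Greenplum utility). It assumes the CSV log format shown above, where the first field is the timestamp, and prints each pair of consecutive entries that are more than a given number of seconds apart.

```shell
#!/usr/bin/env bash

# find_log_gaps [THRESHOLD]
# Hypothetical helper: reads Greenplum CSV log lines on stdin and prints every
# pair of consecutive entries more than THRESHOLD seconds apart (default: 60).
# Assumes the first CSV field is a timestamp such as
# "2020-06-04 19:44:36.505576 WIB". Entries on different dates are not
# compared, so a stall that spans midnight is not reported.
find_log_gaps() {
  local threshold=${1:-60}
  awk -F',' -v limit="$threshold" '
    # Only lines that start with a YYYY-MM-DD timestamp begin a log entry;
    # continuation lines of multi-line messages are skipped.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
      split($1, ts, " ")            # ts[1] = date, ts[2] = HH:MM:SS.ffffff
      split(ts[2], hms, ":")
      secs = hms[1] * 3600 + hms[2] * 60 + hms[3]
      if (ts[1] == prev_date && secs - prev_secs > limit + 0)
        printf "%.0f s gap: %s -> %s\n", secs - prev_secs, prev_ts, $1
      prev_date = ts[1]; prev_secs = secs; prev_ts = $1
    }
  '
}
```

Run against the two entries above, it reports a 134-second gap. On a live system, assuming the default master log location, something like `cat "$MASTER_DATA_DIRECTORY"/pg_log/gpdb-*.csv | find_log_gaps 60` lists every stall longer than a minute; regular spacing between the reported gaps points to a periodic cause such as alerting.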

Environment

Product Version: 5.21

Resolution

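The issue was traced to the master logger process. The configuration listed multiple addresses in gp_email_to, the recipients of email alerts.
The cluster appears to freeze while the logger waits for the mail server to respond. Because server processes can block while sending log messages once the logger falls behind, a stalled logger stalls every session. The pstack output of the master logger process (pid 22629, the process that wrote the SNMP message above) shows it waiting in libcurl inside send_alert_via_email():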
$ pstack 22629
#0  0x0000003cb70df358 in poll () from /lib64/libc.so.6
#1  0x00007f4c23be0a36 in Curl_poll () from /usr/local/greenplum-db-5.21.4/lib/libcurl.so.4
#2  0x00007f4c23bdc74b in curl_multi_wait () from /usr/local/greenplum-db-5.21.4/lib/libcurl.so.4
#3  0x00007f4c23bd6695 in curl_easy_perform () from /usr/local/greenplum-db-5.21.4/lib/libcurl.so.4
#4  0x00000000007eb31c in send_alert_via_email ()
#5  0x00000000007ebdc6 in send_alert ()
#6  0x00000000007eca0b in send_alert_from_chunks ()
#7  0x00000000007e6e00 in syslogger_log_chunk_list ()
#8  0x00000000007e7acc in SysLoggerMain.isra.6 ()
#9  0x00000000007e8434 in SysLogger_Start ()
#10 0x00000000007df9e7 in PostmasterMain ()
#11 0x0000000000717be7 in main ()
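To confirm that a later stall matches this pattern, you can sample the logger's stack while the cluster is hung. The helper below is a hypothetical sketch: it reads pstack output on stdin and succeeds only if both send_alert_via_email and curl_easy_perform appear, which is the signature in the trace above. It does not check frame order, so treat a match as a strong hint rather than proof.

```shell
#!/usr/bin/env bash

# logger_blocked_on_email
# Hypothetical helper: reads pstack output on stdin and exits 0 if the stack
# contains the email alert path (send_alert_via_email) waiting in libcurl
# (curl_easy_perform), as in the master logger trace in this article.
logger_blocked_on_email() {
  local stack
  stack=$(cat)
  grep -q 'send_alert_via_email' <<<"$stack" &&
    grep -q 'curl_easy_perform' <<<"$stack"
}
```

For example, `pstack 22629 | logger_blocked_on_email && echo "logger blocked sending an email alert"`. Here 22629 is the logger pid from this case; on another system, the pid in the p<pid> field of the SNMP log line identified the logger in this case and may do so again, but verify it with ps first.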

Workaround:

Reduce the number of addresses in gp_email_to. With only one or two destination addresses, the system returned to normal.
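As a sketch of applying this workaround, the recipient list can be inspected and shortened with gpconfig and reloaded with gpstop -u. The address dba@example.com is a placeholder, and the commands assume gp_email_to can be changed with a configuration reload; if your version requires a restart, use gpstop -r instead.

```shell
# Show the current gp_email_to setting.
gpconfig -s gp_email_to

# Replace the list with a single recipient (placeholder address).
# The inner single quotes store the value as a quoted string in postgresql.conf.
gpconfig -c gp_email_to -v "'dba@example.com'"

# Reload the configuration files without restarting the cluster.
gpstop -u
```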

Permanent fix

Use Greenplum Command Center (GPCC) alerts instead of email alerts, or upgrade to Greenplum 6.