When there is a network issue between hosts of the Greenplum cluster, a query or function may report an error like: “interconnect may encounter a network error, please check your network”.
In the current release of Greenplum, a code change was implemented (Link of the PR). With this change in place, a query (this usually happens with functions) might hang after reporting the above error. This issue is triggered only when certain conditions are met.
This means that when a user notices an “interconnect error”, it usually leads to one of two results:
1. The query may still finish eventually, even though it reported the interconnect error. In that case, the issue is not caused by the PR mentioned above; it is a pure network issue.
2. The query may hang forever; this is due to a code defect in the Greenplum database.
Product Version: 6.23
Both scenarios above are related to an unstable network. To confirm the network issue, please follow the steps below:
Option#1: Enable debug-level logging:
set gp_log_interconnect to debug;
set log_min_messages to debug5;
Once those settings are enabled at the session level, re-run the query/function and review the logs. We should see messages like the following:
1. pruned the cursorHistoryTable
2023-08-25 04:06:37.541741...."DEBUG1","00000","prune cursor history table (count 257), icid 301",,,,,"SQL statement ...
2. GOT A MISMATCH PACKET WITH ID xx
2023-08-25 04:10:38.921353 EDT,..."LOG","00000","GOT A MISMATCH PACKET WITH ID 4 HISTORY HAS NO RECORD",,,,,,,0,,,,
$ grep 'mismatched packet received' con112459-seg74.txt | wc -l
9689
3. We can also see errors like “ack with bad seq”
In the example below, “expected (1, 1] got 1” means we have already received the ACK for “1”, but we still keep getting “1”:
2023-08-25 04:02:36.402787 EDT........,"ack with bad seq?! expected (1, 1] got 1 flags 0x8b, capacity 15 consumedSeq 0",,,,,,,0,,,,
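As a rough way to quantify these messages across the logs (the file names below reuse the con112459-seg*.txt example from above; adjust them to your own segment log files), something like the following can be used:
# Count each signature in the segment log files (file names are examples only):
$ grep -c 'prune cursor history table' con112459-seg*.txt
$ grep -c 'MISMATCH PACKET' con112459-seg*.txt
$ grep -c 'ack with bad seq' con112459-seg*.txt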
Option#2: Run an iperf3 test between hosts.
For example, if we see an error like:
WARNING: interconnect may encountered a network error, please check your network (seg1 slice1 192.168.1.1:6000 pid=xxx)
DETAIL: Failed to send packet (seq xx) to 192.168.1.2:12345 (pid xxxx cid -1) after 100 retries.
then the source is 192.168.1.1 and the destination is 192.168.1.2; run iperf3 as shown in the command below.
(NOTE: please run the test from 192.168.1.1 to 192.168.1.2, and also run the same test from 192.168.1.2 to 192.168.1.1.)
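A minimal sketch of the test, assuming iperf3 is installed on both hosts; UDP mode (-u) matches the Lost/Total Datagrams output shown below, and the bandwidth/duration values are only examples:
# On the destination host (192.168.1.2), start an iperf3 server:
$ iperf3 -s
# On the source host (192.168.1.1), run a UDP test toward the destination
# (-u = UDP, -b 1G = target bandwidth, -t 10 = run for 10 seconds):
$ iperf3 -c 192.168.1.2 -u -b 1G -t 10
# Then repeat in the opposite direction (server on 192.168.1.1, client on 192.168.1.2).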
If the packet drop rate keeps showing non-zero values like below:
[ 5] local xxxxxxx port xxxxx connected to xxxxx port xxxxx
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 5] 0.00-1.00 sec 94.2 MBytes 791 Mbits/sec 0.003 ms 1944/14008 (14%)
[ 5] 1.00-2.00 sec 109 MBytes 913 Mbits/sec 0.010 ms 1314/15241 (8.6%)
[ 5] 2.00-3.00 sec 107 MBytes 895 Mbits/sec 0.002 ms 1621/15273 (11%)
[ 5] 3.00-4.00 sec 109 MBytes 912 Mbits/sec 0.003 ms 1349/15269 (8.8%)
then it means there is a network issue in the cluster.
How to fix this:
1. Please fix the network issue and then test the query again.
2. Based on experience, we found that tuning the OS parameters below may help get rid of the packet drop issue in the network (see the sketch after this list for one way to apply them):
net.core.rmem_default = 25165824
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 16777216 25165824 33554432
net.ipv4.udp_rmem_min = 16777216
The production team is currently investigating the above OS settings internally. We might update this document with recommendations in the future.
3. R&D will fix the “query hangs forever” issue (result #2) in future releases of Greenplum (target releases are 6.26 and 6.25.3).
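As a sketch of how such kernel parameters are typically applied on each host (assuming root/sudo access; the file name /etc/sysctl.d/90-gpdb-net.conf is just an example, not an official Greenplum file):
# Persist the settings in a sysctl drop-in file (example file name):
$ sudo tee /etc/sysctl.d/90-gpdb-net.conf <<'EOF'
net.core.rmem_default = 25165824
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 16777216 25165824 33554432
net.ipv4.udp_rmem_min = 16777216
EOF
# Load the settings into the running kernel and verify:
$ sudo sysctl --system
$ sysctl net.core.rmem_max net.ipv4.udp_rmem_min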