When there is a network issue between hosts of the Greenplum cluster, a query or function may report an error like: “interconnect may encounter a network error, please check your network”.
In the current release of Greenplum, a code change was implemented (Link of the PR). With this change in place, a query (this usually happens with functions) might hang after reporting the above error. This issue is triggered only when certain conditions are met.
This means that when a user notices an “interconnect error”, it usually leads to one of two results:
1. The query may still finish eventually, even though it reported the interconnect error. In that case, the issue is not caused by the PR mentioned above; it is a pure network issue.
2. The query may hang forever; this is due to a code defect in the Greenplum database.
Product Version: 6.23
Both scenarios above are related to an unstable network. To confirm the network issue, please follow the steps below:
Option#1: Enable debug-level logging:
set gp_log_interconnect to debug;
set log_min_messages to debug5;
Once those settings are enabled at the session level, re-run the query/function and review the logs. We should see messages like the following:
1. pruned the cursorHistoryTable
2023-08-25 04:06:37.541741...."DEBUG1","00000","prune cursor history table (count 257), icid 301",,,,,"SQL statement ...
2. GOT A MISMATCH PACKET WITH ID xx
2023-08-25 04:10:38.921353 EDT,..."LOG","00000","GOT A MISMATCH PACKET WITH ID 4 HISTORY HAS NO RECORD",,,,,,,0,,,,
$ grep 'mismatched packet received' con112459-seg74.txt | wc -l
9689
3. We can also see errors like “ack with bad seq”
In the example below, “expected (1, 1] got 1” means we have already received the ACK for “1”, but we still keep getting “1”:
2023-08-25 04:02:36.402787 EDT........,"ack with bad seq?! expected (1, 1] got 1 flags 0x8b, capacity 15 consumedSeq 0",,,,,,,0,,,,
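As a rough way to quantify these messages across the logs (the file names below reuse the con112459-seg*.txt example from above; adjust them to your own segment log files), something like the following can be used:
# Count each signature in the segment log files (file names are examples only):
$ grep -c 'prune cursor history table' con112459-seg*.txt
$ grep -c 'MISMATCH PACKET' con112459-seg*.txt
$ grep -c 'ack with bad seq' con112459-seg*.txt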
Option#2: Run an iperf3 test between hosts.
For example, if we see an error like:
WARNING: interconnect may encountered a network error, please check your network (seg1 slice1 192.168.1.1:6000 pid=xxx)
DETAIL: Failed to send packet (seq xx) to 192.168.1.2:12345 (pid xxxx cid -1) after 100 retries.
then the source is 192.168.1.1 and the destination is 192.168.1.2; run iperf3 as shown in the command below.
(NOTE: please run the test from 192.168.1.1 to 192.168.1.2, and also run the same test from 192.168.1.2 to 192.168.1.1.)
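A minimal sketch of the test, assuming iperf3 is installed on both hosts; UDP mode (-u) matches the Lost/Total Datagrams output shown below, and the bandwidth/duration values are only examples:
# On the destination host (192.168.1.2), start an iperf3 server:
$ iperf3 -s
# On the source host (192.168.1.1), run a UDP test toward the destination
# (-u = UDP, -b 1G = target bandwidth, -t 10 = run for 10 seconds):
$ iperf3 -c 192.168.1.2 -u -b 1G -t 10
# Then repeat in the opposite direction (server on 192.168.1.1, client on 192.168.1.2).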
If the packet drop rate keeps showing non-zero values like below:
[ 5] local xxxxxxx port xxxxx connected to xxxxx port xxxxx
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 5] 0.00-1.00 sec 94.2 MBytes 791 Mbits/sec 0.003 ms 1944/14008 (14%)
[ 5] 1.00-2.00 sec 109 MBytes 913 Mbits/sec 0.010 ms 1314/15241 (8.6%)
[ 5] 2.00-3.00 sec 107 MBytes 895 Mbits/sec 0.002 ms 1621/15273 (11%)
[ 5] 3.00-4.00 sec 109 MBytes 912 Mbits/sec 0.003 ms 1349/15269 (8.8%)
then it means there is a network issue in the cluster.
How to fix this:
1. Please fix the network issue and then test the query again.
2. Based on experience, we found that tuning the OS parameters below may help get rid of the packet drop issue in the network (see the sketch after this list for one way to apply them):
net.core.rmem_default = 25165824
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 16777216 25165824 33554432
net.ipv4.udp_rmem_min = 16777216
The production team is currently investigating the above OS settings internally. We might update this document with recommendations in the future.
3. R&D will fix the “query hangs forever” issue (result #2) in future releases of Greenplum (target releases are 6.26 and 6.25.3).
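As a sketch of how such kernel parameters are typically applied on each host (assuming root/sudo access; the file name /etc/sysctl.d/90-gpdb-net.conf is just an example, not an official Greenplum file):
# Persist the settings in a sysctl drop-in file (example file name):
$ sudo tee /etc/sysctl.d/90-gpdb-net.conf <<'EOF'
net.core.rmem_default = 25165824
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 16777216 25165824 33554432
net.ipv4.udp_rmem_min = 16777216
EOF
# Load the settings into the running kernel and verify:
$ sudo sysctl --system
$ sysctl net.core.rmem_max net.ipv4.udp_rmem_min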