Greenplum logs show instances of an error similar to:
2025-05-15 01:08:39.816264 UTC,"gpmon","gpperfmon",p3688255,th2074723456,"10.200.10.1","38334",2025-05-15 00:08:39 UTC,1225579,con206991,cmd9,segN,slice1,dx737529,x1225579,sx1,"ERROR","58M01","interconnect encountered a network error, please check your network","Failed to send packet (seq 1) to 10.200.10.1:65330 (pid 2070570 cid 44) after 3581 retries in 3600 seconds.",,,,,"INSERT INTO gpmetrics.gpcc_plannode_history SELECT * FROM gpmetrics._gpcc_plannode_history",0,,"ic_udpifc.c",5152,
Greenplum 6.X
Greenplum 7.X
Azure
Azure requires that UDP port 65330 on each Virtual Machine be reserved for Microsoft backend operations. If it is not reserved, interconnect failures can occur on connections across several ports, including but not limited to 65330.
For cases where the Greenplum environment is running in Azure and the majority of the failures are attributed to port 65330, it is reasonable to conclude that the issue is caused by this port not being reserved, as outlined in both the Microsoft and Broadcom documentation:
"Network
The Accelerated Networking option offloads CPU cycles for networking to "FPGA-based SmartNICs". Virtual machine types either support this or do not, but most do support it. Testing of Greenplum hasn't shown much difference and this is probably because of Azure's preference for TCP over UDP. Despite this, UDPIFC interconnect is the ideal protocol to use in Azure.
There is an undocumented process in Azure that periodically runs on the host machines on UDP port 65330. When a query runs using UDP port 65330 and this undocumented process runs, the query will fail after one hour with an interconnect timeout error. This is fixed by reserving port 65330 so that Greenplum doesn't use it."
To reserve this port, add the line below to /etc/sysctl.conf on each Greenplum host and then run "sudo sysctl -p":
net.ipv4.ip_local_reserved_ports=65330
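After applying the change, you can confirm the reservation by reading /proc/sys/net/ipv4/ip_local_reserved_ports. The kernel stores this value as a comma-separated mix of single ports and ranges (e.g. "1000-2000,65330"), so a small helper makes the check robust. This is a sketch; the port_is_reserved function name is our own, not part of any tool:

```shell
# port_is_reserved LIST PORT - prints "yes" if PORT appears in LIST,
# where LIST is the kernel's comma-separated mix of ports and ranges.
port_is_reserved() {
    local list="$1" port="$2" entry lo hi
    IFS=',' read -ra entries <<< "$list"
    for entry in "${entries[@]}"; do
        if [[ "$entry" == *-* ]]; then
            # Range entry such as "1000-2000": split into low and high bounds
            lo="${entry%-*}" hi="${entry#*-}"
            if (( port >= lo && port <= hi )); then echo yes; return; fi
        elif [[ "$entry" == "$port" ]]; then
            echo yes; return
        fi
    done
    echo no
}

# On a live host, check the running kernel's value:
#   port_is_reserved "$(cat /proc/sys/net/ipv4/ip_local_reserved_ports)" 65330
port_is_reserved "65330" 65330        # yes
port_is_reserved "1024-65000" 65330   # no
```

Run this on each Greenplum host; "yes" for port 65330 means the sysctl change took effect.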
Errors of the form "interconnect encountered a network error, please check your network" are general network errors, so to better understand the cause it helps to count how often each IP and port appears in the failures. Run the script below to tally the IP and port occurrences, replacing LOG_FILE.txt with the name of your log artifact:
awk '
BEGIN {
    printf "%-8s %s\n", "COUNT", "PORT"
}
{
    if (match($0, /Failed to send packet \(seq [0-9]+\) to [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+/)) {
        s = substr($0, RSTART, RLENGTH)
        if (match(s, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+/)) {
            split(substr(s, RSTART, RLENGTH), parts, ":")
            ip = parts[1]
            port = parts[2]
            c_port[port]++
            c_ip[ip]++
        }
    }
}
END {
    for (p in c_port) printf "%-8d %s\n", c_port[p], p
    print ""
    printf "%-8s %s\n", "COUNT", "IP"
    for (i in c_ip) printf "%-8d %s\n", c_ip[i], i
}' LOG_FILE.txt
Output:
COUNT    PORT
2        47269
14       65330

COUNT    IP
6        10.200.32.1
6        10.200.32.2
1        10.200.32.3
3        10.200.32.4
The output above shows that port 65330 accounts for the majority of the connection failures (14 of 16).
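The same per-port tally can also be produced with a shorter pipeline of standard tools (grep, awk, sort, uniq). The sample_log.txt below is a fabricated stand-in for your log artifact, included only so the pipeline is runnable as shown:

```shell
# Fabricated sample lines mimicking the interconnect error format
cat > sample_log.txt <<'EOF'
"Failed to send packet (seq 1) to 10.200.32.1:65330 (pid 100 cid 1) after 3581 retries in 3600 seconds."
"Failed to send packet (seq 2) to 10.200.32.2:65330 (pid 101 cid 2) after 3581 retries in 3600 seconds."
"Failed to send packet (seq 3) to 10.200.32.1:47269 (pid 102 cid 3) after 12 retries in 30 seconds."
EOF

# Extract each failed packet's destination, keep the port, and count occurrences
grep -oE 'Failed to send packet \(seq [0-9]+\) to [0-9.]+:[0-9]+' sample_log.txt \
    | awk -F: '{print $NF}' | sort | uniq -c | sort -rn
# For this sample, the port column shows 65330 twice and 47269 once
```

Against a real log, replace sample_log.txt with your log file; the highest-count port appears first.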