A customer reported that a number of queries failed with the error "interconnect encountered a network error, please check your network" after upgrading from GPDB 6.27.1 on RHEL 7.9 to GPDB 6.28.0 on RHEL 8.8. The GPDB log contains entries such as the following:
2024-10-23 11:44:47.198048 EDT,"username","dbname",p117108,th-596684160,"##.##.##.##","58436",2024-10-23 10:42:58 EDT,27846253,con507570,cmd15,seg-1,,dx1142214,x27846253,sx1,"ERROR","58M01","interconnect encountered a network error, please check your network (seg241 slice2 ##.##.##.##:40000 pid=8764)","Failed to send packet (seq 6) to ##.##.##.##:32248 (pid 85752 cid 285) after 3571 retries in 3600 seconds."
To troubleshoot this error, collect the following artifacts from both the sender and receiver.
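1. In the GPDB logs, locate the "interconnect encountered a network error, please check your network" entry and note the PIDs it reports. Confirm the job is still running, then SSH into the sender and receiver segment hosts and confirm the PIDs are still active before collecting the artifacts below from both segments. In the excerpt below, the process that raised the error is pid 8764 on seg241, and the destination of the failed send is pid 85752: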
(seg241 slice2 ##.##.##.##:40000 pid=8764)","Failed to send packet (seq 6) to ##.##.##.##:32248 (pid 85752 cid 285) after 3571 retries
2. strace -f -k -p <pid>
3. pstack <pid>
4. lsof -n -P -E -p <pid>
5. gcore <pid>
6. Run packcore on the core file produced by gcore.
7. Run script_log_alter.sql to enable debug logging for 10 seconds:
alter system set log_min_messages = debug1;
alter system set gp_log_interconnect = debug;
select pg_reload_conf();
select pg_sleep(10);
alter system reset log_min_messages;
alter system reset gp_log_interconnect;
select pg_reload_conf();
8. Ask the customer to upload the artifacts to the case for analysis.
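Steps 1 through 5 can be sketched as a small shell helper. This is a minimal sketch, not a supported tool. The function names (sender_pid, receiver_pid, pid_alive, collect_artifacts), the output file names, and the 30-second strace window are all assumptions made for illustration. It also assumes the error message has the same layout as the excerpt in step 1. packcore is left out and should be run by hand on the resulting core file.

```shell
#!/bin/sh
# Sketch only: helper names, file names, and the 30 s window are assumptions.

# Print the PID from "(segN sliceM host:port pid=NNNN)" in the error line.
sender_pid() {
    printf '%s\n' "$1" | sed -n 's/.*pid=\([0-9][0-9]*\)).*/\1/p'
}

# Print the PID from "(pid NNNN cid NNN)" in the error detail.
receiver_pid() {
    printf '%s\n' "$1" | sed -n 's/.*(pid \([0-9][0-9]*\) cid [0-9][0-9]*).*/\1/p'
}

# Return 0 if the PID exists on this host. Run as the process owner or root;
# kill -0 also fails for a live process that the caller may not signal.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# Run the collection commands from steps 2-5 for one PID.
collect_artifacts() {
    pid="$1"
    out="${2:-.}"
    # strace stays attached until interrupted, so bound it with timeout.
    timeout 30 strace -f -k -p "$pid" -o "$out/strace.$pid.txt"
    pstack "$pid" > "$out/pstack.$pid.txt"
    lsof -n -P -E -p "$pid" > "$out/lsof.$pid.txt"
    gcore -o "$out/core" "$pid"    # gcore writes $out/core.<pid>
}
```

For example, on the host that raised the error, store the log line in a variable named line, then run: pid=$(sender_pid "$line"); pid_alive "$pid" && collect_artifacts "$pid" /tmp. Repeat on the receiver host with receiver_pid.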
Prod
GPDB: 6.28.0
OS: RHEL 8.8
Engineering is still investigating this issue but has provided two workarounds.
Workaround 1:
SET gp_interconnect_fc_method = "capacity" at the session level for the failing query.
Workaround 2:
SET gp_interconnect_queue_depth = 64 at the session level for the failing query. This can be faster than Workaround 1 (gp_interconnect_fc_method), at the cost of slightly more memory used by the query.
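Because both workarounds are session-level settings, the SET statement and the failing query have to run in the same session. The following is a minimal sketch of one way to do that from the shell. The function name workaround_sql is hypothetical, and the query and database name in the usage line are placeholders. The function only prints SQL, so the output can be reviewed before it is piped to psql.

```shell
#!/bin/sh
# Sketch only: workaround_sql is a hypothetical helper, not a GPDB utility.
# Usage: workaround_sql <1|2> '<query without trailing semicolon>'
workaround_sql() {
    case "$1" in
        1) guc="gp_interconnect_fc_method";   val="'capacity'" ;;
        2) guc="gp_interconnect_queue_depth"; val="64" ;;
        *) echo "usage: workaround_sql 1|2 'query'" >&2; return 1 ;;
    esac
    printf 'SET %s = %s;\n' "$guc" "$val"   # applies to this session only
    printf '%s;\n' "$2"                     # the failing query
    printf 'SHOW %s;\n' "$guc"              # confirm the value that was in effect
}
```

For example: workaround_sql 2 "SELECT count(*) FROM my_table" | psql -d dbname. A single psql invocation reading from standard input is one session, so the setting applies to the query and is discarded when psql exits.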