While running arbitrary queries, particular queries occasionally hang, and time out with this Interconnect error:
Execution error: ERROR: Interconnect encountered a network error, please check your network (seg64 slice1 nvlxa487data:40000 pid=162631)
You will also see errors like below in the segment logs:
Failed to send packet (seq 1) to 10.xx.xx.xxx:xxxx (pid 182182 cid -1) after 3580 retries in 3600 seconds
Steps to troubleshoot this situation:
1. Follow the steps mentioned in the above Troubleshooting guide to find network errors, if any.
2. If the problem cannot be determined using the Network Troubleshooting Guide, follow the steps below to identify if this article is applicable-
Run a series of simple queries until one hangs:
psql -c ‘CREATE TABLE foo (a int);’ psql -c ‘INSERT INTO foo VALUES(generate_series(1,1000))’ while true; do echo -n `date +"%T.%N :"` && psql -t -c ‘SELECT count(a) FROM foo’; done
In another terminal, find the connection number for the hung query:
gpadmin=# select * from pg_stat_activity; datid | datname | procpid | sess_id | usesysid | usename | current_query | waiting | query_start | backend_start | client_addr | client_port | ap plication_name | xact_start | waiting_reason -------+---------+---------+---------+----------+---------+------------------------------------+---------+-------------------------------+-------------------------------+----------------+-------------+--- ---------------+-------------------------------+---------------- 17154 | gpadmin | 7652 | 633 | 10 | gpadmin | select count(*) from foo; | f | 2017-03-27 17:41:28.610611+00 | 2017-03-27 17:38:42.958784+00 | 192.168.64.100 | 40332 | ps ql | 2017-03-27 17:41:28.610611+00 |
Check to see on which nodes the query is still active:
[gpadmin@mdw ~]$ psql -t -c ‘select distinct hostname from gp_segment_configuration;’ > seg_hostfile [gpadmin@mdw ~]$ gpssh -f seg_hostfile Note: command history unsupported on this machine ... => ps -ef | grep con633 | grep -v grep [sdw2.gpmt.local] gpadmin 6353 3629 0 17:45 ? 00:00:00 postgres: port 40000, gpadmin gpadmin 192.168.64.100(47186) con633 seg2 cmd8 slice1 idle [sdw2.gpmt.local] gpadmin 6355 3631 0 17:45 ? 00:00:00 postgres: port 40001, gpadmin gpadmin 192.168.64.100(22030) con633 seg3 cmd8 slice1 idle [ mdw.gpmt.local] gpadmin 7652 3780 0 17:38 ? 00:00:00 postgres: port 5432, gpadmin gpadmin 192.168.64.100(40332) con633 192.168.64.100(40332) cmd7 SELECT [sdw1.gpmt.local] gpadmin 6353 3627 0 17:45 ? 00:00:00 postgres: port 40000, gpadmin gpadmin 192.168.64.100(40454) con633 seg0 cmd8 slice1 idle [sdw1.gpmt.local] gpadmin 6355 3630 0 17:45 ? 00:00:00 postgres: port 40001, gpadmin gpadmin 192.168.64.100(32030) con633 seg1 cmd8 slice1 MPPEXEC SELECT
If you see more than one segment host and the master actively running a select command, then it is unlikely to be related to this article.
3. Identify the ephemeral port range:
[gpadmin@mdw ~]$ gpssh -f seg_hostfile Note: command history unsupported on this machine ... [sdw1.gpmt.local] gpadmin 6355 3630 0 17:45 ? 00:00:00 postgres: port 40001, gpadmin gpadmin 192.168.64.100(32030) con633 seg1 cmd8 slice1 MPPEXEC SELECT => cat /proc/sys/net/ipv4/ip_local_port_range [sdw2.gpmt.local] 1025 65535 [ mdw.gpmt.local] 1025 65535 [sdw1.gpmt.local] 1025 65535
Scenario 1 - Firewall issues
Determine if the firewall is “ON” on the segments and the master:
service iptables status
If the firewall is active, determine if it is blocking UDP interconnect traffic on any of the ephemeral ports.
Scenario 2 - One or more specific port is blocked
Go to the segment host with the active select statement identified above:
[gpadmin@mdw ~]$ gpssh -f seg_hostfile Note: command history unsupported on this machine ... => ps -ef | grep con633 | grep -v grep [sdw2.gpmt.local] gpadmin 6353 3629 0 17:45 ? 00:00:00 postgres: port 40000, gpadmin gpadmin 192.168.64.100(47186) con633 seg2 cmd8 slice1 idle [sdw2.gpmt.local] gpadmin 6355 3631 0 17:45 ? 00:00:00 postgres: port 40001, gpadmin gpadmin 192.168.64.100(22030) con633 seg3 cmd8 slice1 idle [ mdw.gpmt.local] gpadmin 7652 3780 0 17:38 ? 00:00:00 postgres: port 5432, gpadmin gpadmin 192.168.64.100(40332) con633 192.168.64.100(40332) cmd7 SELECT [sdw1.gpmt.local] gpadmin 6353 3627 0 17:45 ? 00:00:00 postgres: port 40000, gpadmin gpadmin 192.168.64.100(40454) con633 seg0 cmd8 slice1 idle [sdw1.gpmt.local] gpadmin 6355 3630 0 17:45 ? 00:00:00 postgres: port 40001, gpadmin gpadmin 192.168.64.100(32030) con633 seg1 cmd8 slice1 MPPEXEC SELECT
Run strace on the query executor:
strace -f -p <PID> | egrep “sendto|recvfrom” [gpadmin@mdw ~]$ strace -f -p 6355 2>&1 | egrep “sendto|recvfrom” [pid 6355] sendto(9, “\2\0\0\0\316e\0\0\227\261\0\0\211\262\0\0\221_\0\0\24\0\0\0\7\0\0\0\0\0\0\0\2\0\0\0\20\0\0\0\377\377\377\377\0\0\0\0\10\0\0\0D\0\0\0\1\0\0\0\0\0\0\0\0\0\4\0", 68, 0, {sa_family=AF_INET6, sin6_port=htons(24465), inet_pton(AF_INET6, “::ffff:10.255.65.140”, &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 68 [pid 6355] recvfrom(8, 0x7ffdad65258f, 1, 66, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) [pid 6355] recvfrom(8, 0x7ffdad65258f, 1, 66, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) [pid 6355] recvfrom(8, 0x7ffdad65258f, 1, 66, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) [pid 6355] recvfrom(8, 0x7ffdad65258f, 1, 66, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) [pid 6355] recvfrom(8, 0x7ffdad65258f, 1, 66, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
You should see repeated attempts to send data to the master (sendto), but all attempts to read a response (recvfrom) will time out with -1 EAGAIN (Resource temporarily unavailable). The sendto system calls will contain the interconnect port that the segment is sending interconnect traffic to.
On the master host, confirm that the query is listening on the UDP port:
[gpadmin@mdw ~]$ netstat -pantu | grep 24465 (Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.) udp 0 0 :::24465 :::* 7652/postgres
As root on both the master and the segment hosts, run tcpdump to see if the traffic is being seen on both nodes:
[root@mdw ~]# tcpdump -i any port 24465 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes 21:38:01.231518 IP sdw2.gpmt.local.6784 > mdw.gpmt.local.24465: UDP, length 68 21:38:01.231519 IP mdw.gpmt.local.24465 > sdw2.gpmt.local.6784: UDP, length 64 21:38:02.231617 IP sdw2.gpmt.local.6784 > mdw.gpmt.local.24465: UDP, length 68 21:38:02.231625 IP mdw.gpmt.local.24465 > sdw2.gpmt.local.6784: UDP, length 64 ^C 0 packets captured 0 packets received by filter 0 packets dropped by kernel [root@sdw2 ~]# tcpdump -i any port 24465 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes 21:38:01.231518 IP sdw2.gpmt.local.6784 > mdw.gpmt.local.24465: UDP, length 68 21:38:02.231617 IP sdw2.gpmt.local.6784 > mdw.gpmt.local.24465: UDP, length 68 ^C 0 packets captured 0 packets received by filter 0 packets dropped by kernel
In case of the above scenario, the segment is able to successfully send packets to the master, but the return route is being blocked.
You can run the attached script on both the ports to see if the data is flowing through the ports:
<data_flow.py> <host> <port>