You are experiencing an intermittent network issue and you do not have a good way to validate the cause in Tanzu Greenplum.
For example, you may see the following logs for gpfdist:
2021-08-05 09:40:32.412156 CST|jzfp|jzfpdb|p464972|th141129504|[local]||2021-08-05 09:39:26 CST|7257181|con3596|cmd3|seg-1||dx21822|x7257181|sx1|ERROR: |08006|connection with gpfdist failed for gpfdist://10.##.###.160:8083/AGRI/AGRI_INC_DATA/20210703/CNA01100207.dat.gz. effective url: http://10.##.###.160:8083/AGRI/AGRI_INC_DATA/20210703/CNA01100207.dat.gz. error code = 110 (Connection timed out) (seg28 slice1 sdw5:33004 pid=35177)||||||insert into JZFP.AGRI_INC_ODS_A_D_UTIOBJECTINFO (...
The issue occurs randomly, and you cannot identify the root cause from the logs because the process hangs and produces the same error each time. In this scenario, you would normally use tcpdump to analyze the packets. However, because the issue happens on random segments and random hosts, it is impractical to monitor the traffic on every NIC with tcpdump.
Before using tcpdump, you can use other tools to test for network issues. This article covers how to troubleshoot intermittent network issues in Tanzu Greenplum with 3rd party tools.
In this case, you can use 3rd party tools to simulate the gpfdist traffic. If the same issue happens when using 3rd party tools, then it is clear that our software (gpfdist) is not the cause.
Note: The example below uses nc to simulate the server and the client. This article covers how to use 3rd party tools for both normal sites and restricted sites.
Firstly, use nc to set up the server to listen on port 8083 on the master host:
[gpadmin@mdw ~]$ nc -l -k 8083
On the segment hosts, use nc to connect to port 8083 of the master host. Because this is an intermittent issue and only the failures matter, use grep to filter out the success messages.
In some environments, the following messages are displayed after 5 minutes:
[gpadmin@sdw3 ~]$ while true; do nc -zv 10.##.###.160 8083 ; sleep 0.1;done | grep -v succeed
nc: connect to 10.##.###.160 port 8083 (tcp) failed: Connection timed out
nc: connect to 10.##.###.160 port 8083 (tcp) failed: Connection timed out
nc: connect to 10.##.###.160 port 8083 (tcp) failed: Connection timed out

[gpadmin@sdw4 ~]$ while true; do nc -zv 10.##.###.160 8083 ; sleep 0.1;done | grep -v succeed
nc: connect to 10.##.###.160 port 8083 (tcp) failed: Connection timed out
Note: Run the test for at least 10 minutes to catch intermittent network issues.
If you see Connection timed out or any other error message, then it is an OS/network issue. The same approach applies to other software such as PXF, Greenplum Streaming Server, etc.
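Because the gpfdist error in the Greenplum log carries a timestamp, it can help to record when each nc failure occurs so the two can be compared. The following is a minimal sketch, not part of the original procedure: the function names stamp_failures and probe_loop are hypothetical, the defaults (600 seconds, 0.1 second interval) simply mirror the values used above, and the filter assumes the installed nc prints success messages containing "succeed", as in the command above.

```shell
#!/usr/bin/env bash
# Sketch only: stamp_failures and probe_loop are hypothetical helper names.

# Read probe output on stdin, drop success lines, and prefix every
# remaining line with the current date and time.
stamp_failures() {
    local line
    while IFS= read -r line; do
        case "$line" in
            *succeed*) continue ;;   # success messages are not interesting
        esac
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Probe HOST:PORT with nc every INTERVAL seconds for DURATION seconds.
# Defaults: 600 seconds (10 minutes) and 0.1 seconds between probes.
probe_loop() {
    local host="$1" port="$2" duration="${3:-600}" interval="${4:-0.1}"
    local end
    end=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$end" ]; do
        nc -zv "$host" "$port" 2>&1   # merge stderr so failures reach the filter
        sleep "$interval"
    done
}

# Example (replace <master_host> with the real address):
#   probe_loop <master_host> 8083 600 | stamp_failures | tee nc_failures.log
```

Merging stderr into stdout means the filter sees both success and failure messages regardless of which stream the local nc build writes them to.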
In some restricted sites, you might not have nc. In this case, use openssl to set up the server and telnet to work as the client.
Use this command to generate a self-signed certificate and private key for the test:
[root@gpdb-sandbox ~]# openssl req -x509 -nodes -days 365 -subj '/C=US/ST=Ca/L=Sunnydale/CN=www.unserdom.com' -newkey rsa:1024 -keyout prikey.pem -out cert.pem
Generating a 1024 bit RSA private key
...................++++++
.....++++++
writing new private key to 'prikey.pem'
-----
[root@gpdb-sandbox ~]# ls -l
total 28
-rw-------. 1 root root 1729 Dec 15 2017 anaconda-ks.cfg
-rw-r--r-- 1 root root 883 Aug 10 20:15 cert.pem
-rw-------. 1 root root 1203 Dec 15 2017 original-ks.cfg
-rw-r--r-- 1 root root 920 Aug 10 20:15 prikey.pem
-rw-r--r-- 1 root root 0 Nov 28 2020 privkey.pem
-rw-r--r-- 1 root root 9504 Aug 4 03:45 yum.out
Use the below command to start up an HTTPS server:
[root@gpdb-sandbox ~]# openssl s_server -accept 443 -cert cert.pem -key prikey.pem -www
In another session, use telnet as the client, in place of nc, to connect to port 443 on the server. Redirect the success messages to /dev/null to suppress unnecessary output.
Below is example output when hitting network issues:
[root@gpdb-sandbox ~]# while true; do echo -e '\x1dclose\x0d' | telnet 127.#.#.1 443 > /dev/null ; sleep 1; done
telnet: connect to address 127.#.#.1: Connection refused
telnet: connect to address 127.#.#.1: Connection refused
telnet: connect to address 127.#.#.1: Connection refused
telnet: connect to address 127.#.#.1: Connection refused
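If a restricted site has neither nc nor telnet, bash itself may be able to act as the client through its /dev/tcp redirection. This is an alternative the original procedure does not use, shown here only as a sketch: the function name tcp_probe is hypothetical, it assumes bash was built with network redirections enabled and that the coreutils timeout command is available, and the 5 second default timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch only: tcp_probe is a hypothetical helper name.

# Attempt one TCP connection to HOST:PORT using bash's /dev/tcp redirection.
# Returns 0 if the connection opens within SECS seconds, non-zero otherwise
# (connection refused, unreachable, or timed out).
tcp_probe() {
    local host="$1" port="$2" secs="${3:-5}"
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example (replace <server_host> with the real address):
#   while true; do
#       tcp_probe <server_host> 443 || echo "$(date '+%Y-%m-%d %H:%M:%S') connect to <server_host>:443 failed"
#       sleep 1
#   done
```

Only failed attempts print anything, which matches the intent of discarding the success messages in the telnet loop.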
If the above tests show that the OS/network has issues, work with your OS/network admins to identify and resolve the issue.
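When handing the results to the OS/network admins, a per-minute failure count can show whether the failures cluster at particular times. The sketch below assumes the failure messages were saved with a leading "YYYY-MM-DD HH:MM:SS" timestamp on each line; that format is an assumption, since the nc and telnet commands in this article print no timestamps on their own. The function name failures_per_minute is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: failures_per_minute is a hypothetical helper name.

# Read lines that start with "YYYY-MM-DD HH:MM:SS" on stdin and print
# "YYYY-MM-DD HH:MM <count>" for each minute, in chronological order.
failures_per_minute() {
    awk '{ count[$1 " " substr($2, 1, 5)]++ }
         END { for (minute in count) print minute, count[minute] }' | sort
}

# Example (nc_failures.log is a placeholder file name):
#   failures_per_minute < nc_failures.log
```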