How to troubleshoot intermittent network issues in Tanzu Greenplum

Article ID: 296706

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

You are experiencing an intermittent network issue and you do not have a good way to validate the cause in Tanzu Greenplum. 

For example, you may see the following logs for gpfdist:

2021-08-05 09:40:32.412156 CST|jzfp|jzfpdb|p464972|th141129504|[local]||2021-08-05 09:39:26 CST|7257181|con3596|cmd3|seg-1||dx21822|x7257181|sx1|ERROR: |08006|connection with gpfdist failed for gpfdist://10.##.###.160:8083/AGRI/AGRI_INC_DATA/20210703/CNA01100207.dat.gz. effective url: http://10.##.###.160:8083/AGRI/AGRI_INC_DATA/20210703/CNA01100207.dat.gz. error code = 110 (Connection timed out)  (seg28 slice1 sdw5:33004 pid=35177)||||||insert into JZFP.AGRI_INC_ODS_A_D_UTIOBJECTINFO
(...


The issue occurs randomly, and the logs do not reveal the root cause because the process hangs and reports the same error each time. In this scenario, you need tcpdump to analyze the packets. However, because the issue occurs on random segments and random hosts, it is hard to monitor the traffic on every NIC with tcpdump.
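
Before testing the network, it can help to confirm which segment hosts report the error. A minimal sketch, assuming the log lines carry the "(segN sliceN host:port pid=N)" suffix shown in the example above; gpfdist_failures_by_host is a hypothetical helper name, not a Greenplum utility:

```shell
# Hypothetical helper: count gpfdist connection failures per segment host.
# Reads Greenplum log lines on stdin and relies on the
# "(segN sliceN host:port pid=N)" suffix seen in the error above.
gpfdist_failures_by_host() {
    grep 'connection with gpfdist failed' \
        | sed -n 's/.*(seg[0-9]* slice[0-9]* \([^: ]*\):[0-9]* pid=[0-9]*).*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

For example, pipe the master log files for the affected day into it (the pg_log path under $MASTER_DATA_DIRECTORY is an assumption about a default layout): cat "$MASTER_DATA_DIRECTORY"/pg_log/gpdb-2021-08-05_*.csv | gpfdist_failures_by_host. The output lists each host with its failure count, highest first.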

Before using tcpdump, you can use other tools to test for network issues. This article covers how to troubleshoot intermittent network issues in Tanzu Greenplum with third-party tools.

Environment

Product Version: 6.16

Resolution

In this case, you can use third-party tools to simulate the gpfdist traffic. If the same issue occurs with the third-party tools, then it is clear that our software (gpfdist) is not the cause.

Note: The example below uses nc to simulate the server and the client. This article covers how to use third-party tools for both normal sites and restricted sites.


Normal sites 

Firstly, use nc to set up the server to listen on port 8083 on the master host:

[gpadmin@mdw ~]$ nc -l -k 8083 


On the segment hosts, use nc to connect to port 8083 on the master host. Because the issue is intermittent and only the failures matter, use grep -v to filter out the success messages.

In some environments, the following messages are displayed after 5 minutes:

[gpadmin@sdw3 ~]$ while true; do nc -zv 10.##.###.160 8083 ; sleep 0.1;done | grep -v succeed
nc: connect to 10.##.###.160 port 8083 (tcp) failed: Connection timed out
nc: connect to 10.##.###.160 port 8083 (tcp) failed: Connection timed out
nc: connect to 10.##.###.160 port 8083 (tcp) failed: Connection timed out

[gpadmin@sdw4 ~]$ while true; do nc -zv 10.##.###.160 8083 ; sleep 0.1;done | grep -v succeed
nc: connect to 10.##.###.160 port 8083 (tcp) failed: Connection timed out


Note: Run the test for at least 10 minutes to catch intermittent network issues.
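
The loop above runs until it is interrupted, and its failure messages carry no timestamps. A minimal sketch of a bounded run, assuming the nc on your hosts accepts -z (connect without sending data) and -w (connect timeout in seconds); probe_for is a hypothetical helper name, not a Greenplum utility. It probes for a fixed number of seconds and prints a timestamp with each failure so that failures can be matched against the gpfdist error times in the Greenplum log:

```shell
# Hypothetical helper: probe HOST:PORT repeatedly for DURATION seconds and
# print one timestamped line per failed attempt, then a failure count.
# Usage: probe_for HOST PORT DURATION_SECONDS [INTERVAL_SECONDS]
probe_for() {
    local host="$1" port="$2" duration="$3" interval="${4:-1}"
    local end failures=0
    end=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$end" ]; do
        # Any non-zero exit from nc (timeout, refused, unreachable) counts as a failure.
        if ! nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
            failures=$(( failures + 1 ))
            printf '%s FAILED %s:%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$host" "$port"
        fi
        sleep "$interval"
    done
    printf 'failures=%d\n' "$failures"
}
```

For example, probe_for mdw 8083 600 0.1 tests port 8083 on the master host for 10 minutes at the same 0.1 second interval used above.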

If you see Connection timed out or any other error message, then it is an OS/Network issue. The same approach applies to other software such as PXF and Greenplum Streaming Server.


Restricted sites

At some restricted sites, nc might not be available. In this case, use openssl to set up the server and telnet as the client.

Use this command to generate a self-signed certificate and private key for the test:

[root@gpdb-sandbox ~]# openssl req -x509 -nodes -days 365 -subj '/C=US/ST=Ca/L=Sunnydale/CN=www.unserdom.com' -newkey rsa:1024 -keyout prikey.pem -out cert.pem
Generating a 1024 bit RSA private key
...................++++++
.....++++++
writing new private key to 'prikey.pem'
-----
[root@gpdb-sandbox ~]# ls -l
total 28
-rw-------. 1 root root 1729 Dec 15  2017 anaconda-ks.cfg
-rw-r--r--  1 root root  883 Aug 10 20:15 cert.pem
-rw-------. 1 root root 1203 Dec 15  2017 original-ks.cfg
-rw-r--r--  1 root root  920 Aug 10 20:15 prikey.pem
-rw-r--r--  1 root root    0 Nov 28  2020 privkey.pem
-rw-r--r--  1 root root 9504 Aug  4 03:45 yum.out


Use the below command to start up an HTTPS server: 

[root@gpdb-sandbox ~]#  openssl s_server -accept 443 -cert cert.pem -key prikey.pem -www


In another session, use telnet as the client, in place of nc, to connect to port 443 on the server. Redirect standard output to /dev/null so that only the connection errors, which telnet writes to standard error, are displayed.

Below is example output when hitting network issues:

[root@gpdb-sandbox ~]# while true; do echo -e '\x1dclose\x0d' | telnet 127.#.#.1 443 > /dev/null ; sleep 1; done
telnet: connect to address 127.#.#.1: Connection refused
telnet: connect to address 127.#.#.1: Connection refused
telnet: connect to address 127.#.#.1: Connection refused
telnet: connect to address 127.#.#.1: Connection refused
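
If telnet is also unavailable, bash itself can act as the client. A minimal sketch, assuming bash was built with /dev/tcp redirection support and that timeout from GNU coreutils is installed; tcp_probe and watch_port are hypothetical helper names:

```shell
# Hypothetical helper: succeed if a TCP connection to HOST:PORT opens
# within 5 seconds. Relies on bash's /dev/tcp redirection.
tcp_probe() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Same shape as the telnet loop: print only the failures, with a timestamp.
# Runs until interrupted with Ctrl-C.
watch_port() {
    while true; do
        tcp_probe "$1" "$2" || echo "$(date '+%Y-%m-%d %H:%M:%S') connect to $1 port $2 failed"
        sleep 1
    done
}
```

For example, watch_port mdw 8083 probes port 8083 on the master host once per second. If bash lacks /dev/tcp support, every probe fails, so check the helper against a port that is known to be open before relying on it.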


If the above tests show that the OS/Network has issues, then work with your OS/Network admins to identify and resolve the issue.