When running Tanzu Greenplum, we may find ourselves troubleshooting network packet loss errors. In this article, we describe some of the standard tools available on most Linux systems for troubleshooting network-related errors and issues, and we explain which layer each tool reports on so that you can get a sense of where the fault might be and how to proceed from there. Knowing which network troubleshooting tools are available, and when to use each one, is a big help when dealing with misbehaving distributed applications.
At a high level, the life cycle of a network datagram runs from the NIC (Network Interface Card), up through the kernel, and finally to the application. This obviously skips a lot of detail and abstracts away many concepts, so do not expect a deep dive into how Linux handles the network. Instead, this article provides guidance on how to troubleshoot it.
This simplified flow shows that a network packet has to pass through many layers before it reaches the application. That is important to understand because packet loss seen on the client can sometimes be directly correlated to an application issue: distributed applications rely heavily on the network to share and process data, and if the application cannot keep up with the incoming data, the kernel may start dropping packets. Understanding this basic flow will help you work the problem.
From the typical distributed application user's perspective, it is not always obvious whether the problem is in the external network. It is important to rule out what you can before assuming it is some mystical networking issue, so before you call in your network expert, work through the checks below.
Inspecting at the NIC/Firmware/Driver level
ethtool can read information from the NIC driver given an associated interface name.
This command shows the NIC driver state. Immediately from this output, we can see whether the interface link is up or down ("Link detected: yes"). In addition, we can confirm that the Speed is 10Gbit and the Duplex is Full:
ethtool <interface name>

[root@node4 ~]# ethtool eth0
Settings for eth0:
        Supported ports: [ ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Speed: 10000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        MDI-X: Unknown
        Link detected: yes
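In a cluster, these settings should match on every host, not just the one showing symptoms. As a rough sketch, assuming the Greenplum gpssh utility is available and that you have a hostfile listing all segment hosts (adjust both to your environment), the key fields can be checked everywhere at once:

# Check link state, speed, and duplex on every host in the cluster.
# Assumes a hostfile with one hostname per line and that eth0 is the
# interconnect interface on each host.
gpssh -f hostfile 'ethtool eth0 | egrep "Speed|Duplex|Link detected"'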
Checking the NIC driver settings. In general, distributed applications send and receive large amounts of data, so having features like generic receive offload (GRO) or large receive offload (LRO) enabled may create a network bottleneck at the driver level. In the case of GPDB, we transmit UDP datagrams in 8192-byte chunks, which results in fragmentation that puts your network interface card to work. If you have these settings enabled and are experiencing network latency related symptoms, a good test is to disable GRO and LRO:
ethtool -k <interface name>

[root@mdw ~]# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off
large-receive-offload: on
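To run that test, the offloads can be switched off with ethtool's -K (set) option. A minimal sketch, assuming eth0 is the interconnect interface; note that the change does not persist across reboots:

# Disable GRO and LRO on eth0. This reverts on reboot; persist it via
# the distro's network scripts if the test proves helpful.
ethtool -K eth0 gro off lro off

# Confirm the new state of both offloads
ethtool -k eth0 | egrep "generic-receive-offload|large-receive-offload"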
The capital -S switch dumps all of the firmware/driver counters available. Typically a great deal of information is dumped, and the output varies across firmware and driver revisions. If you find CRC errors, you can assume there is a hardware-related problem with the NIC cable or the NIC itself. The recommended action is to first try replacing the network cable that plugs into the eth port and ask the network team to rule out the hardware on the switch. If all of that fails, replace the NIC on the given server.
ethtool -S <interface name>

[root@mdw ~]# ethtool -S eth0 | egrep "crc|error"
rx_error_bytes: 0
tx_error_bytes: 0
tx_mac_errors: 0
tx_carrier_errors: 0
rx_crc_errors: 123
rx_align_errors: 0
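A nonzero counter by itself only proves that errors happened at some point since boot; what matters is whether the count is still climbing under load. One way to check, assuming eth0 and a 10-second sampling window:

# Sample rx_crc_errors twice, 10 seconds apart; a growing count points
# to a live cabling/NIC problem rather than a historical one.
before=$(ethtool -S eth0 | awk '/rx_crc_errors/ {print $2}')
sleep 10
after=$(ethtool -S eth0 | awk '/rx_crc_errors/ {print $2}')
echo "rx_crc_errors grew by $((after - before)) in 10 seconds"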
ifconfig and "netstat -i" both report essentially the same information: the tx/rx error counters you see in ethtool -S. Nonzero RX-DRP or RX-OVR counters are typical indicators of packet errors at the driver level. Drops or overruns can mean that the kernel is not pulling packets off the NIC fast enough, or that the driver cannot keep up with the workload. In the driver case, it is a good time to reach out to the vendor for support and ask whether firmware/driver updates or setting changes can improve performance here.
[root@gpdb-sandbox ~]# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:0C:29:F7:B1:14
          inet addr:172.16.34.128  Bcast:172.16.34.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fef7:b114/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:84887 errors:0 dropped:0 overruns:0 frame:0
          TX packets:28074 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:43123224 (41.1 MiB)  TX bytes:5168804 (4.9 MiB)
[root@mdw ~]# netstat -i
Kernel Interface table
Iface    MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0    1500   0 618457118      0      0      0 644714020      0      0      0 BMRU
eth1    1500   0 106690824      0      0      0  13942215      0      0      0 BMRU
eth4    1500   0 579489994      0      0      0   4526273      0      0      0 BMRU
eth5    1500   0      5145      0      0      0        11      0      0      0 BMRU
lo     16436   0   1040195      0      0      0   1040195      0      0      0 LRU
vmnet1  1500   0         0      0      0      0         6      0      0      0 BMRU
vmnet8  1500   0   1896834      0      0      0         6      0      0      0 BMRU
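As with the ethtool counters, these values are cumulative since boot, so a single reading can mislead. A simple way to see whether drops are accumulating right now, assuming the watch utility is available:

# Refresh the kernel interface table every two seconds while the
# problem workload runs and watch whether RX-DRP/RX-OVR climb.
watch -n 2 netstat -i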
Inspecting at the Kernel level

tcpdump operates at the kernel level and can be useful for inspecting TCP-based applications. For information on how and when to use this tool, refer to Running TCPDUMP to debug distributed applications.
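For reference, a typical capture invocation might look like the sketch below, assuming eth0 carries the traffic of interest and 5432 is the listener port in your environment (both are examples, not fixed values):

# Capture all traffic to/from port 5432 on eth0 into a pcap file;
# -s 0 grabs full packets, -w writes raw frames for later analysis
# with tcpdump -r or wireshark.
tcpdump -i eth0 -s 0 -w /tmp/port5432.pcap port 5432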
"netstat -s" will dump out all of the kernel counters and provides detailed information related to the UDP/TCP kernel stack counters.
netstat -st | egrep -i collapsed

[root@mdw ~]# netstat -st | egrep -i collapsed
    3190 packets collapsed in receive queue due to low socket buffer
netstat -st | egrep -i retrans

[root@mdw ~]# netstat -st | egrep -i retrans
    81145 segments retransmited
    51597 fast retransmits
    25709 forward retransmits
    1750 retransmits in slow start
    62 sack retransmits failed
netstat -su | egrep error

[root@mdw ~]# netstat -su | egrep error
    0 packet receive errors
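Because all of these counters are cumulative since boot, the most telling reading is the delta across the window in which the problem reproduces. A rough sketch:

# Snapshot the kernel TCP/UDP counters before and after reproducing
# the problem, then diff to see exactly which counters moved.
netstat -s > /tmp/netstat_before.txt
# ... reproduce the slow query / packet-loss symptom here ...
netstat -s > /tmp/netstat_after.txt
diff /tmp/netstat_before.txt /tmp/netstat_after.txt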
Checklist:

- Run ethtool <interface name> to confirm the link is detected and the Speed/Duplex values are as expected.
- Run ethtool -k <interface name> to review offload settings; if you see latency symptoms, test with GRO and LRO disabled.
- Run ethtool -S <interface name> to check the firmware/driver counters for CRC errors; replace the cable first, rule out the switch, then replace the NIC.
- Run ifconfig or netstat -i to look for drops and overruns at the driver level; engage the NIC vendor if the driver cannot keep up.
- Use tcpdump to inspect TCP-based application traffic at the kernel level.
- Run netstat -s to review the kernel UDP/TCP stack counters (collapsed packets, retransmits, receive errors).