Appliance unreachable and unresponsive hours after successful SGOS upgrade to 7.3.7.1

Article ID: 240100

Updated On:

Products

ASG-S500

Issue/Introduction

We successfully upgraded this appliance from 6.7.5.16 to 7.3.7.1.

We followed the given upgrade path: 6.7.5.16 > 7.2.1.1 > 7.2.5.1 > 7.2.6.1 > 7.3.7.1.

Everything went well until, about 6 hours later, we observed the following:

1/ The appliance responds to ICMP ping.

2/ There is no failover state change.

3/ The appliance appears to be in a zombie state: it can no longer be reached over HTTP, HTTPS, or SSH on any of the management or proxy services.

4/ After almost one hour, the appliance stopped responding completely, even to ping.

5/ We finally decided to power-cycle the appliance manually.

6/ After the reboot, no core or host dump was found on the appliance.

7/ We uploaded the sysinfo to this ticket for further analysis.

Question 1. Does this ring a bell?

Question 2. Does this look like any known bug in 7.3.7.1?

Environment

Release: 7.3.7.1

Resolution

Beginning with SGOS 6.7.x, two new features were introduced: TSO (TCP segmentation offload) and hardware checksum offload (transmit checksum). Both features are enabled by default. Background on checksum offload is available at the URL below.

https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Checksum_offload

These features improve the performance of the SGOS TCP/IP stack by offloading these tasks to the NIC (the SG's network adapter). In some deployments, however, the NIC's transmit (TX) queue has been observed to fill up, so that packets are dropped or not processed in a timely manner. In other words, the packets do not leave the SG/ASG. When this happens and packets such as ARP requests do not leave the SG's NIC, the device loses its connection to the default gateway. The SG then becomes unreachable from outside its network, so it may appear hung or unresponsive over the network, although it still responds on the serial console. Downgrading to the previous SGOS version, with no other change, resolves the problem. A cold boot also appears to resolve the issue.

A ProxySG/ASG is more likely to encounter this problem when the following conditions are true:

  • The device has an active 10G fiber or copper NIC.
  • The deployment has a high volume of intercepted and/or bypassed packets on that 10G NIC.

Note 1: If the ProxySG/ASG has another active interface besides the 10G interface (e.g. interface 0:0 used as the management interface), it remains reachable via that interface while this issue occurs.

Note 2: No logs (e.g. the sysinfo file/snapshot or the event log) indicate this problem; only a full memory core does. The full core must be obtained while the device or the 10G NIC is in the hung or unresponsive state.

On SGOS 7.3.7.1, we recommend running the CLI commands below to disable these features.

#conf t
#(config)tcp-ip tcp-tso disable
#(config)tcp-ip transmit-checksum disable

Note 3: While these features are disabled, these tasks are still performed, but by the SGOS TCP/IP stack instead of the ProxySG/ASG's NIC.
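
To make concrete what work moves back to the software stack, here is a minimal Python sketch of the two tasks in general terms: splitting a large payload into MSS-sized segments (what TSO does in hardware) and computing the 16-bit Internet checksum (what transmit checksum offload does in hardware). It is only an illustration, not SGOS code. The function names and the 1460-byte MSS are assumptions, and a real TCP checksum also covers a pseudo-header and the TCP header, which this sketch omits.

```python
from typing import List


def segment(payload: bytes, mss: int) -> List[bytes]:
    """Split a payload into chunks of at most `mss` bytes (illustrative TSO-style split)."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement Internet checksum over `data` (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into 16 bits
    return ~total & 0xFFFF


# Hypothetical 3000-byte payload and an assumed MSS of 1460 bytes.
segments = segment(b"x" * 3000, 1460)
checksums = [internet_checksum(s) for s in segments]
```

With offload enabled, the stack hands the NIC one large buffer and the NIC does this per-segment work; with offload disabled, the stack does it for every segment before queuing the packets for transmission.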

Note 4: These are hidden CLI commands. They are not listed among the available commands when you type '?', and they do not auto-complete with the Tab key. Once made, these changes are stored permanently in the SG's configuration and are preserved across reboots and upgrades to later SGOS versions.

In addition, to prevent the ASG from responding slowly to user traffic, we recommend disabling LRO (large receive offload) by running the CLI commands below.

#conf t
#(config)tcp-ip tcp-lro disable

Note 5: The observed issue is not a bug. Based on the logs and on the heartbeats that reached the backend, we have also confirmed that this was not a crash.

We expect the recommended changes to resolve the issue permanently. Monitor the appliance for a few days after making the changes, and contact Technical Support if you have further related questions. Note that this issue rarely recurs.
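
For the monitoring period, a small external probe can help tell apart the states described in the Issue section (answers ping but no TCP services, versus not reachable at all). The following Python sketch runs from a separate host and is not an SGOS tool. The port numbers (22 for SSH, 8082 for the HTTPS management console) and all function names are assumptions to adjust for your deployment. The ping result is passed in by the caller, since sending ICMP from Python normally requires elevated privileges.

```python
import socket
from typing import Dict, Iterable

# Assumed ports, adjust as needed: 22 = SSH, 8082 = HTTPS management console.
DEFAULT_PORTS = (22, 8082)


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name resolution failure
        return False


def classify(ping_ok: bool, tcp_ok: Dict[int, bool]) -> str:
    """Map a ping result and per-port TCP results to a coarse state label."""
    reachable = [port for port, ok in tcp_ok.items() if ok]
    if ping_ok and reachable and len(reachable) == len(tcp_ok):
        return "healthy"
    if ping_ok and not reachable:
        return "ping-only"      # answers ICMP, but no TCP service responds
    if not ping_ok and not reachable:
        return "unreachable"    # neither ICMP nor TCP responds
    return "partial"


def check(host: str, ping_ok: bool, ports: Iterable[int] = DEFAULT_PORTS) -> str:
    """Probe every port on `host` and classify the combined result."""
    return classify(ping_ok, {port: probe_tcp(host, port) for port in ports})
```

For example, `check("192.0.2.10", ping_ok=True)` (a placeholder address) returns "ping-only" when the appliance answers ping but neither port accepts a connection. Running such a check on a schedule and recording the result gives Technical Support a timeline if the issue recurs.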