search cancel

WCCP network port flapping- "WCCP: Router x.x.x.x is no longer ACTIVE for service group 99 due to unresponsiveness" 0 1E0000:96 format.cpp:549.

book

Article ID: 239989

calendar_today

Updated On:

Products

ProxySG Software - SGOS

Issue/Introduction

A network port flapping issue which shows a log message as: "WCCP: Router x.x.x.x is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549.

Resolution

From investigating the logs, the following would be seen, for the specific case.

2022-04-05 15:27:52+03:00AST  "Snapshot debug-cfssl has fetched /cfSSL/debug"  0 2C0006:96  snapshot_worker.cpp:237
2022-04-05 15:34:07+03:00AST  "WCCP: Router 10.28.0.1 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:35:30+03:00AST  "WCCP: Router 10.28.0.2 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:40:07+03:00AST  "WCCP: Router 10.28.0.1 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:40:07+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:40:07+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:40:07+03:00AST  "WCCP: Router 10.28.0.2 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:40:07+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:40:07+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:40:08+03:00AST  "Health Monitor (CRITICAL): CPU Utilization above 95 percent for last 120 seconds"  63 6F0102:1  notifier.cpp:111
2022-04-05 15:40:08+03:00AST  "Dynamic categorization error: Service is not healthy."  0 500000:1  cerberian_api.cpp:110
2022-04-05 15:40:09+03:00AST  "Dynamic categorization error: Service is not healthy."  0 500000:1  cerberian_api.cpp:110
2022-04-05 15:40:13+03:00AST  "WCCP: Received Removal Query from router 10.28.0.2 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:40:14+03:00AST  "Access Log FTP (main): Connecting to primary 10.28.38.55 server 10.28.38.55:21."  0 E0000:96  alog_ftp_client.cpp:153

2022-04-05 15:41:17+03:00AST  "Access Log FTP (main): 150 Ok to send data."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:41:30+03:00AST  "Snapshot debug-cfssl has fetched /cfSSL/debug"  0 2C0006:96  snapshot_worker.cpp:237
2022-04-05 15:41:30+03:00AST  "WCCP: Router 10.28.0.1 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:41:30+03:00AST  "WCCP: Router 10.28.0.2 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:41:31+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:41:46+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:42:15+03:00AST  "WCCP: Router 10.28.0.2 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:42:17+03:00AST  "Dynamic categorization error: Service is not healthy."  0 500000:1  cerberian_api.cpp:110
2022-04-05 15:42:27+03:00AST  "WCCP: Router 10.28.0.1 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:42:30+03:00AST  "WCCP: Received Removal Query from router 10.28.0.2 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:42:48+03:00AST  "WCCP: Router 10.28.0.2 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:42:48+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:42:48+03:00AST  "WCCP: Router 10.28.0.2 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:42:48+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:43:02+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:43:19+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:43:31+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:43:31+03:00AST  "Dynamic categorization error: Service is not healthy."  0 500000:1  cerberian_api.cpp:110
2022-04-05 15:43:34+03:00AST  "WCCP: Received Removal Query from router 10.28.0.2 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:43:34+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:43:35+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:43:35+03:00AST  "WCCP: Received Removal Query from router 10.28.0.2 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:44:27+03:00AST  "NTP: Receive timeout, retrying NTP Server: 10.28.38.19"  0 90000:96  ntp.cpp:937

2022-04-05 15:44:49+03:00AST  "WCCP: Router 10.28.0.1 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:44:49+03:00AST  "WCCP: Router 10.28.0.2 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:44:50+03:00AST  "NTP: Receive timeout, retrying NTP Server: 10.28.38.20"  0 90000:96  ntp.cpp:937
2022-04-05 15:44:52+03:00AST  "NTP: Wrong originate timestamp in response received from NTP Server: 10.28.38.20. Sent 3858151490.1132621539, received 3858151489.1064411931"  0 90000:1  ntp.cpp:900
2022-04-05 15:44:52+03:00AST  "NTP: Periodic query of server 10.28.38.20, time within acceptable variance, 0 seconds, 197 ms slow compared to NTP time."  0 90000:96  ntp.cpp:1121
2022-04-05 15:44:52+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:44:52+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:44:52+03:00AST  "Failed trying to download the Categorization subscription file"  0 AF0000:1  loader_impl.cpp:800
2022-04-05 15:44:58+03:00AST  "Access Log FTP (main): 226 Transfer complete."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:45:12+03:00AST  "WCCP: Router 10.28.0.2 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:45:16+03:00AST  "Access Log FTP (main): QUIT"  0 E0000:96  alog_ftp_client.cpp:1633
2022-04-05 15:45:21+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:45:23+03:00AST  "Access Log FTP (main): 221 Goodbye."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:45:41+03:00AST  "Access Log (main): Upload completed successfully.  Maximum bandwidth used was 62.00 KB/sec."  0 E0009:96  alog_manager.cpp:1236
2022-04-05 15:45:41+03:00AST  "Access Log (main): Last remote filename: SG_main__210405124012.log.gz size: 63160 bytes"  0 E0009:96  alog_manager.cpp:1242
2022-04-05 15:45:43+03:00AST  "Download of IWF database failed"  0 500098:1  cfs_admin.cpp:760
2022-04-05 15:45:43+03:00AST  "WCCP: Router 10.28.0.1 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:45:43+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:45:43+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:45:43+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:45:44+03:00AST  "Access Log FTP (main): Connecting to primary 10.28.38.55 server 10.28.38.55:21."  0 E0000:96  alog_ftp_client.cpp:153
2022-04-05 15:45:45+03:00AST  "WCCP: Received Removal Query from router 10.28.0.2 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:45:45+03:00AST  "Snapshot debug-cfssl has fetched /cfSSL/debug"  0 2C0006:96  snapshot_worker.cpp:237
2022-04-05 15:45:47+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:45:48+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:45:48+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:45:58+03:00AST  "Access Log FTP (main): 220 Welcome to the Reporter FTP service."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:45:58+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:45:58+03:00AST  "Access Log FTP (main): USER ftplogs"  0 E0000:96  alog_ftp_client.cpp:466
2022-04-05 15:46:26+03:00AST  "Access Log FTP (main): 331 Please specify the password."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:46:29+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:46:32+03:00AST  "Access Log FTP (main): PASS *****"  0 E0000:96  alog_ftp_client.cpp:613
2022-04-05 15:46:46+03:00AST  "WCCP: Router 10.28.0.1 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:46:50+03:00AST  "Access Log FTP (main): 230 Login successful."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:46:50+03:00AST  "Access Log FTP (main): CWD /Proxy01  "  0 E0000:96  alog_ftp_client.cpp:797
2022-04-05 15:46:59+03:00AST  "WCCP: Router 10.28.0.2 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:46:59+03:00AST  "WCCP: Router 10.28.0.2 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:46:59+03:00AST  "Access Log FTP (main): 250 Directory successfully changed."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:46:59+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:46:59+03:00AST  "Access Log FTP (main): TYPE I"  0 E0000:96  alog_ftp_client.cpp:692
2022-04-05 15:46:59+03:00AST  "WCCP: Router 10.28.0.1 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:46:59+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:46:59+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:46:59+03:00AST  "WCCP: Received Removal Query from router 10.28.0.2 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:46:59+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:46:59+03:00AST  "Access Log FTP (main): 200 Switching to Binary mode."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:46:59+03:00AST  "Access Log FTP (main): PASV"  0 E0000:96  alog_ftp_client.cpp:1042
2022-04-05 15:47:05+03:00AST  "Access Log FTP (main): 227 Entering Passive Mode (10,28,38,55,117,66)."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:47:05+03:00AST  "Access Log (Logrhythm): Unable to connect to remote server for log uploading"  0 E0008:1  alog_facility_impl.cpp:2804
2022-04-05 15:47:05+03:00AST  "Access Log FTP (main): Creating data socket to 10.28.38.55:30018"  0 E0000:96  alog_ftp_client.cpp:1156
2022-04-05 15:48:12+03:00AST  "Access Log FTP (main): STOR SG_main__210405124542.log.gz"  0 E0000:96  alog_ftp_client.cpp:1331
2022-04-05 15:48:12+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:48:12+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:48:12+03:00AST  "Access Log FTP (main): 150 Ok to send data."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:48:28+03:00AST  "WCCP: Router 10.28.0.1 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:48:28+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:48:28+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:48:28+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:48:28+03:00AST  "Failed to get session pointer"  0 300000:1  cfssl_cert.cpp:813
2022-04-05 15:48:51+03:00AST  "WCCP: Router 10.28.0.2 is no longer ACTIVE for service group 99 due to unresponsiveness"  0 1E0000:96  format.cpp:549
2022-04-05 15:49:14+03:00AST  "WCCP: Router 10.28.0.2 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:49:14+03:00AST  "WCCP: Router 10.28.0.2 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:49:29+03:00AST  "WCCP: Service group 99 is now in READY state"  0 1E0000:96  format.cpp:2610
2022-04-05 15:49:34+03:00AST  "WCCP: Router 10.28.0.1 is now ACTIVE for service group 99"  0 1E0000:96  format.cpp:2304
2022-04-05 15:49:34+03:00AST  "WCCP: Received Removal Query from router 10.28.0.2 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:49:34+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:49:34+03:00AST  "WCCP: Received Removal Query from router 10.28.0.1 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:49:34+03:00AST  "Access Log FTP (main): 226 Transfer complete."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:49:34+03:00AST  "Access Log FTP (main): QUIT"  0 E0000:96  alog_ftp_client.cpp:1633
2022-04-05 15:49:34+03:00AST  "WCCP: Received Removal Query from router 10.28.0.2 for service group 99"  0 1E0000:96  format.cpp:1906
2022-04-05 15:49:34+03:00AST  "Access Log FTP (main): 221 Goodbye."  0 E0000:96  alog_ftp_client.cpp:1745
2022-04-05 15:49:34+03:00AST  "Access Log (main): Upload completed successfully.  Maximum bandwidth used was 5.00 KB/sec."  0 E0009:96  alog_manager.cpp:1236

The WCCP router reports "Unresponsiveness" for service group 99. Following this, the WCCP router sends a "Removal Query" to the Proxy. To understand why this is happening, it's important to understand the structure and the workings of the WCCP Service Group.

WCCP Service Group

A service group unites one or more routers/switches with one or more caching devices (ProxySG appliances in this case) in a transparent redirection scheme governed by a common set of rules. The service group members agree on these rules initially by announcing their specific capabilities and configurations to each other in WCCP protocol packets as follows:

1. The ProxySG appliance sends out a “Here I Am” (WCCP2_HERE_I_AM) message to the routers in the group. These messages include a description of the service group that the ProxySG wants to join, including the protocol, ports to redirect, the method to use to forward and return packets to each other, and load balancing instructions.

2. The routers respond with an “I See You” (WCCP2_I_SEE_YOU) message that includes a Receive ID as well as a list of WCCP capabilities - such as forwarding/return methods or load balancing schemes - that the router supports.

3. The ProxySG appliance responds with another “Here I Am” message in which it reflects the Receive ID that was sent in the “I See You” message from the router. In addition, the ProxySG examines the capabilities advertised by the router and, if its configuration specifies a capability that has not been advertised, it will abandon its attempt to join the service group. If the capabilities it is configured to use are advertised, it will select the capabilities it wants to use and will send them back to the router in another “Here I Am” message.


4. The router inspects the capabilities that the ProxySG selected and, if the capabilities are supported, the router accepts the ProxySG as compatible and adds it to the service group. The router responds to all ProxySG appliances that it has accepted with “I See You” messages that include a listing of all ProxySG
appliances in the service group (called the router view).

5. Each ProxySG in the group periodically sends out “Here I Am” messages to the routers in the group to maintain its service group membership. If a router doesn’t receive a “Here I Am” message from a ProxySG in the group within the designated time-out interval, it removes the ProxySG from the service group and sends out an “I See You” with an updated router view.

From what we see in the logs, the WCCP router appears to not have received the “Here I Am” messages from the Proxy, hence, the "Removal Query" it sends to the Proxy. To help us understand why this may have happened, it's important that we view and investigate the "state information" for the affected service group 99.

The ProxySG maintains state information for its configured service groups. The state of a service group helps you monitor whether the service group was configured properly and how it is functioning.

After you install the WCCP configuration, the WCCP routers and ProxySG appliances in the service groups you have defined begin negotiating the capabilities you have configured. As long as the configurations you have defined are correct and all of the routers and ProxySG appliances in the group support the capabilities that have been configured and have the required network connectivity, the service group will form and the router will begin redirecting traffic to the ProxySG appliances in the service group.

You can monitor statistics about the service groups you have configured on a given ProxySG from the Management Console or from the CLI. For guidance on how you may access the Service Group Statistics from the ProxySG management console, please refer to the guidance shown in the snippet below.

https://api-broadcom-ca.wolkenservicedesk.com/attachment/get_attachment_content?uniqueFileId=kv52NUMizrlG+nuN9beoXw==

https://api-broadcom-ca.wolkenservicedesk.com/attachment/get_attachment_content?uniqueFileId=CO7VkH3SGWEpwhDzANmXQA==

Please collect and share the results, from the above to the ticket, for further analysis. Also, please collect the PCAP for web traffic reaching the WCCP router, from the client, through the ProxySG. For the guidance on how to capture and display WCCP negotiation packets, please refer to the guidance provided in the Tech. Article with the URL below.

https://knowledge.broadcom.com/external/article/166315/how-to-capture-and-display-wccp-negotiat.html

For the additionally required troubleshooting data, please refer to the guidance detailed in the Tech. Article with the URL below.

https://knowledge.broadcom.com/external/article/166358/collect-troubleshooting-data-for-an-issu.html 

Additional doc.: https://knowledge.broadcom.com/external/article/170571/wccp-service-group-states.html

 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------

From the WCCP statistics/data provided and partly shown in the snippet below, we see absolutely no issues with the ProxySG. We see that the Proxy is "Ready" to negotiate and there is a smooth exchange of the "HERE I AM" & "I SEE YOU" messages between the ProxySG and the WCCP router. We stated that these messages are literally packets being exchanged and they have very unique components/parameters that must be present and negotiable.

The “Here I Am” (WCCP2_HERE_I_AM) messages include a description of the service group that the ProxySG wants to join, including the protocol, ports to redirect, and the method to use to forward and return packets to each other, and load balancing instructions. All of the components/parameters herein are configurable and it's only when these configurations are in place that the ProxySG will be in the "READY" state. The ProxySG is in the READY state and we do not see a configuration issue.

The “I See You” (WCCP2_I_SEE_YOU) message includes a Receive ID as well as a list of WCCP capabilities - such as forwarding/return methods or load balancing schemes - that the router supports. Where any of these parameters are missing, the connection could drop and the link may go down, even temporarily.

We note that the ProxySG and the WCCP router will continue to exchange these messages, to keep their communication alive. In this case, it's the WCCP: Router x.x.x.x that was no longer ACTIVE for service group 99 due to unresponsiveness". To determine exactly why the Router was unresponsive to WCCP packets from the ProxySG, A PCAP would always be required, for the WCCP packets sent from the ProxySG to the WCCP router.

With a clear understanding that the connection/link flap happened only temporarily, for minutes, on the same day the issue was observed and that the connection got restored without any change made on the ProxySG. Please note that this clearly pointed to a convergence issue in the network, at the time of the flap and never an issue with the Proxy. We recommend always having the PCAP collected from the Proxy, should the connection flap recur anytime.

On a query to know if this could have been a possible bug on the Proxy, our response remains a "No". The Proxy would not self-heal, where there is a bug. A bug is simply a broken code behind a specific feature on the product. Until the code is fixed, the symptoms of a bug would remain. This was clearly a network flap caused by temporary convergence issues. As soon as the network self-healed, the connection got restored. 

Attachments