url_response probe codes and errors for many urls including 401 authentication

Products

DX Unified Infrastructure Management (Nimsoft / UIM), Unified Infrastructure Management for Mainframe, CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

We are getting 401 auth alarms and many other types of alarms/errors since 7/27/2021. Hundreds of customer urls are being monitored by the url_response probe and had been functioning fine for quite some time.

Then after July 27, 2021 many alarms/errors started occurring for many but not all urls.

Downgrading to url_response 4.43 quieted down one instance of the probe and the urls it monitors, but in other instances the customer still saw alarms being generated and errors in the log.

We tested the main url, the url without the port, and also ran some network tests. In some cases, even when ping, nslookup and tracert returned good responses, we either could not reach the web server/website, or we could reach it but the connection timed out or returned an error (Application error). Telnet to the hostname on the monitored port also failed.

Other observations:

- The url_response probe when deactivated takes up to 15 minutes to let go of its port.

- Log viewer crashed frequently.

Here are some of the errors that the customer was seeing frequently:

12007 The server name could not be resolved.
============================================
Internet (error) code:
12007 Name Not Resolved
The server name could not be resolved.

35 SSL connect error
====================
curl error code
CURLE_SSL_CONNECT_ERROR (35)
A problem occurred somewhere in the SSL/TLS handshake. You really want the error buffer and read the message there as it pinpoints the problem slightly more. Could be certificates (file formats, paths, permissions), passwords, and others.

403 The server understood the request, but is refusing to fulfill it.
=====================================================
HTTP Error code 403
The HTTP 403 is an HTTP status code meaning access to the requested resource is forbidden. The server understood the request, but will not fulfill it.

The 403 Forbidden Error happens when the web page (or other resource) that you're trying to open in your web browser is a resource that you're not allowed to access. ... The second reason is that the owners of the web server have improperly set up permissions and you're getting denied access when you really shouldn't be.

12031 Unknown error: 12031
==========================
Internet code:
12031 Connection Reset
The connection with the server has been reset.

12031 indicates that the connection to the server was reset or is not working properly, often because of a poor network connection. The error may also be caused by invalid registry entries, outdated drivers, or a firewall issue.

1200nnn with different digits:
==============================
See more Internet error codes here:
https://docs.oracle.com/cd/E25293_01/doc.910/e15484/oltappxa.htm
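
When many different codes show up in the log at once, it can help to group them by the area they point to before troubleshooting. The following is only a minimal sketch: the codes and descriptions are the ones listed above, while the area labels and the `triage` helper are illustrative assumptions and not part of the probe.

```python
# Codes and descriptions come from the list above. The "area" labels are an
# illustrative grouping (an assumption), not something the probe reports.
ERROR_HINTS = {
    "12007": ("DNS", "The server name could not be resolved."),
    "35": ("TLS", "SSL connect error during the SSL/TLS handshake."),
    "403": ("HTTP", "Access to the requested resource is forbidden."),
    "12031": ("Connection", "The connection with the server has been reset."),
}


def triage(code):
    """Return (area, description) for a known code, else ("Unknown", "")."""
    return ERROR_HINTS.get(str(code).strip(), ("Unknown", ""))


if __name__ == "__main__":
    for code in (12007, 35, 403, 12031, 99999):
        area, description = triage(code)
        print(code, area, description)
```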

Environment

Release: DX UIM 20.4 or higher

Component: UIM - URL_RESPONSE

- url_response v4.46/4.47

- robot v9.32

- hub v9.31

Cause

- Possible network routing, security software, or anti-virus interference
- Possible probe overload due to the number of profiles (422) monitored at 5-minute intervals
- Configuration issues
- url_response.cfg corruption

Resolution

Here is a summary of tasks that should be implemented:

url_response probe troubleshooting

1. Test URLs
- Find the original list of urls provided by the customer and test each url from the client robot to the target. In every case we tested, we either received an error from the server (e.g., 500 errors) or got no response at all from the web server/website.

Some of the network traffic may have been rerouted or blocked (e.g., HTTP/HTTPS responses). The customer's network team needs to do a full investigation, which should include examining the firewall logs while the probe is running through its monitoring interval.

2. Web Server Application Errors
- Research and contact the Application owner(s) and correct any application errors received from the server, e.g., "Server error in '/' Application" (see example included in the screen shot)

3. Connectivity
- In some cases, even when ping, nslookup and tracert gave us good responses, we could not reach the web server/website and/or we could reach it but the connection timed out or errored (application error).

- Telnet to the hostname on the monitored port should also succeed.
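
Where telnet is not installed on the robot, the same port check can be approximated with a short script. This is a minimal sketch using only the Python standard library; the function name `can_connect` and the host and port in the usage line are placeholders, not values from this case.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder target: replace with the hostname and port of a monitored url.
    print(can_connect("staging.example.com", 443))
```

A `False` result here, while ping and nslookup succeed, points at something between the robot and the web server port (firewall, proxy, security software) rather than at the probe.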

4. Scalability

- Try running another instance of the url_response probe from a robot machine that allows you to access most, if not all, of those websites from a browser without the unexpected errors you are currently seeing.

- If there are hundreds of profiles and the issues started only recently, split the profiles evenly across 2 instances of the url_response probe on 2 robots.
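
Planning the split can be done offline from a list of profile names. The sketch below assumes you have already exported the profile names to a Python list; `split_profiles` is a hypothetical helper, and moving the profiles between probe configurations is still a manual step.

```python
def split_profiles(profile_names, instances=2):
    """Distribute profile names round-robin across the given number of probe instances."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return [profile_names[i::instances] for i in range(instances)]


if __name__ == "__main__":
    # Placeholder names standing in for the 422 profiles mentioned above.
    names = [f"profile_{n}" for n in range(422)]
    first, second = split_profiles(names, instances=2)
    print(len(first), len(second))
```

Round-robin assignment keeps the two halves the same size (or within one profile of each other) regardless of how the original list is ordered.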

5. url_response.cfg corruption
- Remove/replace any corrupt/bogus values for qos_source, e.g., where qos_source = @%&*-@%&*
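
With hundreds of profiles, scanning the file by eye is error-prone. The following sketch flags suspicious qos_source lines in a copy of url_response.cfg. It assumes the file stores the setting as `qos_source = value` on one line, and that a plausible value is empty or contains only letters, digits, dots, underscores, colons, slashes and hyphens; both are assumptions to adjust for your environment.

```python
import re

# Assumption: a plausible qos_source is empty or built from these characters.
PLAUSIBLE_VALUE = re.compile(r"^[A-Za-z0-9._:/-]*$")
# Assumption: the setting appears as "qos_source = value" on a single line.
QOS_SOURCE_LINE = re.compile(r"^\s*qos_source\s*=\s*(.*?)\s*$")


def find_suspect_qos_sources(cfg_text):
    """Return (line_number, value) pairs for qos_source values that look corrupt."""
    suspects = []
    for line_no, line in enumerate(cfg_text.splitlines(), start=1):
        match = QOS_SOURCE_LINE.match(line)
        if match and not PLAUSIBLE_VALUE.match(match.group(1)):
            suspects.append((line_no, match.group(1)))
    return suspects


if __name__ == "__main__":
    # Work on a copy of the file, not the live configuration.
    with open("url_response.cfg", encoding="utf-8", errors="replace") as handle:
        for line_no, value in find_suspect_qos_sources(handle.read()):
            print(f"line {line_no}: qos_source = {value}")
```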

6. Illegal URL errors
- These errors occur because the url is specified without http:// or https://. Adding the scheme to the url stops the errors, e.g.,

Change->

staging.xxxxxxxxxexample.com

to

http://staging.xxxxxxxxxexample.com
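
If the customer's url list is long, the missing schemes can be added in bulk before the urls are entered into the profiles. This is a minimal sketch; `ensure_scheme` is a hypothetical helper, and defaulting to http (rather than https) is an assumption that should be checked per site.

```python
import re

# Matches a leading scheme such as "http://" or "https://".
_HAS_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def ensure_scheme(url, default_scheme="http"):
    """Return the url unchanged if it has a scheme, else prefix the default scheme."""
    url = url.strip()
    if _HAS_SCHEME.match(url):
        return url
    return f"{default_scheme}://{url}"


if __name__ == "__main__":
    print(ensure_scheme("staging.xxxxxxxxxexample.com"))
```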

7. DNS Resolution
- Many hostnames cannot be resolved, so first make sure you can reach them from the local browser on the robot where url_response is deployed.

- You may also want to test with nslookup <hostname>.
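
For a large list of hostnames, the nslookup check can be scripted and run on the robot itself. The sketch below uses the operating system resolver through Python's standard library, so it reflects what the robot machine can resolve; `resolve_host` and the hostname in the usage line are placeholders.

```python
import socket


def resolve_host(hostname):
    """Return the sorted unique IP addresses for hostname, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Placeholder hostname: replace with the hostnames from the monitored urls.
    for name in ["staging.example.com"]:
        addresses = resolve_host(name)
        print(name, addresses if addresses else "NOT RESOLVED")
```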

Research possible network changes, such as HTTP/HTTPS response traffic being blocked, network routing changes, security software or anti-virus interference, or an intermediate firewall.

Additional Information

There may still be another issue with monitoring, the network, or security/filtering behind the error responses, because the sheer number and variety of errors received from the websites is extremely unusual. Since we received the same or similar error responses when hitting those urls outside of the probe/product, some other factor is involved.

Another option is to try running the url_response probe from a different machine that allows you to access most if not all of those websites from a browser without any unexpected errors.

Note that currently, the url_response probe has no feature that allows you to exclude specific errors from generating alarms, other than deactivating one or more of the url_response profiles.

You can still use a nas preprocessing rule to exclude (delete) any alarms you don't care about or want to see, and only send an email or take some other desired action when the error you're interested in (e.g., error 28) occurs. For that, you can use a nas Auto Operator profile and a message filter that uses an AND-style regex like:

/(.*failed.*)(.*Timeout was reached.*)/
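
Before putting the filter into nas, it can be worth checking offline which messages it would catch. The sketch below tries the same pattern with Python's `re` module, on the assumption that nas evaluates it in a comparable way; the sample alarm messages are invented for illustration and are not actual probe output.

```python
import re

# Same pattern as above, without the surrounding / delimiters.
ALARM_FILTER = re.compile(r"(.*failed.*)(.*Timeout was reached.*)")


def is_timeout_failure(message):
    """Return True if the message contains 'failed' followed later by 'Timeout was reached'."""
    return ALARM_FILTER.search(message) is not None


if __name__ == "__main__":
    # Hypothetical alarm texts, for illustration only.
    samples = [
        "URL check failed: (28) Timeout was reached",
        "URL check failed: (35) SSL connect error",
        "Timeout was reached before the check failed",
    ]
    for text in samples:
        print(is_timeout_failure(text), text)
```

Note that in this form the pattern is order-sensitive: "failed" must appear before "Timeout was reached" in the message for it to match.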

url_response probe troubleshooting

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/ca-unified-infrastructure-management-probes/GA/monitoring/systems-and-service-response/url-response-url-endpoint-response-monitoring/url-response-troubleshooting.html