Since July 27, 2021, we have been getting 401 auth alarms and many other types of alarms/errors. Hundreds of customer URLs are monitored by the url_response probe, and monitoring had been working fine for quite some time.
After that date, alarms/errors started occurring for many, but not all, URLs.
Downgrading to url_response 4.43 quieted one instance of the probe and its URLs, but in other instances the customer still saw alarms being generated and errors in the log.
We tested the main URL and the URL without the port, and also ran some network tests. In some cases, even when ping, nslookup and tracert gave good responses, we either could not reach the web server/website, or we could reach it but the connection timed out or returned an error (Application error). Telnet to the hostname on the monitored port also failed.
Other observations:
- The url_response probe when deactivated takes up to 15 minutes to let go of its port.
- Log viewer crashed frequently.
Possible causes:
- Network routing, security software, or anti-virus interference
- Probe overload due to the number of profiles (422) monitored at 5-minute intervals
- Configuration issues
- url_response.cfg corruption
Here is a summary of tasks that should be implemented:
url_response probe troubleshooting
1. Test URLs
- Find the original list of URLs provided by the customer and test each URL from the client robot to the target. In every case we tested, we either got no response from the web server/website or received an error such as a 500 response.
- In some cases, network traffic may have been rerouted or blocked (e.g., HTTP/HTTPS responses). The customer's network team needs to do a full investigation, which should include examining the firewall logs while the probe is running through its monitoring interval.
2. Web Server Application Errors
- Identify and contact the application owner(s) to correct any application errors returned by the server, e.g., "Server error in '/' Application" (see example included in the screen shot)
3. Connectivity
- In some cases, even when ping, nslookup and tracert gave us good responses, we could not reach the web server/website and/or we could reach it but the connection timed out or errored (application error).
- Telnet to the hostname on the monitored port must also succeed.
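The telnet test above can be scripted so that many host/port pairs are checked from the robot in one pass. The following is a minimal sketch using only the Python standard library; the function name `port_is_open` and the 5-second default timeout are illustrative assumptions and are not part of the url_response probe.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This is roughly what "telnet <host> <port>" verifies: that the TCP
    handshake completes. It says nothing about the HTTP response itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call (hypothetical hostname, replace with a monitored target):
# port_is_open("staging.example.com", 443)
```

A result of False while ping and nslookup succeed points toward a firewall, routing, or listener problem on that specific port.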
4. Scalability
- Try running another instance of the url_response probe from a robot machine that can access most, if not all, of those websites from a browser without unexpected errors such as the ones you are currently seeing.
- If there are hundreds of profiles and the issues started only recently, split the profiles evenly across two instances of the url_response probe on two robots.
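To plan the split, it can help to see how the profiles would divide and what the average check rate per instance would be. This is a rough sketch only; the profile names are hypothetical placeholders, and the helper functions are not part of the probe.

```python
def split_profiles(profiles, instances=2):
    """Distribute profile names round-robin across the given number of instances."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    buckets = [[] for _ in range(instances)]
    for index, name in enumerate(profiles):
        buckets[index % instances].append(name)
    return buckets


def average_checks_per_second(profile_count, interval_seconds):
    """Average number of URL checks per second if checks are spread evenly."""
    return profile_count / interval_seconds


# Hypothetical profile names standing in for the customer's 422 profiles.
profiles = [f"profile_{n:03d}" for n in range(1, 423)]
first, second = split_profiles(profiles, instances=2)
```

With 422 profiles at a 5-minute (300-second) interval, one instance averages about 1.4 checks per second; split across two instances, each handles 211 profiles, or about 0.7 checks per second.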
5. url_response.cfg corruption
- Remove/replace any corrupt/bogus values for qos_source, e.g., where qos_source = @%&*-@%&*
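A quick scan can help locate suspect qos_source values before editing the file by hand. The sketch below assumes the configuration stores the setting on lines of the form `qos_source = value`, and it treats a non-empty value with no letters or digits (like the example above) as suspicious. Both the line format and that rule are assumptions for illustration, so review each hit manually.

```python
import re

# Assumed line format: optional indentation, "qos_source", "=", then the value.
QOS_LINE = re.compile(r"^\s*qos_source\s*=\s*(.*?)\s*$")


def suspicious_qos_sources(cfg_text):
    """Return (line_number, value) pairs whose qos_source has no letter or digit."""
    hits = []
    for lineno, line in enumerate(cfg_text.splitlines(), start=1):
        match = QOS_LINE.match(line)
        if not match:
            continue
        value = match.group(1)
        # Empty values are skipped; only non-empty, symbol-only values are flagged.
        if value and not re.search(r"[A-Za-z0-9]", value):
            hits.append((lineno, value))
    return hits
```

Take a backup copy of url_response.cfg before changing any flagged values.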
6. Illegal URL errors
- These errors occur because the URL is specified without http:// or https://. Adding the scheme to the URL stops the errors, e.g.,
Change->
staging.xxxxxxxxxexample.com
to
http://staging.xxxxxxxxxexample.com
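When many profile URLs need this correction, the change can be applied to the list of URLs in bulk before re-entering them in the probe. A minimal sketch follows; the helper name `ensure_scheme` and the default of http are assumptions, so use https where the site requires it.

```python
def ensure_scheme(url, default_scheme="http"):
    """Prefix the URL with a scheme if it does not already start with one."""
    url = url.strip()
    if url.lower().startswith(("http://", "https://")):
        return url
    return f"{default_scheme}://{url}"


fixed = ensure_scheme("staging.xxxxxxxxxexample.com")
```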
7. DNS Resolution
- Many hostnames cannot be resolved, so first make sure you can reach them from a browser on the robot where url_response is deployed.
- You may also want to test name resolution with nslookup <hostname>.
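For a long list of hostnames, the nslookup check can be approximated with a short script run on the robot. This is a sketch using the Python standard library resolver, which uses the operating system's name resolution and may therefore differ slightly from nslookup; the function name `resolve` is illustrative.

```python
import socket


def resolve(hostname):
    """Return the sorted unique IP addresses for hostname, or [] if unresolvable."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Name resolution failed (unknown host, DNS unreachable, and so on).
        return []
    return sorted({info[4][0] for info in infos})


# Example call (hypothetical hostname, replace with a monitored target):
# resolve("staging.example.com")
```

An empty list for a hostname that resolves from other machines suggests a DNS or hosts-file difference on the robot itself.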
Research possible network changes: HTTP/HTTPS response traffic being blocked, network routing changes, security software or anti-virus interference, or an intermediate firewall.
There may still be some other issue with monitoring, the network, or security/filtering behind the error responses, because the sheer number and variety of errors being received from the websites is extremely unusual. Since we received the same or similar error responses when hitting those URLs outside of the probe/product, some other factor is involved.
Another option is to try running the url_response probe from a different machine that allows you to access most if not all of those websites from a browser without any unexpected errors.
Note that the url_response probe currently has no feature for excluding specific errors from generating alarms, other than deactivating one or more of the url_response profiles.
You can still use a nas preprocessing rule to exclude (delete) any alarms you don't want to see, and send an email or take some other desired action only when the error 28 you're interested in occurs. For that, you can use a nas Auto Operator profile and a message filter that uses an AND operator REGEX like:
/(.*failed.*)(.*Timeout was reached.*)/
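Before putting the filter into production, the pattern can be tried against sample alarm text. The sketch below uses Python's `re` module purely to illustrate how the expression behaves; nas's own regex handling may differ, and the sample messages are hypothetical, not actual probe output.

```python
import re

# The expression from the message filter, without the surrounding slashes.
PATTERN = re.compile(r"(.*failed.*)(.*Timeout was reached.*)")


def matches_filter(message):
    """Return True if the message contains 'failed' followed later by 'Timeout was reached'."""
    return PATTERN.search(message) is not None


# Hypothetical alarm messages for illustration only.
timeout_alarm = "URL check for profile 'example' failed. (28) Timeout was reached"
other_alarm = "URL check for profile 'example' failed. HTTP response code 500"
```

Note that, as written, the expression requires "failed" to appear before "Timeout was reached" in the message, so it behaves as an ordered AND.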