We have been getting 401 authentication alarms and many other types of alarms and errors since 7/27/2021. Hundreds of customer URLs are monitored by the url_response probe, and monitoring had been functioning fine for quite some time.
Then, after July 27, 2021, alarms and errors started occurring for many, but not all, URLs.
Downgrading to url_response 4.43 quieted one instance of the probe and the URLs it monitors, but in other instances the customer still saw alarms being generated and errors in the log.
We tested the main URL, the URL without the port, and also ran some network tests. In some cases, even when ping, nslookup and tracert gave us good responses, we either could not reach the web server/website, or we could reach it but the connection timed out or returned an error (Application error). A telnet to the hostname at the monitored port also failed.
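The telnet test described above can be repeated for many targets with a small script. The following is a minimal sketch, not part of the probe; the function name `can_connect` and the default timeout are assumptions chosen for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This only proves that the port accepts connections (what a telnet test
    shows); it says nothing about the HTTP response the server would send.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (hypothetical host and port):
#   can_connect("www.example.com", 443)
```

A False result when ping and nslookup succeed is consistent with something between the robot and the web server blocking that port, but it does not by itself say where the block is.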
- The url_response probe, when deactivated, takes up to 15 minutes to release its port.
- Log viewer crashed frequently.
- Possible network routing, security software, or anti-virus interference
- Possible probe overload due to the number of profiles (422) monitored at 5-minute intervals
- Configuration issues
- url_response.cfg corruption
Here is a summary of tasks that should be implemented:
url_response probe troubleshooting
1. Test URLs
- Find the original list of URLs provided by the customer and test each URL from the client robot to the target. In every case we tested, we either received an error (e.g., a 500 error) or got no response at all from the web server/website.
In some cases, network traffic may have been rerouted or blocked (e.g., HTTP/HTTPS responses). The customer's network team needs to do a full investigation, which should include examining the firewall logs while the probe is running through its monitoring interval.
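Testing hundreds of URLs by hand is slow, so a script run on the robot can help. The sketch below uses only the Python standard library and is independent of the probe. The function name `check_url`, the `opener` parameter, and the returned tuple layout are assumptions made for this example.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (url, status, detail).

    status is the HTTP status code, or None when no HTTP response was
    received at all (DNS failure, refused connection, timeout, bad URL).
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return (url, resp.status, "ok")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 401 or 500.
        return (url, exc.code, str(exc.reason))
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # No usable response; ValueError covers URLs with no http:// or https://.
        return (url, None, str(exc))


# Example call (hypothetical URL):
#   print(check_url("https://www.example.com/health"))
```

Separating "answered with an error code" from "no response at all" matters here: the first points toward the application or authentication, the second toward the network path.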
2. Web Server Application Errors
- Identify and contact the application owner(s) so they can correct any application errors returned by the server, e.g., "Server error in '/' Application" (see example included in the screenshot).
- In some cases, even when ping, nslookup and tracert gave us good responses, we could not reach the web server/website, or we could reach it but the connection timed out or returned an error (application error).
- A telnet to the hostname at the monitored port should also succeed.
- Try running another instance of the url_response probe from a robot machine where most, if not all, of those websites can be opened in a browser without the unexpected errors you are currently seeing.
- If there are hundreds of profiles and the issues only started recently, split the profiles evenly across two instances of the url_response probe on two robots.
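To judge whether splitting the profiles is worthwhile, a rough time budget can be worked out. The sketch below assumes, purely for illustration, that one probe instance runs its checks one after another. How url_response actually schedules its profiles is not stated here, so treat the numbers as a worst-case estimate. The function names are hypothetical.

```python
def split_profiles(profiles, instances=2):
    """Distribute profile names round-robin into `instances` groups of near-equal size."""
    return [profiles[i::instances] for i in range(instances)]


def seconds_per_profile(profile_count, interval_seconds):
    """Average time available per check if all checks ran sequentially in one interval."""
    return interval_seconds / profile_count


# Placeholder names standing in for the 422 real profiles.
profiles = [f"profile_{n}" for n in range(422)]
groups = split_profiles(profiles, instances=2)

print(len(groups[0]), len(groups[1]))                    # 211 211
print(round(seconds_per_profile(422, 5 * 60), 2))        # 0.71
print(round(seconds_per_profile(211, 5 * 60), 2))        # 1.42
```

Under that assumption, 422 profiles at a 5-minute interval leave about 0.71 seconds per check, and splitting across two instances doubles that to about 1.42 seconds.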
5. url_response.cfg corruption
- Remove/replace any corrupt/bogus values for qos_source, e.g., where qos_source = @%&*-@%&*
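With hundreds of profiles, a script can help locate suspect qos_source values. The sketch below assumes the file stores settings as `key = value` lines and that a sane qos_source contains only letters, digits, and the characters `. _ : / -`. Both are assumptions for this example, so adjust the pattern to whatever values are legitimate in your environment.

```python
import re

# Assumption: characters expected in a legitimate qos_source value.
SANE_VALUE = re.compile(r"^[A-Za-z0-9._:/-]*$")


def find_suspect_qos_sources(cfg_text):
    """Return (line_number, value) for each qos_source with unexpected characters."""
    suspects = []
    for lineno, line in enumerate(cfg_text.splitlines(), start=1):
        key, sep, value = line.partition("=")
        if sep and key.strip() == "qos_source":
            value = value.strip()
            if not SANE_VALUE.match(value):
                suspects.append((lineno, value))
    return suspects


# Example use (read-only; work on a backup copy of the file):
#   with open("url_response.cfg", encoding="utf-8", errors="replace") as fh:
#       for lineno, value in find_suspect_qos_sources(fh.read()):
#           print(lineno, value)
```

The script only reports line numbers and values; any correction should still be made deliberately, with the probe deactivated and a backup of the original file kept.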
6. Illegal URL errors
- These errors occur because the URL is specified without http:// or https://. Adding the scheme to the URL stops the errors, e.g.,
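A quick way to find the affected entries in a long URL list is a prefix check. This is a minimal sketch; the function name and the sample URLs are made up for illustration.

```python
def missing_scheme(urls):
    """Return the URLs that do not start with http:// or https:// (case-insensitive)."""
    return [
        u for u in urls
        if not u.strip().lower().startswith(("http://", "https://"))
    ]


# Hypothetical sample list.
sample_urls = [
    "https://a.example.com",
    "b.example.com:8080/health",
    "HTTP://c.example.com",
]
print(missing_scheme(sample_urls))  # ['b.example.com:8080/health']
```

A plain prefix check is used deliberately: a generic URL parser can mistake the hostname in `host:port/path` for a scheme.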
7. DNS Resolution
- Many hostnames cannot be resolved, so first make sure you can reach them from a local browser on the robot where url_response is deployed.
- You may also want to test resolution via nslookup <hostname>.
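To check many hostnames at once from the robot, name resolution can be scripted. The sketch below uses the operating system's resolver, which also honors the local hosts file, so its result can differ from nslookup, which queries the DNS server directly. The function name `resolve` is an assumption for this example.

```python
import socket


def resolve(hostname):
    """Return the sorted unique IP addresses for hostname, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except (OSError, UnicodeError):
        # socket.gaierror is a subclass of OSError; UnicodeError covers malformed names.
        return []
    return sorted({info[4][0] for info in infos})


# Example call (hypothetical hostname):
#   print(resolve("www.example.com"))
```

An empty list for a hostname that nslookup can resolve would suggest a difference between the robot's resolver configuration and the DNS server being queried.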
Research possible network changes: HTTP/HTTPS response traffic being blocked, network routing, security software or anti-virus interference, or an intermediate firewall.