The term "hung state" is too vague on its own. It can describe many different conditions, each with a different possible solution.
Symptoms may include, but are not limited to, the following:
- A single application on the system is not responding
- You cannot log in remotely with RDP, yet all other functions are working.
- All web services are down but other services are working, e.g., print services
- A UIM robot is installed for monitoring, but nothing is being written to controller.log
- Disk space is very low, or a disk, drive, or file system is full
In many of these cases, an application is in a bad state but the OS is still functioning and responding to ping requests.
There is no single probe that can handle all of these scenarios and send an alarm for them.
First, you will need to define what constitutes a 'hung state' and then see what can be done to monitor for that condition.
You may choose one or more of the following options to monitor a server, robot, or system that appears to be hung and not functional:
- net_connect probe
  - If the application or server is not 'responding,' use the net_connect probe to test availability by monitoring a service that is expected to be running on its default port, e.g., RDP port 3389, WMI port 135, or SSH port 22 (Linux/Unix).
  - If a UIM robot is installed and is in a hung state, the spooler that listens on port 48001 will not be responsive. You can therefore use the net_connect probe on a hub to check the robot on its default port 48001 and send an alert if the service is unresponsive.
- sql_response (or sqlserver) probe
  - If the application should be inserting data into a database, create an SQL query or checkpoint that checks for newly added records/rows and raises an alarm if none are being added.
- dirscan probe
  - If the application writes to a log file, set up a remote dirscan to monitor the log files for activity.
- e2e_appmon probe
  - If the application or server is not 'responding,' monitor it with the e2e_appmon probe using a simple script.
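Two of the checks above, testing a TCP port and watching a log file for activity, are simple enough to sketch outside UIM when you are deciding what 'hung' means for your system. The Python sketch below is not how the net_connect or dirscan probes are implemented; it only illustrates the kind of test they perform. The function names, the host name, and the log path are hypothetical placeholders.

```python
import os
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def log_is_stale(path: str, max_age_seconds: float) -> bool:
    """Return True if the file is missing or has not been modified recently."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable log is treated as stale.
        return True
    return age > max_age_seconds


# Example usage (hypothetical host and log path; replace with your own):
#   port_is_open("server01.example.com", 3389)    # RDP
#   port_is_open("server01.example.com", 48001)   # UIM spooler
#   log_is_stale("/path/to/controller.log", 600)  # no writes for 10 minutes
```

A successful TCP connection only shows that something is accepting connections on the port, so pairing it with a log-activity check gives a better indication that the application is actually doing work.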
Once you have clearly defined what a 'hung state' means in your case and how you can most effectively check for and alert on it, you can choose the monitoring solution that fits.