How to monitor a server or UIM robot that seems to be in a hung state


Article ID: 117683


Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

We have implemented the net_connect probe to monitor server availability, but the server appears to be in a hung state: ping succeeds, yet the server is unreachable or not functioning as expected.

Environment

  • DX UIM, Any version
  • robot, any version
  • net_connect

Resolution

The term "hung state" is too vague on its own. It can mean many different things, each with a different possible solution.

For example, symptoms may include but are not limited to the following:

  • A single application on the system is not responding
  • You cannot log in remotely with RDP, yet all other functions are working
  • All web services are down but other services, e.g., print services, are working
  • A UIM robot is installed for monitoring, but nothing is being written to controller.log
  • Disk space is very low or disk/drive or file system may be full

In cases like these, an application or service is in a bad state, but the OS is still functioning and responding to ping requests.

There is no single probe that can handle all of these scenarios and send an alarm for them.

First, you will need to define what constitutes a 'hung state' and then see what can be done to monitor for that condition.

You may choose one or more options to monitor a server/robot/system that appears to be hung and not functional.

  • net_connect probe

    • If the application or server is not 'responding,' use the net_connect probe to test availability by checking a service that is expected to be running on its default port, e.g., RDP port 3389, WMI port 135, SSH port 22 (Linux/Unix), etc.

    • Note that if a robot is installed and the UIM robot is in a hung state, the spooler that listens on port 48001 will not be responsive. You can use the net_connect probe on a hub to check the robot's spooler on its default port 48001 and send an alarm if the service is unresponsive.

  • sql_response (or sqlserver) probe

    • If the application should be inserting data into a database, create an SQL query or checkpoint that checks for new records and raises an alarm if no new rows are being added.

  • dirscan probe
    • If the application writes to a log file, set up a remote dirscan to monitor the log files for activity.

  • e2e_appmon probe

    • If the application or server is not 'responding,' monitor it with the e2e_appmon probe, using a simple script that exercises the application.
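
The port-based availability test described under the net_connect probe can be sketched outside UIM to show why it catches failures that ping misses: a TCP connection attempt only succeeds if the service itself accepts connections. The following is a minimal Python sketch, not the probe's implementation. The function names `tcp_port_open` and `check_services` and the 3-second timeout are assumptions chosen for illustration.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Illustrative helper (not part of UIM): a host can answer ping while
    this returns False because the service behind the port is hung or down.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(host, ports, timeout=3.0):
    """Check several named ports on one host.

    `ports` maps a label to a port number, e.g. {"rdp": 3389}.
    Returns a dict mapping each label to True (responding) or False.
    """
    return {name: tcp_port_open(host, port, timeout)
            for name, port in ports.items()}
```

For example, `check_services("server01.example.com", {"ssh": 22, "rdp": 3389, "uim_spooler": 48001})` would report each service separately; the host name here is a placeholder. A result of False for port 48001 while ping still succeeds would match the hung-robot case described above.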

Once you have clearly defined what a 'hung state' means in your case, and how you can most effectively check for and alert on it, a suitable solution can be identified.
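
As one illustration of turning such a definition into a concrete check, suppose 'hung' is defined as "the application has not written to its log file for N seconds," in the spirit of the dirscan suggestion above. The sketch below is plain Python rather than a UIM probe configuration. The function names, the example path, and the threshold are hypothetical.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    current = time.time() if now is None else now
    return current - os.path.getmtime(path)


def log_is_stale(path, max_idle_seconds, now=None):
    """Return True if the log looks inactive.

    A file counts as stale when it has not been modified within
    `max_idle_seconds`, or when it cannot be read at all (a missing
    log is treated as a failure here, which is an assumption).
    """
    try:
        return seconds_since_modified(path, now) > max_idle_seconds
    except OSError:
        return True
```

A call such as `log_is_stale("/var/log/myapp/app.log", 600)` would flag an application that has been silent for more than ten minutes. Both the path and the threshold are placeholders to adapt to how often the application normally logs.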