Server has been down for hours but it's online and connected in UIM

Article ID: 256470

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

The server stopped responding at 12/15/2022 4:06:57 AM (UTC-05:00), yet it still shows as online in UIM.
A telnet connection to port 48000 on the server succeeds.
What are the steps for locating the cause of the issue?
ntservice monitors are also configured for the server, and these may have failed as well.

Environment

Release: 20.4 or higher

Resolution

The issue was first noticed when the machine could not be reached via RDP.
A test alarm sent from the controller reaches the hub without issue, so there may have been a transient network connectivity/communication problem for some time overnight.
The best-practice setting of first_probe_port = 48000 was not in place, so we added it to robot.cfg. Note that 48000 IS the default when a robot is installed, so someone must have removed this parameter manually.
Most likely this robot will not have the problem again, but if an unexpected loss of connectivity or communication recurs, self-monitoring can be put in place to help troubleshoot it.
If the robot (controller) was completely hung (not writing to its logs or the filesystem, etc.), that may explain why no Robot inactive alarm was received.
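
To confirm whether first_probe_port is present in a robot.cfg, a small script can read the file and report the value. The following is a minimal sketch, not an official tool: it assumes the parameter lives in a flat <controller> section written as key = value lines, the function name read_controller_key is hypothetical, and the sample values (ExampleDomain, ExampleHub) are placeholders.

```python
def read_controller_key(cfg_text, key):
    """Return the value of `key` inside the <controller> section, or None.

    Assumption: the section is flat (no nested subsections) and each
    setting is written as `name = value` on its own line.
    """
    in_controller = False
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line == "<controller>":
            in_controller = True
        elif line == "</controller>":
            in_controller = False
        elif in_controller and "=" in line:
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
    return None


# Hypothetical sample content; real files contain many more keys.
SAMPLE = """\
<controller>
   domain = ExampleDomain
   hub = ExampleHub
   first_probe_port = 48000
</controller>
"""

if __name__ == "__main__":
    value = read_controller_key(SAMPLE, "first_probe_port")
    if value is None:
        print("first_probe_port is not set in the <controller> section")
    else:
        print(f"first_probe_port = {value}")
```

To check a real robot, read the contents of its robot.cfg into a string and pass that string in place of SAMPLE. A result of None means the parameter is missing, as it was in this case.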

Please refer to the following for some suggestions:

Best Practices for monitoring DX UIM - self-health monitoring

Without the controller logs from the time frame in which the event occurred, while the robot still appeared to be up, we can only guess at what happened.

Additional Information

Suggestions going forward to determine whether the problem lies with the robot or with the network (routing, latency, etc.):

-> Run a continuous ping (ping -t on Windows) or a batch script overnight to ping the machine and see whether it goes down. If it does, check the Windows Event logs (Application and System) to see what happened right before the machine became unresponsive.

-> As a start, implement monitoring of the robot IP address via net_connect to ping the machine over time and see whether any failures occur.
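
The overnight ping check described above could be scripted along the following lines. This is a rough sketch under stated assumptions, not a replacement for net_connect: the function names, the log format, and the 60-second interval are all hypothetical. It also tests TCP port 48000, the same port checked with telnet in this case, so the log shows whether the host and the robot port fail together or separately.

```python
import platform
import socket
import subprocess
import time
from datetime import datetime


def tcp_port_open(host, port=48000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def icmp_ping(host, timeout_s=2):
    """Send one ping using the system ping command; True on exit code 0.

    Flags differ by platform: Windows uses -n (count) and -w (timeout in
    milliseconds); Linux uses -c (count) and -W (timeout in seconds).
    Note: Windows ping can exit with 0 on a "Destination host unreachable"
    reply, so treat this as a coarse check.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def check_once(host, ping=icmp_ping, tcp=tcp_port_open, now=datetime.now):
    """Run both probes once and return a single timestamped log line."""
    stamp = now().isoformat(timespec="seconds")
    ping_state = "ok" if ping(host) else "FAIL"
    tcp_state = "ok" if tcp(host) else "FAIL"
    return f"{stamp} host={host} ping={ping_state} tcp48000={tcp_state}"


def monitor(host, log_path, interval_s=60, iterations=None):
    """Append one log line per interval; runs until stopped if iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(check_once(host) + "\n")
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_s)
```

For example, monitor("robot-host.example", "connectivity.log") would log once a minute until interrupted (the host name here is a placeholder). The timestamp of the first FAIL line can then be compared against the Windows Event logs and the controller log.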
