Many robots disappeared from IM and AC: failed to send alive to hub

Article ID: 261117

Updated On: 03-09-2023

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

A large number of robots have disappeared from IM and AC. This happened on both Windows and Linux robots. The only messages logged are:

Jan  1 20:30:06:603 [9080] 0 Controller: failed to send alive to hub uimhub (10 xx.xxx.xx) - communication error
Jan  1 20:30:06:603 [9080] 0 Controller: failed to send alive (async) to hub uimhub (10 xx.xxx.xx) - communication error
Jan  1 20:30:08:606 [9080] 0 Controller: failed to send alive to hub uimhub

• Restarting the robot locally fixes the issue. (Restarting the robot remotely using pu.exe also fixes it.)

• Telnet works in both directions between the hub and the robots.
• Connecting to a missing robot from IM succeeds.

• The customer has a list of robots recorded in the log of the custom "getRobots" probe.

What could have caused this?

Environment

Release : UIM 20.x, any robot version

Cause

High memory consumption (80-90%) on the primary hub is a potential cause of this issue.

At every "port_alive_check_time" interval (5+ min by default), the controller checks whether its probes are running by triggering the _status callback on each probe.

If the _status command on a probe fails "port_status_check_timer" consecutive times (3 by default), the controller treats that probe as down and removes its port from the list. When this happens, the controller log shows a "Port dropped:" message.

Example log in controller log:

Jan  1 07:37:00:398 [9080] 0 Controller: Port dropped: uim_probe 48019
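To gauge how widespread the problem is on a robot, the two message types quoted in this article can be counted in the controller log. The following is a minimal sketch, not a supported tool: the regular expressions are derived only from the sample log lines shown here, and the function name `summarize_controller_log` is hypothetical.

```python
import re
from collections import Counter

# Patterns are based only on the sample log lines quoted in this article;
# other robot versions may format these messages differently.
ALIVE_FAILED = re.compile(r"Controller: failed to send alive\b")
PORT_DROPPED = re.compile(r"Controller: Port dropped: (?P<probe>\S+) (?P<port>\d+)")


def summarize_controller_log(lines):
    """Count 'failed to send alive' messages and collect dropped probe ports."""
    alive_failures = 0
    dropped = Counter()
    for line in lines:
        if ALIVE_FAILED.search(line):
            alive_failures += 1
        match = PORT_DROPPED.search(line)
        if match:
            dropped[(match.group("probe"), int(match.group("port")))] += 1
    return alive_failures, dropped


# Sample lines copied from this article.
sample = [
    "Jan  1 20:30:06:603 [9080] 0 Controller: failed to send alive to hub uimhub (10 xx.xxx.xx) - communication error",
    "Jan  1 20:30:06:603 [9080] 0 Controller: failed to send alive (async) to hub uimhub (10 xx.xxx.xx) - communication error",
    "Jan  1 07:37:00:398 [9080] 0 Controller: Port dropped: uim_probe 48019",
]
failures, dropped = summarize_controller_log(sample)
print(failures, dict(dropped))
```

Because the function accepts any iterable of lines, an open file handle for the robot's controller log can be passed in place of the sample list.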

With the default settings, this means the controller could not reach the probe for at least 15 minutes (3 consecutive failed checks at 5-minute intervals).

In its next cycle, the controller checks for probes that are marked active but are not running. In the example above, uim_probe was dropped but is still active in controller.cfg, so the controller tries to start it.
If the probe starts successfully, no further action is taken. If it fails to start after 20 retries, the controller declares the probe down.

Resolution

Increasing the two parameters described above can avoid or limit this issue.

Add the parameters "port_alive_check_time" and "port_status_check_timer" to the controller section of the robot.cfg file, and set them to values higher than the defaults.

Defaults:

port_alive_check_time = 5min

port_status_check_timer = 3
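The effect of raising these values can be estimated with simple arithmetic: a port is dropped only after the configured number of consecutive failed checks, so the minimum time a probe must be unreachable is the check interval multiplied by that count. The sketch below illustrates this under the defaults listed above. The raised values (10 minutes, 5 checks) are purely illustrative assumptions, not recommendations, and the function name is hypothetical.

```python
def minimum_unreachable_minutes(check_interval_minutes, failed_checks_required):
    """Shortest time a probe must be unreachable before its port is dropped."""
    return check_interval_minutes * failed_checks_required


# Defaults from this article: 5-minute interval, 3 consecutive failures.
default_window = minimum_unreachable_minutes(5, 3)

# Illustrative raised values only; choose values suited to the environment.
raised_window = minimum_unreachable_minutes(10, 5)

print(default_window, raised_window)  # 15 50
```

A longer window makes the controller more tolerant of a temporarily unresponsive probe, at the cost of detecting a probe that is really down more slowly.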

If many probes are being dropped, they were not accessible for a considerable amount of time. Constantly high memory consumption (80-90%) on the local server is a potential reason for this behavior. Increase the resources on the affected server to avoid these issues.
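On a Linux server, one way to see whether memory use is in the 80-90% range is to read /proc/meminfo. The following is a rough sketch that assumes a Linux kernel exposing the MemTotal and MemAvailable fields. It does not apply to Windows hubs, the function name is hypothetical, and the sample figures are invented for illustration.

```python
def memory_used_percent(meminfo_text):
    """Return used memory as a percentage, from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


# Invented sample values for illustration only.
sample = """MemTotal:       16000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB
"""
used = memory_used_percent(sample)
print(f"{used:.1f}% used")  # 87.5% used
if used >= 80.0:
    print("Memory use is at or above the 80% level discussed in this article")
```

On a live Linux system, the same function can be given the contents of /proc/meminfo instead of the sample text.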
