Jan 1 20:30:06:603 [9080] 0 Controller: failed to send alive to hub uimhub (10 xx.xxx.xx) - communication error
Jan 1 20:30:06:603 [9080] 0 Controller: failed to send alive (async) to hub uimhub (10 xx.xxx.xx) - communication error
Jan 1 20:30:08:606 [9080] 0 Controller: failed to send alive to hub uimhub
• Restarting the robot locally fixes the issue. (Restarting the robot remotely using pu.exe also fixes the issue.)
• Telnet works bidirectionally between the hub and the robots.
• Connecting to the robot from IM (Infrastructure Manager) succeeds for the missing robot.
• The customer's custom "getRobots" probe log contains a list of the robots.
Release: UIM 20.x, any robot version
High memory consumption (80-90%) on the primary hub is a potential cause of this issue.
At every port_alive_check_time interval (5 minutes by default), the controller checks whether each probe is running by triggering the _status callback on the probe.
If the _status callback on a probe fails port_status_check_timer consecutive times (3 by default), the controller treats that probe as down and removes its port from the list. At this point the error "Port dropped:" appears in the controller log.
Example entry in the controller log:
"Jan 1 07:37:00:398 [9080] 0 Controller: Port dropped: uim_probe 48019".
In its next cycle, the controller checks for probes that are marked active but are not running. In the example above, uim_probe has been dropped but is still active in controller.cfg, so the controller will try to start it.
If the probe starts successfully, normal operation resumes; if the probe fails to start after 20 retries, the controller declares that probe down.
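As a conceptual illustration only (this is not the actual controller source), the counter logic described above can be sketched in Python as follows; probe_status() and start_probe() are hypothetical placeholders, not UIM APIs:

    # Conceptual sketch of the alive-check / restart logic described above.
    # probe_status() and start_probe() are hypothetical placeholders, not UIM calls.

    PORT_ALIVE_CHECK_TIME = 5       # minutes between alive-check cycles (default)
    PORT_STATUS_CHECK_TIMER = 3     # consecutive _status failures before a port is dropped
    MAX_START_RETRIES = 20          # start attempts before the probe is declared down

    def probe_status(probe):
        """Placeholder for the _status callback; True if the probe answers."""
        return probe.get("healthy", False)

    def start_probe(probe):
        """Placeholder for a probe start attempt; True on success."""
        return probe.get("startable", False)

    def alive_check_cycle(probes):
        """Run once per PORT_ALIVE_CHECK_TIME interval."""
        for probe in probes:
            if probe_status(probe):
                probe["failures"] = 0
                continue
            probe["failures"] = probe.get("failures", 0) + 1
            if probe["failures"] >= PORT_STATUS_CHECK_TIMER:
                probe["running"] = False          # port removed from the list
                print(f"Port dropped: {probe['name']} {probe['port']}")

    def restart_cycle(probes):
        """Next cycle: restart probes that are active in controller.cfg but not running."""
        for probe in probes:
            if probe["active"] and not probe["running"]:
                for _ in range(MAX_START_RETRIES):
                    if start_probe(probe):
                        probe["running"] = True
                        probe["failures"] = 0
                        break
                else:
                    print(f"{probe['name']} declared down after {MAX_START_RETRIES} retries")

A dropped probe therefore goes through two stages: its port is removed after repeated _status failures, and on later cycles the controller attempts restarts before eventually declaring the probe down.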
Increasing the parameters discussed above can avoid or limit this issue.
Add the parameters "port_alive_check_time" and "port_status_check_timer" to the robot.cfg file, under the controller section, and increase their values above the defaults.
Defaults:
port_alive_check_time = 5min
port_status_check_timer = 3
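For illustration, a robot.cfg excerpt with increased values might look like the following (the values 15 and 5 are example assumptions, not tuning recommendations; leave any other existing keys in the controller section in place):

    <controller>
       port_alive_check_time = 15
       port_status_check_timer = 5
    </controller>

The robot typically needs to be restarted for controller configuration changes to take effect.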
If many probes are being dropped, it means they were unreachable for a considerable amount of time. Sustained 80-90% memory consumption on the local server is a potential reason for this behavior. Increase the resources on the affected server to avoid these issues.