We frequently see controller communication issues on servers. Clearing the niscache and restarting the Nimsoft Robot service resolves the issue on the affected server. No other activity, such as patching or an upgrade, was or is taking place on these servers, yet new servers keep developing controller communication issues. The "Nimbus" service shows as running on the server, but a restart is still required to restore controller communication.
Even after we resolve the issue on one server, the same controller communication issue appears the next day on other servers that had been working.
"Unable to reach controller,
node:
/<domain>/<hub>/<robot>/controller
error message: communication error"
Running a telnet FROM the hub TO the robot on port 48000 then fails to connect.
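Where telnet is not installed on the hub, the same reachability test can be approximated with a short script. The following is a minimal Python sketch, assuming it is run from the hub machine; the function name can_reach_controller and the host name in the usage comment are hypothetical, and only port 48000 comes from the description above.

```python
import socket


def can_reach_controller(host, port=48000, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only checks that the port accepts a TCP connection, which is the
    same thing the telnet test checks. It does not speak the Nimsoft protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage (hypothetical robot host name):
# print(can_reach_controller("robot01.example.com"))
```

A result of False from the hub while the "Nimbus" service shows as running on the robot matches the symptom described here.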
Once you restart the Nimsoft Robot Watcher Service on the machine, you can access the controller and the controller GUI again, but the fix is only temporary. Access may fail again within a few minutes, hours, or weeks.
Excerpts from controller.log:
Jun 3 23:04:57:786 [4552] 0 Controller: failed to send alive to hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) - communication error
Jun 3 23:04:57:786 [4552] 0 Controller: failed to send alive (async) to hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) - communication error
Jun 3 23:05:39:798 [4552] 0 Controller: failed to send alive to hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) - communication error
Jun 3 23:05:39:798 [4552] 0 Controller: failed to send alive (async) to hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) - communication error
Jun 3 23:06:00:798 [4552] 0 Controller: No contact with hub for prolonged time
Jun 3 23:06:21:795 [4552] 0 Controller: hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) NO CONTACT (communication error)
Jun 3 23:06:21:797 [4552] 0 Controller: Hub XXXXXXXXXFPKRHub02(xxx.xxx.xxx.xxx) contact established
Jun 4 04:58:02:141 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 4 05:13:00:616 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 4 05:14:29:677 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 9 01:44:29:014 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 9 01:45:57:060 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 15 16:05:52:262 [4552] 0 Controller: Going down...
Jun 15 16:06:00:427 [4552] 0 Controller: Down
Jun 22 02:40:43:181 [18632] 0 Controller: inst_execute_status: sending reply rc=0(OK)
Jun 22 02:40:45:851 [18632] 0 Controller: inst_execute_status: sending reply rc=7(temporarily out of resources)
Jun 22 02:40:46:853 [18632] 0 Controller: inst_execute_status: sending reply rc=7(temporarily out of resources)
Jun 22 02:40:47:854 [18632] 0 Controller: inst_execute_status: sending reply rc=7(temporarily out of resources)
Jun 22 02:40:48:855 [18632] 0 Controller: Going down...
Jun 22 02:40:48:862 [18632] 0 Controller: inst_execute_status: sending reply rc=0(OK)
Jun 22 02:40:56:985 [18632] 0 Controller: Down
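To gauge how often a robot shows these symptoms, the controller.log lines can be tallied by message type. The following is a minimal Python sketch, assuming the line layout seen in the excerpts above; the regular expression, the symptom labels, and the function name summarize_controller_log are illustrative and not part of any Nimsoft tooling.

```python
import re
from collections import Counter

# Assumed layout, based on the excerpts above:
#   "Jun 3 23:04:57:786 [4552] 0 Controller: <message>"
LINE_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}:\d{3}\s+\[\d+\]\s+\d+\s+Controller:\s+(.*)$"
)

# Substrings taken from the log excerpts, mapped to illustrative labels.
SYMPTOMS = {
    "communication error": "comm_error",
    "temporarily out of resources": "out_of_resources",
    "Going down": "going_down",
}


def summarize_controller_log(lines):
    """Count controller log lines that contain each known symptom."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        message = match.group(1)
        for needle, label in SYMPTOMS.items():
            if needle in message:
                counts[label] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for controller.log can be passed directly; the location of that file depends on the robot installation.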
The customer may wish to observe a subset of the upgraded robots previously exhibiting communication errors for 24 hours or so, and then if they remain accessible, update all of the other hubs and robots.
This and other similar issues with robots were originally caused by hub v9.31 and recurred in hub v9.33.
Upgrade the hub. (Note: you may also have to clear the niscache; this is a good step in any case.)
If the robot is throwing the communication error and hence is inaccessible, restart the Nimsoft Robot Watcher Service so it becomes available again.
Then, when the robot comes up, right-click the controller probe and choose Update Version...