controller communication error - Unable to reach controller in UIM 20.3 with hub & robot 9.33

Article ID: 244573


Products

DX Unified Infrastructure Management (Nimsoft / UIM) CA Unified Infrastructure Management On-Premise (Nimsoft / UIM) CA Unified Infrastructure Management SaaS (Nimsoft / UIM) Unified Infrastructure Management for Mainframe

Issue/Introduction

We frequently see controller communication errors on servers. Clearing the niscache and restarting the Nimsoft Robot service resolves the issue on the affected server. No other activity such as patching or an upgrade is taking place on those servers, yet new servers keep developing the same error. The "Nimbus" service shows as running on the server, but a restart is still required to restore controller communication.

Even after we resolve the issue on one server, the same controller communication error appears the next day on other servers that were previously working.

"Unable to reach controller,
node:
/<domain>/<hub>/<robot>/controller
error message: communication error"

Running a telnet FROM the hub TO the robot on port 48000 then fails to connect.
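Where telnet is not installed on the hub, the same reachability test can be approximated with a short script. This is a minimal sketch, not part of UIM: the function name `can_connect` and the example hostname are hypothetical, and the only detail taken from this article is port 48000.

```python
import socket


def can_connect(host: str, port: int = 48000, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Roughly equivalent to 'telnet <host> 48000' connecting successfully.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example (hypothetical robot name), run from the hub:
# print(can_connect("robot01.example.com"))
```

A result of False only shows that the TCP connection failed. It does not distinguish between a firewall, a name resolution problem, and a controller that has stopped listening.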

Once you restart the Nimsoft Robot Watcher Service on the machine, you can access the controller/controller GUI again, but the fix is only temporary. The robot may fail again within minutes, hours, or weeks.

Excerpts from controller.log:

Jun 3 23:04:57:786 [4552] 0 Controller: failed to send alive to hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) - communication error
Jun 3 23:04:57:786 [4552] 0 Controller: failed to send alive (async) to hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) - communication error
Jun 3 23:05:39:798 [4552] 0 Controller: failed to send alive to hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) - communication error
Jun 3 23:05:39:798 [4552] 0 Controller: failed to send alive (async) to hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) - communication error
Jun 3 23:06:00:798 [4552] 0 Controller: No contact with hub for prolonged time
Jun 3 23:06:21:795 [4552] 0 Controller: hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) NO CONTACT (communication error)
Jun 3 23:06:21:797 [4552] 0 Controller: Hub XXXXXXXXXFPKRHub02(xxx.xxx.xxx.xxx) contact established
Jun 4 04:58:02:141 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 4 05:13:00:616 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 4 05:14:29:677 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 9 01:44:29:014 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 9 01:45:57:060 [4552] 0 Controller: Hub XXXXXXXX<hub>(xxx.xxx.xxx.xxx) contact established
Jun 15 16:05:52:262 [4552] 0 Controller: Going down...
Jun 15 16:06:00:427 [4552] 0 Controller: Down

Jun 22 02:40:43:181 [18632] 0 Controller: inst_execute_status: sending reply rc=0(OK)
Jun 22 02:40:45:851 [18632] 0 Controller: inst_execute_status: sending reply rc=7(temporarily out of resources)
Jun 22 02:40:46:853 [18632] 0 Controller: inst_execute_status: sending reply rc=7(temporarily out of resources)
Jun 22 02:40:47:854 [18632] 0 Controller: inst_execute_status: sending reply rc=7(temporarily out of resources)
Jun 22 02:40:48:855 [18632] 0 Controller: Going down...
Jun 22 02:40:48:862 [18632] 0 Controller: inst_execute_status: sending reply rc=0(OK)
Jun 22 02:40:56:985 [18632] 0 Controller: Down
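To check whether other robots show the same pattern, the messages in the excerpts above can be counted in a controller.log file. The sketch below is illustrative only: the category names and the `summarize` helper are hypothetical, and the patterns are taken from the log lines quoted in this article rather than from any documented list of controller messages.

```python
import re
from collections import Counter

# Patterns taken from the controller.log excerpts in this article.
PATTERNS = {
    "alive_failed": re.compile(r"failed to send alive"),
    "no_contact": re.compile(r"NO CONTACT|No contact with hub"),
    "out_of_resources": re.compile(r"rc=7\(temporarily out of resources\)"),
    "going_down": re.compile(r"Controller: Going down"),
}


def summarize(lines):
    """Count how many log lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Example (the log path is an assumption and depends on the install):
# with open("controller.log", errors="replace") as fh:
#     print(summarize(fh))
```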

Environment

  • Release: UIM 20.3
  • Component: UIM - ROBOT 9.33
  • Hub: 9.33 and hub_9.33_HF1 (the hotfix did not resolve this issue)

Cause

  • hub 9.33
  • hub_9.33_HF1

Resolution

Multiple commands were run to analyze robot connectivity, and they revealed some network latency and DNS issues (many servers were not resolvable).
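The article does not list the commands that were used. As one possible way to find robots whose names do not resolve from the hub, a list of hostnames can be tested as sketched below. The helper name `unresolvable` and the example hostnames are hypothetical.

```python
import socket


def unresolvable(hostnames, resolver=socket.gethostbyname):
    """Return the hostnames that the resolver fails to resolve."""
    failed = []
    for name in hostnames:
        try:
            resolver(name)
        except OSError:
            # socket.gaierror (a subclass of OSError) signals a lookup failure.
            failed.append(name)
    return failed


# Example (hypothetical robot names), run from the hub:
# print(unresolvable(["robot01.example.com", "robot02.example.com"]))
```

The `resolver` parameter exists only so that the lookup can be swapped out, for example to try the function without touching the network.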

If you don't see any suspicious Windows events in the Application/System logs, please follow these steps:
 
1. Upgrade ALL hubs and robots to 9.35
 
2. Then perform a 'clearout' process, e.g.:

    a. Stop the hub robot
    b. Remove the robot.sds file (in <install_dir>\Program Files (x86)\Nimsoft\hub)
    c. Wait at least 5 minutes
    d. Activate the hub robot
 
Some stuck jobs may be interfering with the robot. It could be that distsrv is hammering the robot with distributions, causing it to bog down while it tries to start everything.
 
To remove any stuck jobs:
 
3. Deactivate distsrv on local and remote hubs.
4. Open Raw Configure and navigate to the Tasks section. If any installs are listed (they appear under the name given when they were created), delete them.
5. Restart distsrv on hub(s)
6. If jobs still persist in the "View Distribution Jobs" window within IM, restart the robot on the primary hub.

Additional Information

The customer may wish to observe a subset of the upgraded robots previously exhibiting communication errors for 24 hours or so, and then if they remain accessible, update all of the other hubs and robots.

This and other similar robot issues originally occurred with hub v9.31 and recurred in hub v9.33.

Upgrade the hub. (Note that you may also have to clear the niscache, which is a good step in any case.)


If the robot is throwing the communication error and is therefore inaccessible, restart the Nimsoft Robot Watcher Service so that it becomes available again.

Then, when the robot comes up, right-click the controller probe and choose Update Version...