Troubleshooting UIM Robot-Hub connectivity or communication issues and errors

Article ID: 199577

Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • Unified Infrastructure Management for Mainframe
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

  • Communication errors on multiple Windows servers where robots are installed
  • Unable to install probes on Windows systems where robots are installed. Cannot connect to these robots via IM->Tools->Connect
  • Some robots show their Nimsoft Robot Watcher service running but still send robot inactive alarms and appear red in the Infrastructure Manager / Admin Console

Cause

Common symptoms of controller communication issues include: 

  • Unable to install probes on the robot
  • Communication errors occur when trying to open/configure the controller via AC or open the controller GUI in IM
  • Robot processes appear to be up and running, probe is green, controller, hdb, and spooler all have ports and PIDs, but user cannot open controller
  • Robot inactive alarms
  • Tools->Connect from within IM fails on hostname or IP
  • Can successfully connect the robot to a different hub, e.g., secondary but not its intended Parent hub
  • Local or remote firewalls are not expected to be enabled or interfering but communication issues persist
  • Unable to reach controller, communication error
  • Robot communication error checking required
  • Robot inconsistently up/down
  • Unable to contact message
  • Hub-Robot, Robot-Hub troubleshooting

Environment

  • Release: DX UIM version 9.0.2 or higher
  • Component: UIM - ROBOT 7.x or higher

Resolution



Check for Environmental Changes

Check with your team and with your network/security contacts whether anything changed on the date the robots first started exhibiting connectivity/communication issues, e.g., a robot upgrade, a change in IP address, a change in configuration, changes to networking/routing, new devices added to the network, policy-based routing (PBR) changes, security changes, etc. Was any new or additional security software installed on the date the issue started occurring?



Run Basic Tests

  • Run the hostname command on the robot machine to make sure it displays the correct/expected hostname
  • Test whether you can ping the robot from the hub and vice versa. Note that ping (ICMP) is sometimes blocked for security reasons, so a failed ping alone is not conclusive.
  • Check that each host can resolve the other host's name
    • From the hub, nslookup <robot_hostname>
    • From the robot, nslookup <hub_hostname>
  • See if you can connect successfully to and from the hub<->robot via telnet, e.g.:

    telnet TO the robot FROM the hub on port 48000
    telnet TO the hub FROM the robot on port 48002

  • If telnet is disabled, enable it if possible, or temporarily use the PuTTY utility if there is no other option.
  • On the robots, check in IM/AC whether the following robot probes display as green and have ports AND PIDs.
    • controller
    • hdb
    • spooler
  • Are the controller, hdb, and spooler processes running? If not, it's possible that they are either being blocked or the robot installation was not 100% successful.
  • On Windows, check the ‘Nimsoft Robot Watcher Service’ to make sure it's running.
  • Is the robot (controller) writing to the log file? If not, the controller may be hung, or the loglevel may need to be increased; try loglevel 5 or 6.
  • At loglevel 5, and logsize set to 100000, check controller, spooler logs and/or check the hub logs. Press F4 to highlight (in red) what you're looking for such as 'error', 'fail', or 'exception'
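
Where telnet is unavailable, the same TCP test can be approximated with a short script. The following is a minimal sketch in Python and is not part of UIM: the function name probe_tcp and the host names in the comments are hypothetical, and the ports are the defaults mentioned above (48000 for the robot, 48002 for the hub).

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt a TCP connection, roughly what `telnet host port` does.

    Returns "open", "refused", "timeout", "unresolved" or "error: <detail>".
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        return "unresolved"  # name lookup failed; compare with nslookup
    except ConnectionRefusedError:
        return "refused"     # host answered, but nothing listens on the port
    except socket.timeout:
        return "timeout"     # no answer within the timeout
    except OSError as exc:
        return f"error: {exc}"


# Example calls (hypothetical host names, replace with your own):
#   on the hub:    probe_tcp("robot01.example.com", 48000)
#   on the robot:  probe_tcp("hub01.example.com", 48002)
```

A "timeout" result often points to packets being silently dropped somewhere along the path, for example by a firewall, whereas "refused" means the host answered but nothing is listening on that port.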


Run Advanced Tests


Check Robot processes

On Windows

  • controller.exe
  • hdb.exe
  • spooler.exe

On Linux/Unix

Running ps -ef | grep nim should display the three processes listed below:

ps -ef|grep nim

     root       5937   5910  0 13:57 ?        00:00:01 nimbus(controller)
     root       6652   5937  0 14:13 ?        00:00:00 nimbus(spooler)
     root       6654   5937  0 14:13 ?        00:00:00 nimbus(hdb)

On the robot, if only the controller is running and not the hdb and spooler, then it's possible that either the installation did not complete or there is a local firewall enabled and blocking some ports/protocols.
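
The process check above can be scripted when many robots need to be verified. The sketch below is an illustration only: it assumes the process names appear as nimbus(<name>) exactly as in the sample output above, which may differ on other platforms or robot versions, and the function name missing_robot_processes is hypothetical.

```python
import re

# The three processes a healthy robot is expected to run.
REQUIRED = ("controller", "spooler", "hdb")


def missing_robot_processes(ps_output):
    """Return the required robot processes not found in `ps -ef` output."""
    found = set(re.findall(r"nimbus\((\w+)\)", ps_output))
    return [name for name in REQUIRED if name not in found]


sample = """\
root       5937   5910  0 13:57 ?        00:00:01 nimbus(controller)
root       6652   5937  0 14:13 ?        00:00:00 nimbus(spooler)
root       6654   5937  0 14:13 ?        00:00:00 nimbus(hdb)
"""
print(missing_robot_processes(sample))  # prints []
```

On a live system, the text passed to the function could come from subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout.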

Trace the network route (source<->destination)

  • Run a tracert (Windows) / traceroute (Linux/UNIX) command to see if the robot has a network route/path to follow to reach its Parent hub.
  • Run a tracert (Windows) / traceroute (Linux/UNIX) command to see if the robot has a network route/path to follow to reach the Primary hub (if it's expected to be able to connect to the Primary)
  • From hub to robot and robot to the hub, does it successfully complete, and if so does it follow the expected route as per the network team?
  • Check for any crashes or problematic events in the Windows event logs (Application/System), on each machine. For example, application crash dumps, and anti-virus blocking.

IMPORTANT:
Note that some AV software generates events that are not categorized as ERROR. They may be categorized as Informational but still cause an issue, e.g., by blocking a process or subprocess. Carbon Black (CB) is known for categorizing blocks as Informational events.

  • Does the Global/Local Anti-Virus configuration contain a full exclusion for ALL Nimsoft Programs? This is a requirement.


Advanced Testing and Analysis


Test access and function from a closer vantage point

  • Install the Infrastructure Manager (IM) on the hub that the problematic robot reports to, then login to see if the robot displays/controller opens in the IM with no issues (green status).
    If it does, then the issue may simply be that you cannot access the robot from the location where your IM is installed, e.g., laptop over a VPN connection.
  • Check to make sure that no local (or intermediate-remote) firewall is auto-enabled/blocking any TCP or UDP traffic to/from the hub and robot:
  • Use the service iptables status command (RHEL 6) or the firewalld commands (RHEL 7, 8) to check the firewall state
  • Check with your security team to make sure no Intrusion Prevention System (IPS) / Intrusion Detection System (IDS) is interfering with the connections. Note that this may be a factor given different regions/locations based on where the firewalls are located and/or how they are configured.
  • Temporarily disable the firewall if that is feasible, and then test communication again.

Also, if firewall(s) are enabled, check with your firewall team to make sure the proper rules are in place to ALLOW connectivity between the robots and their hubs and/or any related hub-to-hub connections (or tunnels via port 48003).

To perform a complete analysis of communication/connectivity between robot<->hub, you MUST check the firewall first.

Windows firewall

To check Windows firewall:

  1. Open the Search window ('Type here to Search' in the Taskbar)
  2. Type in 'Firewall'
  3. Then click 'Check firewall status'

Linux/Unix Firewall

Customers normally consult their Linux system administrators to check/manage their firewalls. If you have access, you can run the commands below; otherwise, check with your security team as to what is allowed.
 
RHEL 6
iptables:

service iptables status
service iptables stop
iptables -F (flush the rules)


RHEL 7, 8
firewalld:

firewall-cmd --state
systemctl stop firewalld

How to Start/Stop and Enable/Disable FirewallD and Iptables Firewall in Linux

To list all iptables firewall rules on Linux enter the command:

iptables -L

AIX

    • Check commands online for the given AIX OS Version

Solaris

    • Check commands online for the given Solaris OS Version


Check Ports and Protocols

  • UIM (Nimsoft) Protocols for all components are TCP except for controller, hdb, and spooler, which also require UDP.
  • UDP broadcast is used for the discovery of the hub, spooler, and controller components. All other core communications are done via TCP.



Quick ‘Testing’ Checklist

Check that the robot.cfg has the correct hub and robot info/configuration

  1. Check to make sure that the robot.cfg has not been corrupted nor truncated (compare with a healthy robot)
  2. ping successful? (Hub<->robot)
  3. tracert successful? (Hub<->robot)
  4. telnet TO the Hub from the Robot on port 48002 to see if it succeeds consistently without any intermittent failures
  5. telnet TO the robot FROM the hub on port 48000; if it fails, it appears that communication from hub TO robot on port 48000 is being filtered/blocked
  6. Check for any related communication/connectivity errors in the local robot's controller.log when loglevel is set to 6, logsize set to 5000
  7. Verify via netstat that the robot is LISTENING on port 48000 so that the hub can contact the robot on that specific port

The robot listens on port 48000, for example:

netstat -an | findstr "48000"
TCP    0.0.0.0:48000          0.0.0.0:0              LISTENING
UDP    0.0.0.0:48000          *:*

The hub listens on port 48002, for example:

netstat -an | findstr "48002"
TCP    0.0.0.0:48002          0.0.0.0:0              LISTENING
TCP    10.xx.xxx.120:48002    10.xx.xxx.120:49201    ESTABLISHED
TCP    10.xx.xxx.120:48002    10.xx.xxx.120:49213    ESTABLISHED
...
...
TCP    10.74.240.120:65509    10.74.240.120:48002    ESTABLISHED
UDP    0.0.0.0:48002          *:*
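
The netstat check can also be automated. The sketch below parses output in the Windows netstat -an layout shown above (protocol, local address, foreign address, state); Linux netstat and ss use different columns and would need a different parser. The function name is_listening is hypothetical.

```python
def is_listening(netstat_output, port, proto="TCP"):
    """Return True if the netstat text shows a LISTENING socket on `port`."""
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].upper() != proto.upper():
            continue
        local_address = fields[1]
        # rpartition also copes with IPv6 forms such as [::]:48000
        local_port = local_address.rpartition(":")[2]
        if local_port == str(port) and "LISTENING" in fields:
            return True
    return False


sample = """\
TCP    0.0.0.0:48000          0.0.0.0:0              LISTENING
UDP    0.0.0.0:48000          *:*
"""
print(is_listening(sample, 48000))  # prints True
```

Only the local address column is compared, so an ESTABLISHED outbound connection to a remote port 48002 is not mistaken for a local listener.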



Check Hotfixes

  • Check the latest hub AND robot version/hotfix release notes for similar issues that may have already been fixed.
  • If telnet to/from Hub<->Robot fails, the underlying issue could be related to hosts, subnets, or IP address ranges being allowed/disallowed, network packet filtering, network routing, and/or intermediate firewalls.
  • Protocols 'TCP' AND 'UDP' MUST be allowed bi-directionally for Hub<->robot communication. This is a requirement.

 



Check Anti-Virus



Advanced Troubleshooting

Network Protocol/Connectivity Analysis and Tracing

  • If the Robot<->Hub communication issue(s) persist after checking all of the factors mentioned above, the next step is a network test: use Wireshark to capture a trace while you are trying to reach the robot on port 48000 or the hub on port 48002. All robots listen on their default port 48000. All hubs listen on their default port 48002.
  • The security/firewall team should check the security software/firewall logs while the telnet test between the robot and hub is running
  • The network team should check routes when applicable (tracert from the hub to the robot, and vice versa, should complete successfully) and check the communication between hub and robot using a Wireshark trace for the source (robot) and the destination (hub). For example, you may see a high number of TCP errors such as Out-of-Order segments, Duplicate ACKs, and retransmissions, which may indicate networking/routing issues.


For example, in a Wireshark trace taken on a hub while a connection or telnet test is run against the robot/hub on port 48000/48002, you may see retransmissions. These indicate that one or more of the confirmations (ACKs) did not succeed during the 3-way handshake, e.g., the robot connects to the hub and the hub sends its reply, but the robot's confirmation back to the hub fails. In a 3-way handshake between a client and a server, or between a robot and a hub:

  1. Source sends SYN (synchronize) to target
  2. Target responds with SYN-ACK
  3. Source responds with ACK. If it doesn’t, then routing may be in question, especially if traceroute gives different results for other robots in the same network segment that CAN connect to the hub.

This may be evidenced during a test from the robot to the parent or Primary hub using telnet:

In the example below, telnet to the Primary hub FROM the robot connects for just a moment, and then the connection is closed.

From the robot TO the Primary hub's robot port (48000) or hub port (48002):

  telnet hub1.company.com 48000
  Trying 10.xx.xx.xx…
  Connected to hub1.company.com
  Escape character is '^]'.
  Connection closed by foreign host.
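
The "connects, then closes" pattern above can be told apart from an outright connection failure with a small script. This is a sketch under assumptions, not a UIM tool: the function name connect_and_hold is hypothetical and the default 3-second hold time is arbitrary. Because this article does not state how long a healthy hub or robot keeps an idle connection open, compare the result with a robot in the same network segment that is known to work.

```python
import socket


def connect_and_hold(host, port, hold=3.0, timeout=5.0):
    """Connect, then wait up to `hold` seconds to see whether the peer closes.

    Returns "held" if the connection stays open, "closed by peer" if the
    remote side (or a device in between) closes or resets it, or
    "connect failed: <detail>" if the TCP connection could not be made.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return f"connect failed: {exc}"
    with sock:
        sock.settimeout(hold)
        try:
            data = sock.recv(1)
        except socket.timeout:
            return "held"            # nothing received and nothing closed
        except OSError:
            return "closed by peer"  # for example, connection reset
        # An empty read means the other side closed the connection;
        # any received byte means it was still open at that moment.
        return "closed by peer" if data == b"" else "held"
```

A result of "closed by peer" right after a successful connect matches the telnet output above and is a reason to involve the network and security teams.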

 

There may be different/unexpected routes from the robot to the parent hub or the Primary hub, which may be evidenced by the telnet command results. In some cases, router/switch misconfiguration may be the cause, e.g., the configured range of IPs for policy-based routing (PBR) includes the problematic robot IP. The range of IPs may have been added on one end of the network route but not the other. This type of misconfiguration, which may also include an asymmetric route, can only be checked and analyzed by the network team. Asymmetric routing is when network packets leave via one path and return via a different path (unlike symmetric routing, in which packets come and go using the same path), resulting in 'half a conversation', so to speak. This can cause communication issues between any client and server and, in this case, between hub and robot.



Retransmissions

There are four common reasons for packet retransmissions:

  • The lack of an acknowledgment that data has been received within a reasonable time
  • The sender discovering that transmission was unsuccessful (usually through out-of-band means)
  • The receiver notifying the sender that expected data hasn’t been received
  • The receiver discovering that data has been damaged/corrupted during initial transmission


[Image: example Wireshark capture showing retransmissions between a robot and a hub where the final ACK from the robot to the hub did not succeed.]


Additional Information

Robots installed in a DMZ

  • When a robot is installed in the DMZ and the hub is outside the DMZ, it may be worthwhile for the network team to run a Wireshark trace on the robot to see whether it is sending packets to and receiving packets from the hub, and also on the hub to view the current traffic.
  • You can also ask the firewall administrator to check the firewall logs to see what TCP/UDP traffic, if any, between the robot and the hub is being blocked.
  • You can use security rules (allow the hub outside the DMZ to communicate with the robots on port 48000, or limit communication to specific source:destination pairs), or install a UIM hub within the DMZ and establish a tunnel to a hub ‘outside’ the DMZ, using either the default tunnel port 48003 (recommended) or port 443 if the security team prefers it.
  • The tunnel can run from the hub inside the DMZ TO a remote hub or back to the Primary hub itself.


OS-specific notes

  • AIX security software/firewalls: the lsfilt command lists the filter rules present in the table. Each rule is assigned a number when it is created, and this command shows that number.
  • In one case, telnet TO the hub on port 48002 from the DMZ robots worked, but telnet in the other direction, to the robot on port 48000, failed.
  • In that case, the cause was security software installed on the machines in the DMZ (Illumio Adaptive Security software), which prevented incoming connections because the port was not 'whitelisted.'
  • Once access was granted, communication was bi-directional between the robots/agents and the hubs.

AIX mkfilt firewall command reference



Infrastructure Manager (IM) communication issues or errors

When trying to open the controller or spooler probe on a robot, you receive a communication error:

Unable to reach controller,
 node:
/<domain>/<hub>/<robot>/controller
 error message: communication error

When there are no tunnels between hubs, the hub acts only like a "DNS" server that provides the IP:port of the hubs and robots. So in addition to being able to telnet from hub to robot and from robot to hub, you must also be able to do the following:

  • telnet to the robots FROM any Infrastructure Manager (IM) 'workstation' / laptop and also FROM the Primary hub, where the Admin Console is hosted, even if the robots aren't under the Primary hub.
  • Open the firewall between the Primary hub and the Secondary hub's robots
  • Open the firewall from the IM workstations TO the Secondary hub's robots
  • In terms of opening the firewall, just TCP is sufficient, as we only use UDP for "findhub", when the robot searches the subnet for a nearby hub after its own hub goes down.
  • Ports should be opened, e.g., 48000-48100, but the number depends on how many probes a customer has on the robots, as each probe also needs to be able to receive direct communication on its assigned port. We usually recommend 48000-48100 to be safe and to allow for more probes/probe ports.
  • Another alternative would be to create a tunnel (and use port 48003) between the Primary and Secondary hubs and see if that allows the communication

 

Related Links

Robots registering with 127.0.0.1 Loopback IP/169.x APIPA/DHCP address


 
