Troubleshooting UIM Robot-Hub connectivity or communication issues and errors

Article ID: 199577

Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • Unified Infrastructure Management for Mainframe
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

This KB article focuses on Hub<->Robot communication issues, where the hub cannot contact the robot or the robot cannot communicate with its hub. Robot communication issues and errors may happen for a variety of reasons, including but not limited to:

  • Communication errors on multiple Windows or Linux/Unix systems where DX UIM robots are installed
  • Unable to install probes on Windows systems where robots are installed. Cannot connect to these robots via IM->Tools->Connect
  • Some robots show their Nimsoft Robot Watcher service running but still send robot inactive alarms
  • Robots appear red in the Infrastructure Manager / Admin Console
  • Robot communication errors in the controller log

Some of the most common symptoms of robot (controller) communication issues include: 

    • Unable to install probes on the robot
    • Communication errors occur when trying to open/configure the controller via AC or open the controller GUI in IM
    • Robot processes appear to be up and running, probe is green, controller, hdb, and spooler all have ports and PIDs, but the user cannot open the controller probe in AC or via IM controller GUI
    • Robot inactive alarms (persistent)
    • Tools->Connect from within the Infrastructure Manager fails on hostname or IP
    • Can successfully connect the robot to a different hub, e.g., secondary but not its intended Parent hub
    • Local or remote firewalls are not expected to be enabled or interfering but communication issues persist
    • 'Unable to reach controller,' communication errors
    • Robot communication error checking is required
    • Robot inconsistently up/down or intermittent
    • 'Unable to contact' message
    • hub NO CONTACT (communication error) message
    • Corrupted Hub instance
    • Hub-Robot, Robot-Hub troubleshooting guidance needed

Environment

  • Release: DX UIM version 9.0.2 or higher
  • Component: UIM - ROBOT 7.x or higher

Cause

  • DX UIM troubleshooting guidance
  • Robot communication issues or errors
  • Hub communication issues or errors
  • Network communication between hubs and robots

Resolution



Check for Environmental Changes

Check with your team and your network/security contacts whether anything changed on the date the robots first started exhibiting connectivity/communication issues, e.g., an upgrade of the robot, a change in IP address, a change in configuration, changes to networking/routing, new devices added to the network, policy-based routing (PBR) changes, or security changes. Was any new or additional security software installed on the date the issue started occurring?



Run Basic Tests

  • Run the hostname command on the robot machine to make sure it displays the correct/expected hostname
  • Test whether you can ping the robot from the hub and vice versa. Keep in mind that ping is sometimes blocked for security reasons, so a failed ping alone is not conclusive.
  • See if you can resolve the host from each respective host
    • From the hub, nslookup <robot_hostname>
    • From the robot, nslookup <hub_hostname>
  • See if you can connect successfully to and from the hub<->robot via telnet, e.g.:

    telnet TO the robot FROM the hub on port 48000
    telnet TO the hub FROM the robot on port 48002

  • If telnet is disabled, enable it if possible, or temporarily use the PuTTY utility if there is no other option.
  • On the robots, check in IM/AC whether the following robot probes display as green and have ports AND PIDs.
    • controller
    • hdb
    • spooler
  • Are the controller, hdb, and spooler processes running? If not, it's possible that they are either being blocked or the robot installation was not 100% successful.
  • On Windows, check the ‘Nimsoft Robot Watcher Service’ to make sure it's running.
  • Is the robot (controller) writing to the log file? If not, the controller may be hung, or the loglevel may need to be increased. Try loglevel 5 or 6.
  • At loglevel 5, and logsize set to 100000, check controller, spooler logs and/or check the hub logs. Press F4 to highlight (in red) what you're looking for such as 'error', 'fail', or 'exception'
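
Where telnet is unavailable, the TCP reachability tests above can be approximated with a few lines of Python. This is a minimal sketch and not part of UIM: the function name tcp_port_open and the hostnames in the comments are placeholders. It only confirms that a TCP connection can be opened, not that the controller or hub behind the port is healthy.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (hypothetical hostnames, replace with your own):
#   on the hub:   tcp_port_open("robot01.example.com", 48000)
#   on the robot: tcp_port_open("hub01.example.com", 48002)
```

A result of False does not say why the connection failed; the firewall, routing, and Wireshark checks described later in this article are still needed to find the cause.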


Run Advanced Tests


Check Robot processes

On Windows

  • controller.exe
  • hdb.exe
  • spooler.exe

On Linux/Unix

ps -ef | grep nim should display these 3 processes listed below:

ps -ef|grep nim

     root       5937   5910  0 13:57 ?        00:00:01 nimbus(controller)
     root       6652   5937  0 14:13 ?        00:00:00 nimbus(spooler)
     root       6654   5937  0 14:13 ?        00:00:00 nimbus(hdb)

On the robot, if only the controller is running and not the hdb and spooler, then it's possible that either the installation did not complete or there is a local firewall enabled and blocking some ports/protocols.
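
The Linux/Unix process check can be scripted when many robots need to be verified. The sketch below assumes the processes appear as nimbus(controller), nimbus(spooler), and nimbus(hdb), as in the sample output above; the function name missing_robot_processes is illustrative only.

```python
import re

# Process names expected on a healthy robot, per the sample ps output.
EXPECTED = ("controller", "spooler", "hdb")


def missing_robot_processes(ps_output):
    """Return the expected robot processes that do not appear in `ps -ef` output."""
    running = set(re.findall(r"nimbus\((\w+)\)", ps_output))
    return [name for name in EXPECTED if name not in running]


# Example usage on the robot itself:
#   import subprocess
#   out = subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout
#   print(missing_robot_processes(out))
```

An empty list means all three processes were found. A result such as ['spooler', 'hdb'] matches the controller-only situation described above.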

Trace the network route (source<->destination)

  • Run a tracert (Windows) / traceroute (Linux/UNIX) command to see if the robot has a network route/path to follow to reach its Parent hub.
  • Run a tracert (Windows) / traceroute (Linux/UNIX) command to see if the robot has a network route/path to follow to reach the Primary hub (if it's expected to be able to connect to the Primary)
  • Does the trace complete successfully in both directions (hub to robot and robot to hub), and if so, does it follow the route the network team expects?
  • Check for any crashes or problematic events in the Windows event logs (Application/System), on each machine. For example, application crash dumps, and anti-virus blocking.

IMPORTANT:
Note that some AV software may generate events that are not categorized as ERROR. They may be categorized as Informational but still cause an issue, e.g., by blocking a process/subprocess. CB/Carbon Black is known for categorizing blocks as Informational events.

  • Does the Global/Local Anti-Virus configuration contain a full exclusion for ALL Nimsoft Programs? This is a requirement.


Advanced Testing and Analysis


Test access and function from a closer vantage point

  • Install the Infrastructure Manager (IM) on the hub that the problematic robot reports to, then login to see if the robot displays/controller opens in the IM with no issues (green status).
    If it does, then the issue may simply be that you cannot access the robot from the location where your IM is installed, e.g., laptop over a VPN connection.
  • Check to make sure that no local (or intermediate-remote) firewall is auto-enabled/blocking any TCP or UDP traffic to/from the hub and robot:
  • Use service iptables status command or firewalld commands (RHEL 6/7) to check firewall state
  • Check with your security team to make sure no Intrusion Prevention System (IPS) / Intrusion Detection System (IDS), whitelisting software, malware software, etc., is interfering with the connections. Note that this may be a factor given different regions/locations based on where the firewalls are located and/or how they are configured.
  • Temporarily disable the firewall if that is feasible, and then test communications again.

Also, if firewall(s) are enabled, check with your firewall team to make sure the proper rules are in place to ALLOW connectivity between the robots and their hubs and/or any related hub-to-hub connections (or tunnels via port 48003).

To perform a complete analysis of communication/connectivity between robot<->hub, you MUST check the firewall first.

Windows firewall

To check Windows firewall:

  1. Open the Search window ('Type here to Search' in the Taskbar)
  2. Type in 'Firewall'
  3. Then click 'Check firewall status'

Linux/Unix Firewall

Customers normally consult their Linux system administrators to check/manage their firewalls. If you have access, you can run the commands below; otherwise, check with your security team as to what is allowed.
 
RHEL 6
iptables:

service iptables status
service iptables stop
iptables -F (flush the rules)

To list all iptables firewall rules on Linux enter the command:

iptables -L


RHEL 7, 8
firewalld:

firewall-cmd --state
systemctl stop firewalld

How to Start/Stop and Enable/Disable FirewallD and Iptables Firewall in Linux

 

SELinux

To determine if SELinux is blocking an application or process, you can check the audit log for denied operations:

Go to the /var/log/audit/audit.log file
Search for messages that contain the word "denied"
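
The audit log search can be scripted as follows. This is a sketch only: the function name selinux_denials is made up for illustration, and the optional keyword filter "nimbus" is an assumption based on the process names shown earlier in this article. The process name recorded in the audit log may differ on your system, so run it without a keyword first.

```python
def selinux_denials(lines, keyword=None):
    """Return audit-log lines containing 'denied', optionally also containing `keyword`."""
    hits = []
    for line in lines:
        if "denied" not in line:
            continue
        if keyword is not None and keyword not in line:
            continue
        hits.append(line.rstrip("\n"))
    return hits


# Example usage (reading the audit log usually requires root):
#   with open("/var/log/audit/audit.log", errors="replace") as fh:
#       for entry in selinux_denials(fh, keyword="nimbus"):
#           print(entry)
```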

To check if SELinux is enabled on a Linux distribution, you can use the sestatus command in a terminal or SSH session:

    1. Open a terminal or SSH session
    2. Run the command sestatus
    3. If SELinux is enabled, the output will be similar to SELinux status: enabled

Check the following regarding SELinux with the Linux/Unix administrator.

To change SELinux from enabled to disabled and vice versa change the SELinux variable in /etc/sysconfig/selinux and reboot the server.
If SELinux is enabled, use setenforce 0 from the command line to change to PERMISSIVE mode; to change from PERMISSIVE mode back to ENFORCING, use setenforce 1.

AIX

    • Check commands online for the given AIX OS Version
    • Check that the loopback interface is enabled on the robot

Solaris

    • Check commands online for the given Solaris OS Version


Check Ports and Protocols

  • UIM (Nimsoft) Protocols for all components are TCP except for controller, hdb, and spooler, which also require UDP.
  • UDP broadcast is used for the discovery of the hub, spooler, and controller components. All other core communications are done via TCP.



Robot Communication - Quick ‘Test’ Checklist

Check that the robot.cfg has the correct hub and robot info/configuration

  1. Check to make sure that the robot.cfg has not been corrupted nor truncated (compare with a healthy robot)
  2. ping successful? (Hub<->robot)
  3. tracert successful? (Hub<->robot)
  4. telnet TO the Hub from the Robot on port 48002 to see if it succeeds consistently without any intermittent failures
  5. telnet TO the robot FROM the hub on port 48000. If this fails while step 4 succeeds, it appears that communication from hub TO robot on port 48000 is being filtered/blocked
  6. Check for any related communication/connectivity errors in the local robot's controller.log when loglevel is set to 6, logsize set to 5000
  7. Depending on the OS/version, carefully check whether the firewalls (Windows firewall, intermediate firewalls, iptables/firewalld on Linux/UNIX) are disabled or have rules allowing TCP/UDP traffic to and from the hubs and robots.
  8. Confirm that no security software, e.g., anti-virus/malware software, is blocking the connection/communication.
    • On Windows systems do a careful check of Windows events (Application and System), not just errors but some AV software throws blocks as Informational events!
    • UNIX/Linux, check /var/log files and/or ask the system admins if there are any security apps installed that could possibly block communications between the local robot and Hubs
  9. Verify via netstat that the robot is LISTENING on port 48000 so that the hub can contact the robot on that specific port

robot listens on port 48000, for example:

netstat -an | findstr "48000"
TCP    0.0.0.0:48000          0.0.0.0:0              LISTENING
UDP    0.0.0.0:48000          *:*

hub listens on port 48002, for example:

netstat -an | findstr "48002"
TCP    0.0.0.0:48002          0.0.0.0:0              LISTENING
TCP    ##.###.###.###:48002    ##.###.###.###:xxxxx    ESTABLISHED
TCP    ##.###.###.###:48002    ##.###.###.###:xxxxx    ESTABLISHED
...
...
TCP    ##.###.###.###:xxxxx    ##.###.###.###:xxxxx    ESTABLISHED
UDP    0.0.0.0:48002          *:*
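
To check captured netstat output for a listening port without reading it by eye, something like the following could be used. It is a sketch that assumes the common `netstat -an` column layouts on Windows (state LISTENING) and Linux (state LISTEN); the function name is_listening is illustrative.

```python
def is_listening(netstat_output, port):
    """Return True if `netstat -an` output shows a TCP listener on `port`."""
    suffix = ":" + str(port)
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].upper().startswith("TCP"):
            continue
        # Windows: Proto Local Foreign State
        # Linux:   Proto Recv-Q Send-Q Local Foreign State
        local = fields[3] if fields[1].isdigit() else fields[1]
        state = fields[-1].upper()
        if local.endswith(suffix) and state in ("LISTENING", "LISTEN"):
            return True
    return False


# Example usage: is_listening(captured_output, 48000) on a robot,
# or is_listening(captured_output, 48002) on a hub.
```

Only the local address column is matched, so established connections that merely involve the port are not reported as listeners.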



Check Hotfixes

  • Check the latest hub AND robot version/hotfix release notes for similar issues that may have already been fixed.
  • If telnet to/from Hub<->Robot fails, the underlying issue could be related to hosts/subnets or IP address ranges being allowed/disallowed, network packet filtering, network routing, and/or intermediate firewalls.
  • Protocols 'TCP' AND 'UDP' MUST be allowed bi-directionally for Hub<->robot communication. This is a requirement.

 



Check Anti-Virus



Advanced Troubleshooting

Network Protocol/Connectivity Analysis and Tracing

  • If after checking all of the factors mentioned above, the Robot<->Hub communication issue(s) persist, then the next step is to do a network test using Wireshark to perform a trace at the same time when you're trying to reach the robot on port 48000 or the hub on port 48002. All robots listen on their default port 48000. All hubs listen on their default port 48002.
  • The Security/firewall team should check security software/firewall logs while running the telnet test between the robot and Hub
  • The network team should check routes when applicable (tracert from hub to robot, and vice versa, should work successfully) and check the communication between hub and robot using a Wireshark trace for the source (robot) and the destination (hub). For example, you may see a high number of TCP errors such as Out-of-Order segments, Duplicate ACKs, and retransmissions, which may indicate networking/routing issues.


For example, in a Wireshark trace on a given hub, when a connection or telnet command is run to test the connection to the robot/hub on port 48000/48002, you may see retransmissions. These indicate that one or more of the confirmations (ACKs) did not succeed during the 3-way handshake, e.g., the robot connects to the hub and the hub sends its reply, but the robot's confirmation back to the hub fails. In a 3-way handshake between a client and a server (or a robot and a hub):

  1. Source sends SYN (synchronize) to target
  2. Target responds with SYN-ACK
  3. Source responds with ACK. If it doesn't, then routing may be in question, especially if traceroute gives different results for other robots in the same network segment that CAN connect to the hub.

This may be evidenced during a test from the robot to the parent or Primary hub using telnet:

In this case below, telnet to the Primary hub FROM the robot seems to successfully connect for just a moment, but then the connection is closed.

  • Perform the Wireshark capture on the hub and/or the robot. Run the telnet from the command line on each machine in turn.
  • Select the network interface, e.g., eth0, and then click the Wireshark (shark fin) icon to start the capture and make sure you're seeing traffic.
  • You can then filter on an ip address using ip.addr==<ip_address> and restart the capture.


From the robot TO the Primary hub's robot port (48000) or hub port (48002)

     telnet hubx.example.com 48000
  Trying 10.xx.xx.xx…
  Connected to hubx.example.com
  Escape character is '^]'.
  Connection closed by foreign host.
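
The telnet result above (connected, then immediately closed) can be told apart from an outright connection failure with a small probe. This is a sketch under stated assumptions: the function names and the result labels ("open", "closed_by_peer", "reset", "connect_failed") are invented for illustration, and a result of "open" only means the peer did not close the connection within the hold period.

```python
import socket


def classify_connected(sock, hold=2.0):
    """Classify an already-connected TCP socket by watching it for `hold` seconds."""
    sock.settimeout(hold)
    try:
        data = sock.recv(1)
    except socket.timeout:
        return "open"    # nothing received and the peer kept the connection open
    except ConnectionResetError:
        return "reset"   # the peer, or a device in the path, sent a TCP RST
    # An empty read means the peer closed the connection in an orderly way.
    return "open" if data else "closed_by_peer"


def probe(host, port, timeout=3.0, hold=2.0):
    """Connect to host:port, then report how the connection behaves."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect_failed"
    with sock:
        return classify_connected(sock, hold)


# Example usage (hostname taken from the telnet example above):
#   probe("hubx.example.com", 48000)
```

A "closed_by_peer" or "reset" result corresponds to the 'Connection closed by foreign host' behavior shown above and is a reason to capture the exchange in Wireshark.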


There may be different/unexpected routes from the robot to the parent hub or the Primary hub. This may be evidenced by the telnet command results. In some cases, router-switch misconfiguration may be the cause, e.g., the configured range of IPs for policy-based routing (PBR) includes the problematic robot IP. The range of IPs may have been added on one end of the network route but not the other.

This type of misconfiguration, which may also include an asymmetric route, can only be checked and analyzed by the network team. Asymmetric routing is when network packets leave via one path and return via a different path (unlike symmetric routing, in which packets come and go using the same path), resulting in 'half a conversation' so to speak. This can cause communication issues between any client and server and, in this case, between hub and robot.



Retransmissions

There are four common reasons for packet retransmissions:

  • The lack of an acknowledgment that data has been received within a reasonable time
  • The sender discovering that transmission was unsuccessful (usually through out-of-band means)
  • The receiver notifying the sender that expected data hasn’t been received
  • The receiver discovering that data has been damaged/corrupted during initial transmission


Shown below is an example of Wireshark capture showing retransmissions between a robot and a hub where the final ACK did not succeed from the robot to the hub.

Additional Information


Robots installed in a DMZ

  • When a robot is installed in the DMZ and the hub is outside the DMZ, it may be worthwhile for the network team to run a Wireshark trace on the robot to see if it's sending/receiving packets to/from the hub, and also on the hub to view the current traffic. This is especially helpful if significant effort has already gone into troubleshooting the robot communication issue.
  • Also, you can ask the firewall administrator to check the firewall logs to see what TCP/UDP traffic, if any, to/from the robot is being blocked when it's sent to the hub.
  • You can use security rules (allow the hub outside the DMZ to communicate with the robots on port 48000, or limit communication to specific source:destination pairs), or install a UIM hub within the DMZ and establish a tunnel to a hub outside the DMZ, using either the default tunnel port 48003 (recommended) or port 443 if the security team prefers it.
  • The tunnel could run from the hub inside the DMZ TO a remote hub or back to the Primary hub itself.


OS-specific notes

  • AIX security software/firewalls: the lsfilt command lists the filter rules present in the table. When created, each rule is assigned a number, which can be easily seen using this command.
  • In one case, telnet TO the hub on port 48002 from the DMZ robots worked fine, but telnet in the other direction, to the robot on port 48000, failed.
  • In that case, security software installed on the machines in the DMZ (Illumio Adaptive Security software) prevented incoming connections because the port was not 'whitelisted.'
  • Access was granted and communication was then bi-directional between the robots/agents and the hubs.

AIX mkfilt firewall command reference



Infrastructure Manager (IM) or controller communication errors

When trying to open the controller or spooler probe on a robot, you receive a communication error:

Unable to reach controller,
 node:
/<domain>/<hub>/<robot>/controller
 error message: communication error

When there are no tunnels between hubs, the hub acts only like a "DNS" server that provides the IP:Port of the hubs and robots. So not only do you need to be able to telnet from hub to robot on, for instance, port 48000 and from robot to hub on port 48002, but you must also be able to do the following:

  • telnet to the robot(s) FROM any Infrastructure Manager (IM) 'workstation' / laptop and also FROM the Primary hub, where the Admin Console is hosted, even if the robots aren't under the Primary hub

  • You MUST open the firewall between the Primary hub and the Secondary hub's robots

  • You MUST open the firewall FROM IM workstations TO the Secondary hub's robots

  • In terms of opening the firewall, just TCP is sufficient as we only use UDP for "findhub" when the robot is searching the subnet for a nearby hub when its hub goes down

  • Ports should be opened, e.g., 48000-48100 but the range max number depends on how many probes a customer has on the robots as each probe will also need to be able to receive direct communication on its assigned port. We usually recommend 48000-48100 to be safe as it allows for more probes/probe ports

  • Another alternative where communication between hubs is cross-network/subnet would be to create a tunnel (and use port 48003) between the Primary and Secondary hubs and see if that allows the communication

  • In general, always double-check to make sure that the local firewall is disabled, e.g., Windows firewall or run systemctl status firewalld to check if it's enabled on the Linux OS, then disable it, or make sure exceptions are in place for the robot allowing TCP/UDP, and port 48000. If the firewall was blocking and you could not open the controller GUI, consider permanently disabling it.

  • If you login locally to the hub that the robot belongs to (sits under), and install the Infrastructure Manager (IM) from /uimhome page, can you then login to IM, and see the Robot in the left-side navigation window in IM?

  • If so, choose Tools->Connect from the IM menu and see if you can click 'Get Info' for that hostname or the IP address. Does this return results?

  • If you can, try opening the controller GUI from that parent hub in IM. Did this work?

  • If not, run a Wireshark trace: select the local NIC, use a filter such as ip.addr==<robot ip_address>, and see if there is any blocking or any TCP RST packets.

A TCP Reset (RST) packet is used by a TCP sender e.g., the parent hub, to indicate that it will neither accept nor receive more data. Out-of-path network management devices may generate and inject TCP Reset packets in order to terminate undesired connections.

It is highly recommended that you also contact your security/firewall team to check the firewall logs while you're conducting communication tests if the robot or hub is behind a firewall and/or in a DMZ, and/or if you've checked just about everything else listed in this document but the connection is still failing or you cannot open the controller GUI in IM.


Related Links

Robots registering with 127.0.0.1 Loopback IP/169.x APIPA/DHCP address