Check with your team/teammates and network/security contacts whether anything changed on the date when the robots first started exhibiting connectivity/communication issues, e.g., an upgrade of the robot, a change in IP address or configuration, changes to networking/routing, new devices added to the network, policy-based routing (PBR) changes, or security changes. Has any new/additional security software been installed on <date>, when the issue started occurring?
Run Basic Tests
Run the hostname command on the robot machine to make sure it displays the correct/expected hostname
Test whether you can ping the robot from the hub and vice versa, but note that ping (ICMP) is sometimes blocked by security policy.
Check that each host can resolve the other's hostname:
From the hub, nslookup <robot_hostname>
From the robot, nslookup <hub_hostname>
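As a minimal sketch of the resolution check above, the helper below uses getent (which consults the same resolver order as most Linux services: hosts file, then DNS). The demo resolves localhost; substitute your actual hub/robot hostnames when running it for real.

```shell
#!/bin/bash
# Sketch: verify forward name resolution the way the robot's OS would.
# Replace "localhost" with <robot_hostname> or <hub_hostname> as needed.
resolve_host() {
    # getent hosts honors /etc/nsswitch.conf (files, dns, ...)
    getent hosts "$1" | awk '{print $1; exit}'
}

ip=$(resolve_host localhost)
if [ -n "$ip" ]; then
    echo "localhost resolves to $ip"
else
    echo "resolution FAILED for localhost"
fi
```

If getent and nslookup disagree, the hosts file is likely overriding DNS on that machine.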
See if you can connect successfully to and from the hub<->robot via telnet, e.g.:
telnet TO the robot FROM the hub on port 48000
telnet TO the hub FROM the robot on port 48002
If telnet is disabled, enable it if possible, or temporarily use the PuTTY utility if there is no other option.
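Where neither telnet nor PuTTY is available, bash's built-in /dev/tcp pseudo-device can stand in for a quick TCP port probe. This is a sketch, assuming bash and the coreutils timeout command are present; the loopback test at the end is just a demo, so substitute the real robot/hub host and port.

```shell
#!/bin/bash
# Sketch: TCP port probe without telnet, using bash's /dev/tcp.
check_port() {
    local host=$1 port=$2
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "OPEN  $host:$port"
        return 0
    else
        echo "CLOSED/FILTERED  $host:$port"
        return 1
    fi
}

# Demo only: probe the local robot controller port (likely closed on a
# machine without a robot installed). Replace with hub/robot hostnames.
check_port 127.0.0.1 48000 || true
```

An immediate "CLOSED/FILTERED" usually means a refused connection (service down); a 3-second timeout suggests a firewall silently dropping packets.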
On the robots, check in IM/AC whether the following robot probes display as green and have both ports AND PIDs.
Are the controller, hdb, and spooler processes running? If not, it's possible that they are either being blocked or the robot installation was not 100% successful.
On Windows, check the ‘Nimsoft Robot Watcher Service’ to make sure it's running.
Is the robot (controller) writing to the log file? If not, the controller may be hung, or the loglevel may need to be increased (try loglevel 5 or 6).
At loglevel 5, with logsize set to 100000, check the controller and spooler logs and/or the hub logs. Press F4 to highlight (in red) what you're looking for, such as 'error', 'fail', or 'exception'.
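The same keyword search can be scripted outside the log viewer. The sketch below scans a log file for the usual failure keywords; the path in the comment is an assumption, so adjust it to your install directory, and the demo runs against a throwaway file.

```shell
#!/bin/bash
# Sketch: scan a robot log for failure keywords instead of paging manually.
# Real path is typically something like /opt/nimsoft/robot/controller.log
# (an assumption -- adjust to your installation).
scan_log() {
    # -i: case-insensitive, -n: line numbers, -E: extended regex
    grep -inE 'error|fail|exception' "$1" | tail -n 20
}

# Demo against a throwaway log file:
tmp=$(mktemp)
printf 'Jan 01 12:00:00 [main] login OK\nJan 01 12:00:01 [main] Connection error: timeout\n' > "$tmp"
scan_log "$tmp"
rm -f "$tmp"
```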
Run Advanced Tests
Check Robot processes
ps -ef | grep nim should display the three robot processes: controller, hdb, and spooler.
On the robot, if only the controller is running and not the hdb and spooler, then it's possible that either the installation did not complete or there is a local firewall enabled and blocking some ports/protocols.
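To make that check repeatable, the sketch below reports which of the three daemons are missing from ps-style output. It parses captured text rather than the live process table so it can be demonstrated anywhere; on a real robot you would feed it "$(ps -ef | grep nim)". The sample paths are illustrative only.

```shell
#!/bin/bash
# Sketch: report which robot daemons are absent from `ps -ef`-style text.
missing_daemons() {
    local input=$1 missing=""
    for d in controller hdb spooler; do
        echo "$input" | grep -q "$d" || missing="$missing $d"
    done
    echo "${missing:-none missing}"
}

# Demo with captured sample output (hdb deliberately absent):
sample='nimbus 101   1 0 ? 00:00 /opt/nimsoft/bin/controller
nimbus 103 101 0 ? 00:00 /opt/nimsoft/bin/spooler'
missing_daemons "$sample"
```

On a live robot: missing_daemons "$(ps -ef | grep nim)" — a missing hdb/spooler with a running controller points at an incomplete install or a local firewall, as noted above.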
Trace the network route (source<->destination)
Run a tracert (Windows) / traceroute (Linux/UNIX) command to see if the robot has a network route/path to follow to reach its Parent hub.
Run a tracert (Windows) / traceroute (Linux/UNIX) command to see if the robot has a network route/path to follow to reach the Primary hub (if it's expected to be able to connect to the Primary)
From hub to robot and robot to hub: does the trace complete successfully, and if so, does it follow the expected route according to the network team?
Check for any crashes or problematic events in the Windows event logs (Application/System), on each machine. For example, application crash dumps, and anti-virus blocking.
IMPORTANT: Note that some AV software may generate events that are not categorized as ERROR, they may be categorized as Informational but still cause an issue, e.g., blocking a process/subprocess, etc. CB/Carbon Black is known for categorizing blocks as Informational events.
Does the Global/Local Anti-Virus configuration contain a full exclusion for ALL Nimsoft Programs? This is a requirement.
Advanced Testing and Analysis
Test access and function from a closer vantage point
Install Infrastructure Manager (IM) on the hub that the problematic robot reports to, then log in and see whether the robot displays green and its controller opens in IM without issues. If it does, the issue may simply be that you cannot reach the robot from the location where your IM is installed, e.g., a laptop over a VPN connection.
Check to make sure that no local (or intermediate-remote) firewall is auto-enabled/blocking any TCP or UDP traffic to/from the hub and robot:
Use the service iptables status command (RHEL 6) or firewalld commands such as firewall-cmd --state (RHEL 7) to check the firewall state
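A small sketch can detect which firewall front end a Linux host uses and report its state, covering both the firewalld (RHEL 7+) and older iptables cases. Non-root users may only be able to confirm the tool's presence, which the script notes rather than failing.

```shell
#!/bin/bash
# Sketch: report which Linux firewall tooling is present and its state.
firewall_state() {
    if command -v firewall-cmd >/dev/null 2>&1; then
        # firewalld (RHEL 7+): prints "running" or fails when stopped
        firewall-cmd --state 2>/dev/null || echo "firewalld present but not running"
    elif command -v iptables >/dev/null 2>&1; then
        # Legacy iptables (RHEL 6 era); listing rules needs root
        if iptables -L -n >/dev/null 2>&1; then
            echo "iptables rules readable (run 'iptables -L -n' for detail)"
        else
            echo "iptables present (need root to list rules)"
        fi
    else
        echo "no firewalld/iptables tooling found"
    fi
}
firewall_state
```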
Check with your security team to make sure no Intrusion Prevention System (IPS) / Intrusion Detection System (IDS) is interfering with the connections. Note that this may be a factor given different regions/locations based on where the firewalls are located and/or how they are configured.
Temporarily disable the firewall if that is feasible, and then test communication again.
Also, if firewall(s) are enabled, check with your firewall team to make sure the proper rules are in place to ALLOW connectivity between the robots and their hubs and/or any related hub-to-hub connections (or tunnels via port 48003).
To make an analysis of robot<->hub communication/connectivity complete, you MUST check the firewall first.
To check Windows firewall:
Open the Search window ('Type here to Search' in the Taskbar)
Type in 'Firewall'
Then click 'Check firewall status'
Customers normally consult their Linux system administrators to check/manage their firewalls. If you have access, you can run these commands yourself; otherwise, check with your security team as to what is allowed.
RHEL 6 iptables:
service iptables status
service iptables stop
iptables -F   (flush the rules)
netstat -an | findstr "48002"
TCP    0.0.0.0:48002          0.0.0.0:0              LISTENING
TCP    10.xx.xxx.120:48002    10.xx.xxx.120:49201    ESTABLISHED
TCP    10.xx.xxx.120:48002    10.xx.xxx.120:49213    ESTABLISHED
...
TCP    10.74.240.120:65509    10.74.240.120:48002    ESTABLISHED
UDP    0.0.0.0:48002          *:*
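On Linux hosts, the equivalent check can be sketched with ss (falling back to netstat) to confirm that something is actually LISTENing on the hub port 48002 or robot port 48000. Port numbers below are the defaults from this document; the demo probes the local machine.

```shell
#!/bin/bash
# Sketch: is anything LISTENing locally on the given TCP port?
listening_on() {
    local port=$1
    if command -v ss >/dev/null 2>&1; then
        # -l listening, -t tcp, -n numeric
        ss -ltn 2>/dev/null | grep -q ":$port "
    else
        netstat -an 2>/dev/null | grep LISTEN | grep -q ":$port "
    fi
}

if listening_on 48002; then
    echo "hub port 48002 is listening"
else
    echo "nothing listening on 48002"
fi
```

If the port is not listening at all, the problem is the local hub/robot process, not the network in between.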
Check the latest hub AND robot version/hotfix release notes for similar issues that may have already been fixed.
If telnet to/from Hub<->Robot fails, the underlying issue could be related to hosts/subnets or IP address ranges being allowed/disallowed, network packet filtering, network routing, and/or intermediate firewalls.
Protocols 'TCP' AND 'UDP' MUST be allowed bi-directionally for Hub<->robot communication. This is a requirement.
It may be helpful to ask your security team to check your AV log carefully to be sure nothing is being blocked for UIM.
Network Protocol/Connectivity Analysis and Tracing
If the Robot<->Hub communication issue(s) persist after checking all of the factors mentioned above, the next step is to run a Wireshark trace while you're trying to reach the robot on port 48000 or the hub on port 48002. All robots listen on their default port 48000; all hubs listen on their default port 48002.
The security/firewall team should check security software/firewall logs while the telnet test between the robot and hub is running.
The network team should check routes where applicable (tracert from hub to robot, and vice versa, should complete successfully) and examine the hub<->robot communication in a Wireshark trace for the source (robot) and destination (hub). For example, you may see a high rate of TCP errors: Out-of-Order packets, Duplicate ACKs, and retransmissions, which may indicate networking/routing issues.
For example, in a Wireshark trace taken on a given hub while a connection or telnet command is run to test the connection to the robot/hub on port 48000/48002, you may see retransmissions. This indicates that one or more of the confirmations/ACKs is not succeeding during the 3-way handshake, e.g., the robot connects to the hub and the hub sends its reply, but when the robot tries to confirm the connection back to the hub, it fails. In a 3-way handshake between a client and a server (or a robot and a hub):
Source sends SYN (Sequence Number) to target
Target responds with SYN-ACK
Source responds with ACK. If it doesn't, then routing may be in question, especially if traceroute gives different results for other robots in the same network segment that CAN connect to the hub.
This may be evidenced during a test from the robot to the parent or Primary hub using telnet:
In this case below, telnet to the Primary hub FROM the robot seems to successfully connect for just a moment, but then the connection is closed.
From the robot TO the Primary hub's robot port (48000) or hub port (48002):
telnet hub1.company.com 48000
Trying 10.xx.xx.xx...
Connected to hub1.company.com
Escape character is '^]'.
Connection closed by foreign host.
There may be different/unexpected routes from the robot to the parent hub or the Primary hub; this may be evidenced by the telnet command results. In some cases, router/switch misconfiguration may be the cause, e.g., the range of IPs configured for policy-based routing (PBR) includes the problematic robot's IP, or the range was added on one end of the network route but not the other. This type of misconfiguration, which may also involve an asymmetric route, can only be checked and analyzed by the network team.
Asymmetric routing is when network packets leave via one path and return via a different path (unlike symmetric routing, in which packets come and go over the same path), resulting in 'half a conversation', so to speak. This can cause communication issues between any client and server, and in this case, between hub and robot.
There are four common reasons for packet retransmissions:
The lack of an acknowledgment that data has been received within a reasonable time
The sender discovering that transmission was unsuccessful (usually through out-of-band means)
The receiver notifying the sender that expected data hasn’t been received
The receiver discovering that data has been damaged/corrupted during initial transmission
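When reviewing a capture for the conditions above, a quick way to gauge retransmission volume is to count the tagged lines in a text export of the capture (e.g., the output of tshark -r capture.pcap). This is a sketch; the file name and excerpt below are fabricated placeholders.

```shell
#!/bin/bash
# Sketch: count retransmission lines in a text export of a packet capture.
count_retrans() {
    # grep -c prints 0 and exits nonzero on no match; || true keeps status 0
    grep -c 'Retransmission' "$1" || true
}

# Demo with a fabricated two-line capture excerpt:
tmp=$(mktemp)
printf '12 1.0 10.0.0.5 -> 10.0.0.9 TCP [TCP Retransmission] 48002\n13 1.1 10.0.0.5 -> 10.0.0.9 TCP [TCP Retransmission] 48002\n' > "$tmp"
echo "retransmissions: $(count_retrans "$tmp")"
rm -f "$tmp"
```

A handful of retransmissions is normal on busy networks; a steadily climbing count during the telnet test points at the routing/filtering issues described above.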
Shown below is an example of a Wireshark capture showing retransmissions between a robot and a hub, where the final ACK from the robot to the hub did not succeed.
Robots installed in a DMZ
When a robot is installed in the DMZ and the hub is outside the DMZ, it may be worthwhile for the network team to run a Wireshark trace on the robot to see if it is sending/receiving packets to the hub, and also on the hub to view the current traffic.
Also, you can ask the firewall administrator to check the firewall logs to see what TCP/UDP traffic, if any, is being blocked when the robot sends it to the hub.
You can use security rules (allow the hub outside the DMZ to communicate with the robots on port 48000, or limit communication to specific source:destination pairs), or install a UIM hub within the DMZ and establish a tunnel to the hub outside the DMZ, using either the default tunnel port 48003 (recommended) or port 443 if the security team prefers it.
Tunnel could be hub inside DMZ TO remote hub or back to the Primary Hub itself.
AIX security software/firewalls: lsfilt lists the filter rules present in the table. When created, each rule is assigned a number, which can easily be seen using this command.
In one case, telnet TO the hub on port 48002 from the DMZ robots worked fine, but telnet in the other direction, to the robot on port 48000, failed.
In that case, security software installed on the DMZ machines (Illumio Adaptive Security software) prevented incoming connections because the port was not 'whitelisted.'
Once access was granted, communication between the robots/agents and the hubs was bi-directional.
Infrastructure Manager (IM) communication issues or errors
When trying to open the controller or spooler probe on a robot, you receive a communication error:
Unable to reach controller, node: /<domain>/<hub>/<robot>/controller error message: communication error
When there are no tunnels between hubs, the hub only acts like a "DNS" server that tells clients the IP:Port of the hubs and robots. So not only do you need to be able to telnet from hub to robot and robot to hub, you must also be able to do the following:
telnet to the robots FROM any Infrastructure Manager (IM) 'workstation' / laptop and also FROM the Primary hub, where the Admin Console is hosted, even if the robots aren't under the Primary hub.
Open the firewall between the Primary hub and the Secondary hubs' robots
Open the firewall from IM workstations TO the Secondary hubs' robots
In terms of opening the firewall, just TCP is sufficient as we only use UDP for "findhub" when the robot is searching the subnet for a nearby hub when its hub goes down.
Ports should be opened in a range, e.g., 48000-48100, but the exact number depends on how many probes a customer has on the robots, as each probe also needs to be able to receive direct communication on its assigned port. We usually recommend 48000-48100 to be safe and allow for more probes/probe ports.
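The range arithmetic above can be sketched as follows; the probe headroom value is an assumption to adjust per estate, and the firewalld rule in the comment is shown only for illustration (it requires root and firewalld).

```shell
#!/bin/bash
# Sketch: work out the port range to request from the firewall team.
FIRST_PORT=48000
MAX_PROBES=100          # headroom for probe ports, an assumed estimate
LAST_PORT=$((FIRST_PORT + MAX_PROBES))
echo "open TCP ${FIRST_PORT}-${LAST_PORT} between hub and robots"
# Illustrative firewalld rule (RHEL 7+, root required):
#   firewall-cmd --permanent --add-port=48000-48100/tcp && firewall-cmd --reload
```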
Another alternative is to create a tunnel (using port 48003) between the Primary and Secondary hubs and see if that allows the communication.