Some of the most common symptoms of robot (controller) communication issues include:
Check with your team/teammates and your network/security contacts to find out whether anything changed on the date the robots first started exhibiting connectivity/communication issues, e.g., a robot upgrade, an IP address change, a configuration change, changes to networking/routing, new devices added to the network, Policy-Based Routing (PBR) changes, security changes, etc. Was any new or additional security software installed on <date> when the issue started occurring?
Running ps -ef | grep nim should display the 3 processes listed below:
ps -ef|grep nim
root 5937 5910 0 13:57 ? 00:00:01 nimbus(controller)
root 6652 5937 0 14:13 ? 00:00:00 nimbus(spooler)
root 6654 5937 0 14:13 ? 00:00:00 nimbus(hdb)
On the robot, if only the controller is running and not the hdb and spooler, then it's possible that either the installation did not complete or there is a local firewall enabled and blocking some ports/protocols.
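The check above can be scripted; the sketch below counts the nimbus processes from a saved sample that mirrors the listing above (on a live robot, pipe ps -ef itself and exclude the grep process so it does not count itself):

```shell
# Sketch: 3 nimbus processes (controller, spooler, hdb) means the robot is
# fully up. Sample output mirrors the ps listing above; on a live robot use:
#   ps -ef | grep 'nimbus(' | grep -v grep | wc -l
ps_output='root 5937 5910 0 13:57 ? 00:00:01 nimbus(controller)
root 6652 5937 0 14:13 ? 00:00:00 nimbus(spooler)
root 6654 5937 0 14:13 ? 00:00:00 nimbus(hdb)'
count=$(printf '%s\n' "$ps_output" | grep -c 'nimbus(')
echo "nimbus processes: $count"
```

A count of 1 (controller only) points to the incomplete-install or local-firewall causes described above.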
Note that some AV software may generate events that are not categorized as ERROR, they may be categorized as Informational but still cause an issue, e.g., blocking a process/subprocess, etc. CB/Carbon Black is known for categorizing blocks as Informational events.
Also, if firewall(s) are enabled, check with your firewall team to make sure the proper rules are in place to ALLOW connectivity between the robots and their hubs and/or any related hub-to-hub connections (or tunnels via port 48003).
To perform a complete analysis of robot<->hub communication/connectivity, you MUST check the firewall first.
To check the Linux firewall:
RHEL 6 (iptables):
service iptables status
service iptables stop
iptables -F (flush the rules)
RHEL 7, 8 (firewalld):
systemctl status firewalld
systemctl stop firewalld
To list all iptables firewall rules on Linux, enter the command:
iptables -L -n -v
Check that the robot.cfg has the correct hub and robot info/configuration
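For reference, the controller section of robot.cfg looks roughly like the sketch below; the key names and masked values are illustrative for a typical deployment and should be verified against your own file, not copied:

```
<controller>
   hub = primaryhub
   hubip = ##.###.###.###
   hubport = 48002
   robotip = ##.###.###.###
</controller>
```

Confirm that the hub name, hub IP, and robot IP match the actual environment, especially after any IP or configuration change.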
The robot listens on port 48000, for example:
netstat -an | findstr "48000"
TCP 0.0.0.0:48000 0.0.0.0:0 LISTENING
UDP 0.0.0.0:48000 *:*
The hub listens on port 48002, for example:
netstat -an | findstr "48002"
TCP 0.0.0.0:48002 0.0.0.0:0 LISTENING
TCP ##.###.###.###:48002 ##.###.###.###:xxxxx ESTABLISHED
TCP ##.###.###.###:48002 ##.###.###.###:xxxxx ESTABLISHED
TCP ##.###.###.###:xxxxx ##.###.###.###:xxxxx ESTABLISHED
UDP 0.0.0.0:48002 *:*
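A quick scripted version of these checks is sketched below; the sample reuses the listings above, whereas on a live system you would capture the output of netstat -an instead:

```shell
# Sketch: confirm the expected TCP listeners (48000 robot, 48002 hub) appear
# in netstat output. The sample mirrors the listings above.
netstat_sample='TCP 0.0.0.0:48000 0.0.0.0:0 LISTENING
UDP 0.0.0.0:48000 *:*
TCP 0.0.0.0:48002 0.0.0.0:0 LISTENING
UDP 0.0.0.0:48002 *:*'
for port in 48000 48002; do
  if printf '%s\n' "$netstat_sample" | grep -q ":${port} .*LISTENING"; then
    echo "port ${port}: listening"
  else
    echo "port ${port}: NOT listening"
  fi
done
```

A missing listener on 48000 or 48002 points back to the incomplete-install or firewall causes above.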
Network Protocol/Connectivity Analysis and Tracing
For example, in a Wireshark trace taken on a given hub while a connection or telnet test is run against the robot/hub on port 48000/48002, you may see retransmissions, which indicate that one or more of the confirmations/ACKs in the 3-way handshake is not succeeding. In a 3-way handshake between a client and a server (here, a robot and a hub), the client sends a SYN, the server replies with a SYN-ACK, and the client confirms with a final ACK. In this scenario, the robot connects to the hub and the hub sends its reply, but when the robot tries to confirm the connection back to the hub, the final ACK fails.
This may be evidenced during a test from the robot to the parent or Primary hub using telnet:
In this case below, telnet to the Primary hub FROM the robot seems to successfully connect for just a moment, but then the connection is closed.
From the robot TO the Primary hub's robot port (48000) or hub port (48002):
telnet hubx.example.com 48000
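When telnet is not installed on the robot, a hedged alternative is bash's /dev/tcp pseudo-device; the function below is a sketch (the host name reuses the hypothetical hubx.example.com from the example above, and the 3-second timeout is an assumption):

```shell
# Sketch: port probe via bash's /dev/tcp, usable where telnet is absent.
# check_port <host> <port> prints "open" or "closed"; requires bash and
# the coreutils "timeout" command.
check_port() {
  host=$1; port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}
check_port hubx.example.com 48002   # hypothetical hub from the example above
```

Run it from the robot against the hub (48002) and from the hub against the robot (48000) to test both directions.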
There may be different/unexpected routes from the robot to the parent hub or the Primary hub. This may be evidenced by the telnet command results. In some cases, router-switch misconfiguration may be the cause, e.g., the configured range of IPs for policy-based routing (PBR) includes the problematic robot IP. The range of IPs may have been added on one end of the network route but not the other.
This type of misconfiguration, which may also involve an asymmetric route, can only be checked and analyzed by the network team. Asymmetric routing occurs when network packets leave via one path and return via a different path (unlike symmetric routing, in which packets come and go over the same path), resulting in 'half a conversation,' so to speak. This can cause communication issues between any client and server, and in this case, between hub and robot.
Packet retransmissions most commonly have four causes: network congestion, packet loss, delayed or lost ACKs, and faulty network hardware.
Shown below is an example of Wireshark capture showing retransmissions between a robot and a hub where the final ACK did not succeed from the robot to the hub.
When trying to open the controller or spooler probe on a robot, you receive a communication error:
Unable to reach controller,
error message: communication error
When there are no tunnels between hubs, the hub acts only like a "DNS" server that tells clients the IP:port of the hubs and robots. So not only do you need to be able to telnet from hub to robot on, for instance, port 48000 and from robot to hub on port 48002, but you must also be able to do the following:
telnet to the robot(s) FROM any Infrastructure Manager (IM) 'workstation' / laptop and also FROM the Primary hub, where the Admin Console is hosted, even if the robots aren't under the Primary hub
You MUST open the firewall between the Primary hub and the Secondary hubs' robots
You MUST open the firewall FROM IM workstations TO the Secondary hubs' robots
In terms of opening the firewall, TCP alone is sufficient, as UDP is only used for "findhub" when the robot searches the subnet for a nearby hub after its hub goes down
Ports should be opened, e.g., 48000-48100, but the upper bound of the range depends on how many probes a customer has on the robots, since each probe also needs to receive direct communication on its assigned port. We usually recommend 48000-48100 to be safe, as it allows for more probes/probe ports
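A sweep over that range can confirm which probe ports actually answer; the sketch below reuses the /dev/tcp approach (the ROBOT value is a placeholder, and the range is shortened for illustration):

```shell
# Sketch: probe a slice of the recommended 48000-48100 range against a robot.
# ROBOT is a placeholder -- replace with the robot's hostname or IP.
ROBOT=127.0.0.1
open=0
for port in $(seq 48000 48005); do   # shortened range for illustration
  if timeout 1 bash -c "exec 3<>/dev/tcp/${ROBOT}/${port}" 2>/dev/null; then
    open=$((open+1))
    echo "port ${port}: open"
  fi
done
echo "open ports: ${open}"
```

Ports that stay closed despite a probe being assigned to them suggest a firewall rule gap in the opened range.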
Another alternative, where communication between hubs crosses networks/subnets, is to create a tunnel (using port 48003) between the Primary and Secondary hubs and see if that allows the communication
In general, always double-check that the local firewall is disabled: e.g., the Windows firewall, or run systemctl status firewalld to check whether it is enabled on Linux and then disable it. Alternatively, make sure exceptions are in place allowing the robot TCP/UDP traffic on port 48000. If the firewall was blocking traffic and you could not open the controller GUI, consider permanently disabling it.
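If disabling the firewall outright is not acceptable, adding exceptions is the alternative; the commands below are a sketch that assumes firewalld (RHEL 7+) and root access, and they are guarded so the sketch is a safe no-op elsewhere:

```shell
# Assumption: firewalld (RHEL 7+); must run as root. Guarded so this sketch
# does nothing on systems without firewall-cmd or without root.
if command -v firewall-cmd >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
  firewall-cmd --permanent --add-port=48000-48100/tcp   # controller + probe ports
  firewall-cmd --permanent --add-port=48000/udp         # optional: findhub discovery
  firewall-cmd --reload
else
  echo "firewall-cmd unavailable or not root; skipping"
fi
```

On RHEL 6 with iptables, equivalent ACCEPT rules for the same port range would be needed instead.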
If you log in locally to the hub that the robot belongs to (sits under) and install Infrastructure Manager (IM) from the /uimhome page, can you then log in to IM and see the robot in the left-side navigation window?
If so, choose Tools->Connect from the IM menu and see if you can click 'Get Info' for that hostname or the IP address. Does this return results?
If you can, try opening the controller GUI from that parent hub in IM. Did this work?
If not, run a Wireshark trace: select the local NIC, use a filter such as ip.addr==<robot ip_address>, and check for any blocking or TCP RST transmissions.
A TCP Reset (RST) packet is used by a TCP sender e.g., the parent hub, to indicate that it will neither accept nor receive more data. Out-of-path network management devices may generate and inject TCP Reset packets in order to terminate undesired connections.
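To isolate resets specifically, a Wireshark display filter along these lines can be combined with the address filter above (the address is a placeholder, as before):

```
ip.addr == <robot ip_address> && tcp.flags.reset == 1
```

If RSTs appear only in one direction, that is a hint that an in-path or out-of-path device, rather than the hub or robot itself, is terminating the connection.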
It is highly recommended that you also contact your security/firewall team to check the firewall logs while you're conducting communication tests if the robot or hub is behind a firewall and/or in a DMZ, and/or if you've checked just about everything else listed in this document but the connection is still failing or you cannot open the controller GUI in IM.