One of our hubs is not stable, and monitoring is adversely affected as a result. We have restarted the Nimsoft service several times, but the issue persists. CPU utilization is high on this server, with the hub process consuming most of it.
The initial problem was spurious entries for the controller probe in the probe administration section of security.cfg, which were causing robot logins to fail. Once these entries were corrected, robots started logging in and contacting their hubs.
After the above change, the hub was hitting 100% CPU continuously. Increasing the system resources stabilized the CPU usage.
Next, the hub probe on hub(b) was restarting continuously. Because the hub was busy, it was not responding to the local controller's status check, so the controller restarted the hub on the assumption that it was unreachable. The hub(b) controller was configured to wait longer before considering the hub non-responsive. After the change, the hub stabilized and stopped restarting itself.
After the third issue was addressed, hub(b) was still intermittently inaccessible. Connections on the hub port (48002) and the spooler port (48001) were found to be stacking up, leaving many connections in CLOSE_WAIT state and keeping the hub busy. To decrease the load on the spooler, the number of post threads on the spooler was increased to 200. This allowed more connections on 48001 and made the port more accessible than before, but connections continued to stack up.
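One way to watch for the connection build-up described above is to count connection states per port from `netstat -an` output. The following is a minimal sketch, not part of the original diagnosis: the function name `count_states` and the sample lines are hypothetical, and it assumes each TCP line lists the local address, then the remote address, then the state.

```python
from collections import Counter


def count_states(netstat_output, ports=(48001, 48002)):
    """Count TCP connection states per local port from `netstat -an` style text.

    Assumes each TCP line contains a local address, a remote address, and
    then the connection state (the layout used by common netstat builds).
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].lower().startswith("tcp"):
            continue
        # The first two fields containing ":" are the local and remote addresses.
        addr_idx = [i for i, f in enumerate(fields) if ":" in f]
        if len(addr_idx) < 2 or addr_idx[1] + 1 >= len(fields):
            continue  # no state column (or not a connection line)
        port_text = fields[addr_idx[0]].rsplit(":", 1)[1]
        state = fields[addr_idx[1] + 1]
        if port_text.isdigit() and int(port_text) in ports:
            counts[(int(port_text), state)] += 1
    return counts


# Illustrative sample only; addresses and counts are made up.
SAMPLE = """\
tcp        0      0 10.0.0.5:48002     10.0.0.21:51012    CLOSE_WAIT
tcp        0      0 10.0.0.5:48002     10.0.0.22:51340    ESTABLISHED
tcp        1      0 10.0.0.5:48001     10.0.0.23:49877    CLOSE_WAIT
tcp        1      0 10.0.0.5:48001     10.0.0.24:49878    CLOSE_WAIT
tcp        0      0 10.0.0.5:22        10.0.0.30:60000    ESTABLISHED
"""

if __name__ == "__main__":
    for (port, state), n in sorted(count_states(SAMPLE).items()):
        print(port, state, n)
```

A CLOSE_WAIT count on 48001 or 48002 that keeps growing between runs would match the stacking behaviour reported here.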
Robots check the status of their hub every minute. If the hub is unreachable on two consecutive checks, the robot switches to its configured secondary hub (hub(a) for robots under hub(b), and vice versa). Because hub(b) was busy, all of its robots were switching to hub(a), which in turn made hub(a) busy, so after some time the robots moved back to hub(b). Every hub switch on the robot side adds traffic on 48002 and makes the hub even busier. To reduce the status traffic from robots to the hub, the status check interval was increased from one minute to five minutes on the robots under hub(b). This improved the accessibility of port 48002, but the hub was still intermittently inaccessible.
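The failover rule and the effect of the longer interval can be modelled roughly as follows. This is a simplified sketch of the behaviour described above, not the robot's actual implementation; the function names and the robot count of 500 are hypothetical and chosen only for illustration.

```python
def failover_check_index(check_results, threshold=2):
    """Return the index of the status check that triggers a hub switch.

    check_results is a sequence of booleans (True = hub answered).
    The switch happens on the check that completes `threshold`
    consecutive failures; returns None if that never happens.
    """
    failures = 0
    for i, ok in enumerate(check_results):
        failures = 0 if ok else failures + 1
        if failures >= threshold:
            return i
    return None


def status_checks_per_hour(robot_count, interval_minutes):
    """Total status checks sent to a hub per hour by all of its robots."""
    return robot_count * 60 // interval_minutes


if __name__ == "__main__":
    # A single missed check does not trigger a switch; two in a row do.
    print(failover_check_index([True, False, True, False, False]))
    # Hypothetical hub with 500 robots, before and after the interval change.
    print(status_checks_per_hour(500, 1), status_checks_per_hour(500, 5))
```

Under this model, moving from a one-minute to a five-minute interval cuts status traffic by a factor of five (30000 versus 6000 checks per hour for the hypothetical 500 robots). The trade-off is that two consecutive missed checks now take roughly ten minutes to occur instead of two, so failover is slower.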
Even after all of the changes described above, hub(b) was more stable but still unreachable at times, and the hub was still using 33.3% or more of the CPU.
Since every change above points to load on the hub, the load appears to exceed what a hub supports. It was therefore suggested to offload robots from hub(b) and hub(a) so that each hub serves no more than 400-500 robots.
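The 400-500 robots per hub guideline above translates into a simple capacity calculation. The sketch below is illustrative only: the function name `hubs_needed` and the total of 1800 robots are hypothetical, since the actual robot count is not stated here.

```python
def hubs_needed(robot_count, max_robots_per_hub):
    """Smallest number of hubs that keeps every hub at or under the limit."""
    if max_robots_per_hub <= 0:
        raise ValueError("max_robots_per_hub must be positive")
    # Ceiling division using integers only.
    return -(-robot_count // max_robots_per_hub)


if __name__ == "__main__":
    total_robots = 1800  # hypothetical environment size
    for limit in (400, 500):
        print(limit, hubs_needed(total_robots, limit))
```

For the hypothetical 1800 robots, this gives five hubs at a limit of 400 and four hubs at a limit of 500.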
It was also suggested to rebuild hub(b) on a fresh machine to rule out any OS- or machine-related problems.
The hub was consistently stable after being 'reimaged.'