Duplicate, missing, inactive and/or broken robots in IM after upgrading hub to 9.31 or 9.33

Article ID: 208412

Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • Unified Infrastructure Management for Mainframe
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

After upgrading to UIM 20.3x and/or hub 9.31, I am seeing issues with my robots in Infrastructure Manager (IM), such as:

- some robots appear more than once (dupes) 
- some robots do not appear at all 
- 'ghost' robots - some robots appear at first, and then disappear later
- many robot inactive alarms are being generated for robots that are appearing or have been duplicated 
- it may shift over time, and affect a different set of robots 
- some robots appear red when they were fine before 
- some robots, when you click on them, do not show any probes
- probes are not displayed on the robot
- a probe does not show up after its package is deployed to the robot
- probes cannot be opened on one or more robots
- any combination of the above behaviors

Environment

  • Release: 20.3
  • Component: UIM - HUB
  • hub v9.31, v9.33

Cause

The root cause of this issue is generally related to robot cloning but that is not the only cause.

Very likely, someone in your environment has, recently or at some point in the past, been creating robots by cloning: they keep a master image with the UIM robot already installed and create new machines as clones of that image.

Cloning in this way requires certain best practices, which most likely were not followed:
https://knowledge.broadcom.com/external/article/33662/best-practices-for-cloning-systems-with.html

So to explain what the underlying problem is:  Every robot has a robot_device_id (also known as dev_id) associated with it and all the metrics for the robot get linked to the robot based on this ID.  This device ID is uniquely generated at install time and is based on a hash of the robot's name and IP address.

If you clone a machine that already has the robot installed without following the best practices above, the clone inherits the robot_device_id as well, which causes numerous problems with metrics and duplication.
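To make the collision concrete, here is a minimal Python sketch that flags device IDs shared by more than one robot. It assumes you already have a list of (robot name, robot_device_id) pairs from your own inventory; how you export that list is not covered here, and the function name and sample values are hypothetical.

```python
from collections import defaultdict


def find_duplicate_device_ids(robots):
    """Return {device_id: [robot names]} for IDs used by more than one robot.

    `robots` is an iterable of (robot_name, robot_device_id) pairs.
    The input format is an assumption made for this illustration.
    """
    names_by_id = defaultdict(set)
    for name, dev_id in robots:
        names_by_id[dev_id].add(name)
    # Keep only the IDs that more than one distinct robot reports.
    return {
        dev_id: sorted(names)
        for dev_id, names in names_by_id.items()
        if len(names) > 1
    }


# Hypothetical sample: web02 was cloned from web01 and kept its device ID.
sample = [
    ("web01", "D1A2B3"),
    ("web02", "D1A2B3"),
    ("db01", "D9F8E7"),
]
duplicates = find_duplicate_device_ids(sample)
```

Any robot names grouped under the same ID are candidates for the device ID reset described in the Resolution section.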

In addition, unforeseen changes in the new version of the hub, v9.31, turned these duplicate device IDs into the robot duplication problem now seen in IM.

The KB article linked above has more detail, and you will need to share that process with the teams who create new images and spin up new servers. In short, they must either remove the contents of the niscache folder on the "master image" before making clones, or clear it out on each clone after creating it.
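As an illustration of the "remove the niscache folder contents" step only, the following Python sketch empties a niscache folder while keeping the folder itself. The function name is hypothetical, the folder path is passed in by the caller because its location depends on your install, and the full cloning procedure in the KB article still applies.

```python
import shutil
from pathlib import Path


def clear_niscache(niscache_dir):
    """Delete everything inside niscache_dir and return the number of entries removed.

    The folder itself is kept. The name check is a safety guard added for
    this sketch so that a mistyped path does not wipe an unrelated folder.
    """
    root = Path(niscache_dir)
    if root.name != "niscache" or not root.is_dir():
        raise ValueError(f"not a niscache folder: {root}")
    removed = 0
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subfolder and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed
```

A call might look like `clear_niscache("/opt/nimsoft/niscache")`, where the path is an assumed default Linux install location; adjust it to your environment.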

Resolution

If you're impacted by the duplication or any of the other symptoms listed in the Issue/Introduction section above, please do the following:

- Upgrade all hubs in the environment to hub 9.33HF1, which is available at the UIM hotfix site:

https://support.broadcom.com/external/content/release-announcements/CA-Unified-Infrastructure-Management-Hotfix-Index/7233

or directly here:

   hub_9.33_HF1.zip

- RENAME (do not delete, in case you need to restore) the file ($NIMSOFT_HOME)/nimsoft/hub/robot.sds on every hub.

- After renaming this file, stop the hub (on Linux, run "service nimbus stop"; on Windows, stop the Nimsoft Robot Watcher service).

- Wait a full 6 minutes (use a timer/stopwatch) and then start the hub again.

- After the restart, many or all of your robots may be temporarily missing from the Infrastructure Manager (IM) GUI.

- Wait approx. 15 minutes. All active robots should check in and re-register, and they should then appear correctly, as they did prior to the upgrade.

Note that if some robots are missing, restarting the Nimsoft Robot service should make them show up in the console again. If the robots do not reappear, you may not have left the hub down long enough.  Try the steps above again, and this time, wait a full 10 minutes.

If it still doesn't work, restore the robot.sds file you renamed earlier, restart the hubs, and then contact support to troubleshoot why the robots aren't coming back.

Note: if the robots are able to reach a secondary hub, then within the 5-minute shutdown period, the robots may temporarily assign themselves to that secondary hub. If this is the case, then after you start the hub back up, you can either restart each robot to move them back to their original hub, or you can shut down the secondary hub for 5 minutes (without removing robot.sds) in which case they should then move back to their original hub.
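The rename and restore of robot.sds described above can be scripted. The sketch below is one possible approach, not an official tool: the function names and the backup suffix format are hypothetical, and the hub directory is passed in by the caller. Stopping and starting the hub, and the waiting periods, are still done as described in the steps above.

```python
import time
from pathlib import Path


def rename_robot_sds(hub_dir, suffix=None):
    """Rename robot.sds in hub_dir to robot.sds.<suffix> and return the new path.

    The file is renamed rather than deleted so that it can be restored.
    If no suffix is given, a timestamp is used.
    """
    src = Path(hub_dir) / "robot.sds"
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found")
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"robot.sds.{suffix}")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    src.rename(dst)
    return dst


def restore_robot_sds(backup_path):
    """Put a renamed robot.sds back in place and return the restored path.

    Any robot.sds the hub has created in the meantime is overwritten.
    """
    backup = Path(backup_path)
    dst = backup.with_name("robot.sds")
    backup.replace(dst)
    return dst
```

Keep the path returned by `rename_robot_sds` for each hub; it is the argument `restore_robot_sds` needs if the robots do not come back and the original file has to be restored.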

- Next, follow this KB article to reset the device IDs on every active robot in your environment:

https://knowledge.broadcom.com/external/article?articleId=208622

After this step, you should be able to upgrade the hub to the very latest version. Make sure that no new clones are spun up without following the best practices linked above, or the issue will return.

Additional Information

Best Practices for cloning systems with Nimsoft (UIM) Robot already installed
https://knowledge.broadcom.com/external/article/33662

How can I clear niscache and/or reset the robot_device_id of every active robot in my UIM domain?
https://knowledge.broadcom.com/external/article/208622

Background

Why hub v9.33HF1 or higher must be deployed to resolve one or more of the symptoms listed in this article:

Hub v9.20HF16, which was merged forward into hub v9.31, included a fix for situations where renaming a robot caused the hub to send "robot inactive" alarms. (Support Case: 31888942)

In this hotfix, the robot_device_id was used for checking robot uniqueness at the hub level to handle robot name changes. Unfortunately, this caused issues in certain environments where there was a possibility of duplicate device IDs.

In 9.33HF1, Development reverted the changes made for defect DE459104.