The discovery_server probe is not working as expected. The probeDiscovery queue intermittently stops sending messages; it grows very large and does not drain quickly enough. At times, messages are not processed at all.
Please do the following:
1. Allocate more (e.g., double) Java Memory for the discovery_server probe
In IM, open the raw configuration with Shift - right mouse then click on the probe
Select startup -> opt and update the following values, for example by adding 2 GB to both the minimum and the maximum over your current settings:
java_mem_max = -Xmx6144m
java_mem_init = -Xms4096m
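For reference, a minimal sketch of how the startup -> opt section of discovery_server.cfg might look after the change (the other keys in your file will vary; make the edit through Raw Configure rather than hand-editing the file):
<startup>
   <opt>
      java_mem_init = -Xms4096m
      java_mem_max = -Xmx6144m
   </opt>
</startup>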
2. REBUILD specific table indexes using a daily job that runs off-hours (if you do not have partitioning enabled)
These are the key tables whose indexes require defragmentation DAILY.
Ask your DBA to set up a job that runs this index defrag (rebuild) job DAILY, off-hours (see the fragmentation-check sketch after the statement list below).
ALTER INDEX ALL ON CM_COMPUTER_SYSTEM REBUILD;
ALTER INDEX ALL ON CM_DEVICE REBUILD;
ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ATTR REBUILD;
ALTER INDEX ALL ON CM_DEVICE_ATTRIBUTE REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_DEFINITION REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC_DEFINITION REBUILD;
ALTER INDEX ALL ON CM_NIMBUS_ROBOT REBUILD;
ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ORIGIN REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_ATTRIBUTE REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CI REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CS REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CS_CI REBUILD;
ALTER INDEX ALL ON CM_DISCOVERY_NETWORK REBUILD;
ALTER INDEX ALL ON S_QOS_DATA REBUILD;
ALTER INDEX ALL ON S_QOS_DEFINITION REBUILD;
ALTER INDEX ALL ON S_QOS_SNAPSHOT REBUILD;
ALTER INDEX ALL ON NAS_TRANSACTION_SUMMARY REBUILD;
ALTER INDEX ALL ON NAS_ALARMS REBUILD;
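Optionally, before scheduling the rebuild job, the DBA can confirm which of these indexes are actually fragmented. A minimal T-SQL sketch using the standard sys.dm_db_index_physical_stats DMV (the 30% threshold and the abbreviated table list are assumptions to adjust):
SELECT OBJECT_NAME(ips.object_id)       AS table_name,
       i.name                           AS index_name,
       ips.avg_fragmentation_in_percent AS frag_pct
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30   -- assumed threshold
  AND OBJECT_NAME(ips.object_id) IN ('CM_COMPUTER_SYSTEM', 'CM_DEVICE', 'S_QOS_DATA')  -- extend with the full table list above
ORDER BY ips.avg_fragmentation_in_percent DESC;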
3. Edit udm_manager.cfg (20.4 CU6 or older)
a. In the "Raw Configure" setup, add the following key under the "setup" section:" schema_transact_retries" and set the value to 5000.
b. Also change the "schema_connection_timeout_minutes" value to 30, or even as high as 90.
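A sketch of how the setup section of udm_manager.cfg might look after both edits (surrounding keys omitted; apply the changes via Raw Configure):
<setup>
   schema_transact_retries = 5000
   schema_connection_timeout_minutes = 30
</setup>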
4. Edit discovery_server.cfg
a. In the discovery_server probe's Raw Configure GUI, select the setup folder, then in the right-hand pane create a new section called nimbusscan.
b. Select the newly created nimbusscan folder from the left-hand pane, then add the following new key value in the right-hand pane:
nis_cache_update_interval_secs = 3600
This increases the discovery scan interval so that the discovery_server has more time to do its work.
c. Apply the changes, then click OK/close the window.
d. Cold start (Deactivate, then Activate) the discovery_server probe.
5. Add the following key under the discovery_server's setup -> datomic section (again via Raw Configure), then Apply and cold start the probe once more:
heartbeat_interval_msec = 30000
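Taken together, a sketch of the relevant portion of discovery_server.cfg after steps 4 and 5 (section layout is illustrative; other keys in these sections are omitted):
<setup>
   <nimbusscan>
      nis_cache_update_interval_secs = 3600
   </nimbusscan>
   <datomic>
      heartbeat_interval_msec = 30000
   </datomic>
</setup>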
Note that any time the vmware probe or another storage probe is restarted, larger graphs will reach the queue and the queue can fall behind once again; it should only fall behind for a while and then catch up.
6. Configure partial graph discovery
Configure any/all other storage probes (besides vmware, since this is already implemented in its GA version), e.g., the hp_3par and ibmvm probes, to publish partial graphs every monitoring interval by making the following configuration changes to each of these probes.
a. From the probe's Raw Configure GUI, select the setup folder from the left-hand pane, then add the following key value in the right-hand pane:
discovery_server_version = 9.02
NOTE: The version does not have to match the version of the discovery_server installed, it just has to be a value of 8.2 or higher.
b. Apply the change
c. Cold start (Deactivate, then Activate) the probe.
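As an illustration, after the change the storage probe's setup section would simply contain the new key alongside its existing entries (sketch; other keys omitted):
<setup>
   discovery_server_version = 9.02
</setup>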
If the probeDiscovery queue has fallen behind, it can be emptied, or disabled for a while and then re-enabled, to allow it to catch up.
Note that any time the VMware or other busy storage probes are restarted, full graph objects will make it to the queue and the queue may then fall behind once again.
7. Increase hub 'postroute_reply_timeout'
On the Primary hub or child robot where discovery_server has been deployed, increase the postroute_reply_timeout, for example from 180 to 300.
This value is in seconds (default 180) and controls how long the hub waits for a reply from the remote hub/subscriber after sending a bulk of messages on a queue before deciding the bulk did not go through and re-sending it.
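This key is normally set in the hub's configuration (hub.cfg) via Raw Configure; a sketch, assuming it sits under the hub section (verify the location in your environment):
<hub>
   postroute_reply_timeout = 300
</hub>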
8. Deactivate one or more instances of any recently deployed storage probes that are generating a high volume of discovery messages and causing the queue to back up.
9. In the hub GUI status window, right-click the probeDiscovery queue and choose Empty to give it a chance to catch up. You may have to do this three or more times.
10. Activate one instance of the storage probes that were causing the queue to back up, then recheck the probeDiscovery queue.
11. Activate another instance and recheck.
If the udm_inventory queue turns yellow in the hub GUI Status view and is not processing any data, or the udm_manager probe is throwing a memory error in the log, for example:
DEBUG [main, udm_manager] Calling Peer.createDatabase to establish Datomic connection.
DEBUG [main, udm_manager] Exception establishing Datomic connection: :db.error/not-enough-memory (datomic.objectCacheMax + datomic.memoryIndexMax) exceeds 75% of JVM RAM
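This error means the configured Datomic memory settings do not fit in the JVM heap: datomic.objectCacheMax plus datomic.memoryIndexMax must stay below 75% of the JVM RAM. For example, with a 4 GB heap (-Xmx4096m), the two values combined must remain under roughly 3072 MB (75% of 4096 MB).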
Try the following:
1. Deactivate udm_manager (this applies to DX UIM version 20.4 CU6 or older)
2. Right-click and delete the udm_manager probe
3. Delete the udm_manager folder on the Primary hub file system
4. Redeploy the udm_manager probe from the local archive on the Primary
5. Delete and Recreate the probeDiscovery queue
6. Cold start the discovery_server (Deactivate-Activate)
Additionally, we have seen scenarios like this where discovery queues back up after restarting services. As long as the queues are connected (green), the messages should eventually get processed; problems usually occur when the queue status is yellow. You can use Dr. Nimbus to view the messages in the queue, though you might need to restart the Nimsoft services again. While the queues remain in contact they may grow at times, but the messages should gradually be processed and clear out; in most cases, all that is needed is some time.
Also check for slow disk I/O via Task Manager/Resource Monitor.
Disks should be SSD and the Disk Queue Length should remain below 1, so please check both.
NOTE:
If everything listed above in this KB article has been addressed but the discovery_server itself still seems to be having problems and is not working consistently, e.g., performance/scalability issues, and/or it is also consuming large amounts of memory and increasing the memory available to the probe has not improved performance, then it may be worth 'offloading' the discovery_server to a child robot of the Primary:
How to offload the discovery_server to a child robot of the Primary Hub
https://knowledge.broadcom.com/external/article/135036/