Discovery server queue is backing up and not working as expected

Article ID: 207256


Products

DX Infrastructure Management
CA Unified Infrastructure Management for z Systems
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)
NIMSOFT PROBES

Issue/Introduction

The discovery_server probe is not working as expected. The probeDiscovery queue is backing up: it is not draining quickly enough, grows very large, and during some periods messages are not sent or processed at all.

Environment

Release : 20.3

Component : UIM - DISCOVERY_SERVER

Resolution

Please do the following:

1. Allocate more (e.g., double) Java Memory for the discovery_server probe

In Infrastructure Manager (IM), hold Shift and right-click the probe, then select Raw Configure.

Select startup -> opt and update the following values, or add 2 GB to both the minimum and the maximum above your current settings, for example:

   java_mem_max = -Xmx6144m
   java_mem_init = -Xms4096m
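After applying the change, the startup section of discovery_server.cfg should look roughly like this (a sketch following the standard probe configuration layout; adjust the sizes to your environment):

```
<startup>
   <opt>
      java_mem_init = -Xms4096m
      java_mem_max = -Xmx6144m
   </opt>
</startup>
```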

2. Rebuild specific table indexes using a daily job that runs off-hours (if you do not have partitioning enabled)

These are the key tables with indexes that require daily defragmentation. Ask your DBA to set up a job that runs the following index rebuild script daily, during off-hours:

ALTER INDEX ALL ON CM_COMPUTER_SYSTEM REBUILD;
ALTER INDEX ALL ON CM_DEVICE REBUILD;
ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ATTR REBUILD;
ALTER INDEX ALL ON CM_DEVICE_ATTRIBUTE REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_DEFINITION REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC_DEFINITION REBUILD;
ALTER INDEX ALL ON CM_NIMBUS_ROBOT REBUILD;
ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ORIGIN REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_ATTRIBUTE REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CI REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CS REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CS_CI REBUILD;
ALTER INDEX ALL ON CM_DISCOVERY_NETWORK REBUILD;
ALTER INDEX ALL ON S_QOS_DATA REBUILD;
ALTER INDEX ALL ON S_QOS_DEFINITION REBUILD;
ALTER INDEX ALL ON S_QOS_SNAPSHOT REBUILD;
ALTER INDEX ALL ON NAS_TRANSACTION_SUMMARY REBUILD;
ALTER INDEX ALL ON NAS_ALARMS REBUILD;

3. Edit udm_manager.cfg

   a. From the udm_manager probe's Raw Configure GUI, add the key "schema_transact_retries" under the "setup" section and set the value to 5000.
   b. Also change the "schema_connection_timeout_minutes" value to 90.
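After both edits, the setup section of udm_manager.cfg should contain entries along these lines (other existing keys in the section are unaffected):

```
<setup>
   schema_transact_retries = 5000
   schema_connection_timeout_minutes = 90
</setup>
```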

4. Edit discovery_server.cfg

   a. From the discovery_server probe's Raw Configure GUI, select the setup folder, then in the right-hand pane create a new section called nimbusscan
   b. Select the newly created nimbusscan folder from the left-hand pane, then add the following new key value in the right-hand pane:

       nis_cache_update_interval_secs = 3600

This increases the interval at which the NIS cache is updated, giving the discovery scan more time to do its work.

   c. Apply the changes, then click OK/close the window.
   d. Cold start (deactivate, then activate) the discovery_server probe.
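Once applied, the resulting fragment of discovery_server.cfg should look roughly like this:

```
<setup>
   <nimbusscan>
      nis_cache_update_interval_secs = 3600
   </nimbusscan>
</setup>
```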

5. Check udm_manager settings, see setup->datomic section:

    memory_index_max = 516m
    memory_index_threshold = 64m
    object_cache_max = 256m
    heartbeat_interval_msec = 30000
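These Datomic sizes interact with the Java heap from step 1: udm_manager will refuse to connect if object_cache_max plus memory_index_max exceeds 75% of JVM RAM (see the error in Additional Information below). A minimal sketch of that arithmetic, using the example values from this article:

```python
def parse_mb(value: str) -> int:
    """Parse a size such as '516m' or '-Xmx6144m' into megabytes."""
    digits = "".join(ch for ch in value if ch.isdigit())
    return int(digits)

# Example values from this article; substitute your own settings.
java_mem_max = parse_mb("-Xmx6144m")      # step 1: JVM heap maximum
memory_index_max = parse_mb("516m")       # step 5: datomic settings
object_cache_max = parse_mb("256m")

# Datomic rule: (objectCacheMax + memoryIndexMax) must stay within 75% of heap.
datomic_total = memory_index_max + object_cache_max
limit = 0.75 * java_mem_max

print(f"datomic total = {datomic_total}m, 75% of heap = {limit:.0f}m")
print("OK" if datomic_total <= limit else "NOT ENOUGH MEMORY")
```

If the check fails, either raise the probe's Java memory (step 1) or lower the Datomic sizes.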

Note that whenever the vmware probe (or another storage probe) is restarted, larger graphs will reach the queue and the queue can fall behind once again; it should only fall behind for a while and then catch up.

6.  Configure partial graph discovery

Configure all other storage probes (besides vmware, where this is already implemented in the GA version), e.g., the hp_3par and ibmvm probes, to publish partial graphs every monitoring interval by making the following configuration change to each of them.

a. From the probe's Raw Configure GUI, select the setup folder from the left-hand pane, then add the following key value in the right-hand pane:

     discovery_server_version = 9.02

NOTE: The version does not have to match the version of the discovery_server installed; it just has to be a value of 8.2 or higher.

b. Apply the change
c. Cold start (Deactivate, then Activate) the probe.
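After the change, the probe's setup section should contain the new key, roughly:

```
<setup>
   discovery_server_version = 9.02
</setup>
```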

If the probeDiscovery queue has fallen behind, it can be emptied, or disabled for a while and then re-enabled, to allow it to catch up.

Note that any time the vmware or other busy storage probes are restarted, full graph objects will make it to the queue and the queue may then fall behind once again.

7. Increase hub postroute_reply_timeout

On the Primary hub or child robot where discovery_server has been deployed, increase the postroute_reply_timeout, for example to 300.

This value is in seconds and determines how long the hub waits for a reply from the remote hub after sending a bulk of messages on a queue before deciding the bulk did not go through and re-sending it. The default is 180.
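Assuming the standard hub.cfg layout, the key belongs in the hub section, for example:

```
<hub>
   postroute_reply_timeout = 300
</hub>
```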

Additional Information

If the udm_inventory queue turns yellow as per the hub GUI Status view and is not processing any data, or the probe is throwing a memory error in the log, for example:

DEBUG [main, udm_manager] Calling Peer.createDatabase to establish Datomic connection. 
DEBUG [main, udm_manager] Exception establishing Datomic connection: :db.error/not-enough-memory (datomic.objectCacheMax + datomic.memoryIndexMax) exceeds 75% of JVM RAM

 

Try the following:

 

1. Deactivate udm_manager

2. Right-click and delete the udm_manager probe

3. Delete the udm_manager folder on the Primary hub file system

4. Redeploy the udm_manager probe from the local archive on the Primary

5. Delete and recreate the probeDiscovery queue

6. Cold start the discovery_server (Deactivate, then Activate)

Additionally, we have seen scenarios like this where discovery queues may back up after restarting services. As long as the queues are connected (green), the messages should eventually get processed.

Problems usually occur when the queue status is yellow. You can use Dr.Nimbus to view the messages in the queue, though you might need to restart the Nimsoft services again. As long as the queues are in contact, they may grow at times, but the messages should gradually be processed and clear out. In most cases, all it needs is some time.

Also check for slow IO/disk via Task Manager/Resource monitor.

Disks should be SSDs, and the Disk Queue Length should remain below 1; please check both.

 

NOTE:

If everything listed above in this KB Article has been addressed, but the discovery_server itself still seems to be having problems and not working consistently (e.g., performance/scalability issues), and/or it is also consuming large amounts of memory and increasing the memory available to the probe has not improved performance, then it may be worth offloading the discovery_server to a child robot of the Primary:

How to offload the discovery_server to a child robot of the Primary Hub
https://knowledge.broadcom.com/external/article/135036/