
data engine queued up and not processing QOS messages

Article ID: 237150

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)
Unified Infrastructure Management for Mainframe

Issue/Introduction

The data engine is queued and not processing, so we are unable to see the latest data/metrics in the Operator Console (OC). The data_engine queue is very deep, with millions of queued messages.

Cause

- Connection to the hub/data_engine ATTACH queue was lost

Environment

Release : 20.3

Component : UIM - DATA_ENGINE

- data_engine v20.31

Resolution

In the data_engine we had already lowered hub_bulk_size from 2000 to 1750. We decreased it only slightly because a value as high as 2000 can sometimes cause performance or connection inconsistencies between the data_engine and the hub.
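In raw configure terms, that change corresponds to an entry like the following in data_engine.cfg (the section layout here is illustrative; make the change through Admin Console/Infrastructure Manager raw configure rather than by editing the file by hand):

```
<data_engine>
   hub_bulk_size = 1750
</data_engine>
```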

A restart and even a cold start of the data_engine had no effect, and the key error persisted:

   de: qos_check - subscriber attached to queue: communication error (bulk size=2000) (queue: data_engine)

The data_engine could connect to the backend database without any problem (a connection test confirmed this), so we turned to the hub.log, which led us to change the hub settings as well.

   postroute_reply_timeout increased from 180 to 300
   hub_request_timeout increased from 120 to 240
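In hub.cfg terms, the two changes correspond to entries like these in the hub section (layout is illustrative; apply them via the hub probe's raw configure and restart the hub probe afterwards):

```
<hub>
   postroute_reply_timeout = 300
   hub_request_timeout = 240
</hub>
```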

This ultimately resolved the issue.

thread_count_insert was set to only 2; we changed it to 4 rather than 24 (24 is normally optimal regardless of the actual number of cores) because we wanted to let the backlog finish processing first. We considered trying 24, but the queue was enormous owing to the data_engine's trouble connecting/subscribing to the hub queue, so we left it at 4, and it still cleared a huge number of messages in record time.
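That change corresponds to a data_engine.cfg entry along these lines (again illustrative; set it through raw configure):

```
<data_engine>
   thread_count_insert = 4
</data_engine>
```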

The Primary hub had 2 cores and 16 virtual processors (virtual server) at 3 GHz processor speed. After we made the changes to hub.cfg and to hub_bulk_size and thread_count_insert in data_engine.cfg, it was processing 1.22M or more messages per minute, which is excellent throughput.

Root cause -> the data_engine could not maintain its connection to the hub and consequently lost the connection to its ATTACH queue.

We changed the hub settings because two of them were set too low (at their defaults), which most likely caused the data_engine queue to stop connecting. Another possible factor is that the number of hub subscribers was too close to the limit on Windows, which is 64. The hub.cfg can be adjusted to send an alarm when the number of subscribers reaches a certain value; we recommend setting this up for the customer (e.g., at around 50) and also sending yourself an email via a nas AO profile if and when it occurs.

See -> UIM hub subscriber limits and how to monitor the count of subscribers
https://knowledge.broadcom.com/external/article/33649/

For a detailed explanation of the hub settings and what they do, see this KB article:

hub configuration - timeout, retry and other settings (explained)
https://knowledge.broadcom.com/external/article/97954

As discussed, we do not recommend leaving the data_engine 'Index Maintenance' option enabled. In large environments, certain tables grow so large that the jobs to defragment their indexes often take a very long time, or simply fail, and consume a lot of DB resources in the process, which affects overall database performance. Instead, you should set up the daily job recommended below to defragment a select set of tables; this helps both OC interface performance and overall DB performance.

These are the key tables with indexes that require defragmentation (DAILY)

Ask your DBA to set up a job to run this DAILY table index defrag job (during non-business hours if possible).

ALTER INDEX ALL ON CM_COMPUTER_SYSTEM REBUILD; 
ALTER INDEX ALL ON CM_DEVICE REBUILD; 
ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ATTR REBUILD;
ALTER INDEX ALL ON CM_DEVICE_ATTRIBUTE REBUILD; 
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC REBUILD; 
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_DEFINITION REBUILD; 
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC_DEFINITION REBUILD; 
ALTER INDEX ALL ON CM_NIMBUS_ROBOT REBUILD;
ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ORIGIN REBUILD; 
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_ATTRIBUTE REBUILD; 
ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CI REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CS REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CS_CI REBUILD;
ALTER INDEX ALL ON CM_DISCOVERY_NETWORK REBUILD; 
ALTER INDEX ALL ON S_QOS_DATA REBUILD; 
ALTER INDEX ALL ON NAS_TRANSACTION_SUMMARY REBUILD; 
ALTER INDEX ALL ON NAS_ALARMS REBUILD;
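As a sanity check before or after scheduling the job, the DBA can measure actual fragmentation with a standard SQL Server DMV query. A sketch (the 30% threshold is a common rule of thumb, not a UIM requirement):

```
-- Sketch: list indexes in the UIM database with >30% fragmentation.
SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN   sys.indexes AS i
       ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE  ips.avg_fragmentation_in_percent > 30
ORDER  BY ips.avg_fragmentation_in_percent DESC;
```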

The above statements are for SQL Server. For an Oracle database, we believe the equivalent statement format is the following, but confirm with your Oracle DBA.

   ALTER INDEX <index_name> REBUILD;
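Since Oracle rebuilds indexes one at a time, one possible approach (a sketch only; the table list is taken from the SQL Server job above, and the whole approach should be validated by the Oracle DBA) is to generate the statements from the data dictionary when connected as the UIM schema owner:

```
-- Sketch: generate one REBUILD statement per index on selected key tables.
SELECT 'ALTER INDEX ' || index_name || ' REBUILD;' AS rebuild_stmt
FROM   user_indexes
WHERE  table_name IN ('S_QOS_DATA', 'NAS_ALARMS', 'CM_COMPUTER_SYSTEM');
```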

Aside from the above, in almost all large environments, partitioning is sufficient to optimize the UIM backend database. The DBA should also check for memory pressure every few months and increase the memory dedicated to the UIM DB instance accordingly.
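One quick way for the DBA to spot-check memory pressure on a SQL Server backend is the sys.dm_os_sys_memory DMV; a sketch:

```
-- Sketch: quick memory-pressure check on SQL Server.
-- A system_memory_state_desc other than 'Available physical memory is high'
-- may warrant a closer look at the instance's memory allocation.
SELECT total_physical_memory_kb / 1024     AS total_mb,
       available_physical_memory_kb / 1024 AS available_mb,
       system_memory_state_desc
FROM   sys.dm_os_sys_memory;
```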