
data engine queued up and not processing QOS messages

Article ID: 237150

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)
Unified Infrastructure Management for Mainframe

Issue/Introduction

The data engine is queued and not processing, so we are unable to see the latest data/metrics in the Operator Console (OC). The data_engine queue is very deep, with millions of queued messages.

Cause

- Connection to the hub/data_engine ATTACH queue was lost

Environment

Release : 20.3

Component : UIM - DATA_ENGINE

- data_engine v20.31

Resolution

In the data_engine we had already lowered hub_bulk_size from 2000 to 1750. We decreased it only slightly because a value as high as 2000 can sometimes cause performance or connection inconsistencies between the data_engine and the hub.
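In raw configure terms, that change corresponds to an entry like the following in data_engine.cfg (the section layout here is illustrative; make the change through Admin Console/Infrastructure Manager raw configure rather than by editing the file by hand):

```
<data_engine>
   hub_bulk_size = 1750
</data_engine>
```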

A restart and even a cold start of the data_engine had no effect, and the key error persisted:

   de: qos_check - subscriber attached to queue: communication error (bulk size=2000) (queue: data_engine)

The data_engine could connect to the backend database without any problem (a connection test confirmed this), so we turned to the hub.log, which led us to change the hub settings as well.

   postroute_reply_timeout increased from 180 to 300
   hub_request_timeout increased from 120 to 240
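In hub.cfg terms, the two changes correspond to entries like these in the hub section (layout is illustrative; apply them via the hub probe's raw configure and restart the hub probe afterwards):

```
<hub>
   postroute_reply_timeout = 300
   hub_request_timeout = 240
</hub>
```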

This ultimately resolved the issue.

thread_count_insert was set to only 2; we changed it to 4 rather than 24 (24 is normally optimal regardless of the actual number of cores) because we wanted to let the backlog finish processing first. We considered trying 24, but the queue was enormous owing to the data_engine's trouble connecting/subscribing to the hub queue, so we left it at 4, and it still cleared a huge number of messages in record time.
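That change corresponds to a data_engine.cfg entry along these lines (again illustrative; set it through raw configure):

```
<data_engine>
   thread_count_insert = 4
</data_engine>
```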

The Primary hub had 2 cores and 16 virtual processors (virtual server) at 3 GHz processor speed. After we made the changes to hub.cfg and to hub_bulk_size and thread_count_insert in data_engine.cfg, it was processing 1.22M or more messages per minute, which is excellent throughput.

Root cause -> the data_engine could not maintain its connection to the hub and consequently lost the connection to its ATTACH queue.

We changed the hub settings because two of them were set too low (at their defaults), which most likely caused the data_engine queue to stop connecting. Another possible factor is that the number of hub subscribers was too close to the limit on Windows, which is 64. The hub.cfg can be adjusted to send an alarm when the number of subscribers reaches a certain value; we recommend setting this up for the customer (e.g., at around 50) and also sending yourself an email via a nas AO profile if and when it occurs.

See -> UIM hub subscriber limits and how to monitor the count of subscribers
https://knowledge.broadcom.com/external/article/33649/

For a detailed explanation of the hub settings and what they do, see this KB article:

hub configuration - timeout, retry and other settings (explained)
https://knowledge.broadcom.com/external/article/97954

As discussed, we do not recommend leaving the data_engine 'Index Maintenance' option enabled. In large environments, certain tables grow so large that the jobs to defragment their indexes often take a very long time, or simply fail, and consume a lot of DB resources in the process, which affects overall database performance. Instead, you should set up the daily job recommended below to defragment a select set of tables; this helps both OC interface performance and overall DB performance.

These are the key tables with indexes that require defragmentation (DAILY)

Ask your DBA to set up a job to run this DAILY table index defrag job (during non-business hours if possible).

ALTER INDEX ALL ON CM_COMPUTER_SYSTEM REBUILD; 
ALTER INDEX ALL ON CM_DEVICE REBUILD; 
ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ATTR REBUILD;
ALTER INDEX ALL ON CM_DEVICE_ATTRIBUTE REBUILD; 
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM REBUILD;
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC REBUILD; 
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_DEFINITION REBUILD; 
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_METRIC_DEFINITION REBUILD; 
ALTER INDEX ALL ON CM_NIMBUS_ROBOT REBUILD;
ALTER INDEX ALL ON CM_COMPUTER_SYSTEM_ORIGIN REBUILD; 
ALTER INDEX ALL ON CM_CONFIGURATION_ITEM_ATTRIBUTE REBUILD; 
ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CI REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CI_CS REBUILD;
ALTER INDEX ALL ON CM_RELATIONSHIP_CS_CI REBUILD;
ALTER INDEX ALL ON CM_DISCOVERY_NETWORK REBUILD; 
ALTER INDEX ALL ON S_QOS_DATA REBUILD; 
ALTER INDEX ALL ON NAS_TRANSACTION_SUMMARY REBUILD; 
ALTER INDEX ALL ON NAS_ALARMS REBUILD;
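As a sanity check before or after scheduling the job, the DBA can measure actual fragmentation with a standard SQL Server DMV query. A sketch (the 30% threshold is a common rule of thumb, not a UIM requirement):

```
-- Sketch: list indexes in the UIM database with >30% fragmentation.
SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN   sys.indexes AS i
       ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE  ips.avg_fragmentation_in_percent > 30
ORDER  BY ips.avg_fragmentation_in_percent DESC;
```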

The above statements are for SQL Server. For an Oracle database, we believe the equivalent statement format is the following, but confirm with your Oracle DBA.

   ALTER INDEX <index_name> REBUILD;
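Since Oracle rebuilds indexes one at a time, one possible approach (a sketch only; the table list is taken from the SQL Server job above, and the whole approach should be validated by the Oracle DBA) is to generate the statements from the data dictionary when connected as the UIM schema owner:

```
-- Sketch: generate one REBUILD statement per index on selected key tables.
SELECT 'ALTER INDEX ' || index_name || ' REBUILD;' AS rebuild_stmt
FROM   user_indexes
WHERE  table_name IN ('S_QOS_DATA', 'NAS_ALARMS', 'CM_COMPUTER_SYSTEM');
```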

Aside from the above, in almost all large environments, partitioning is sufficient to optimize the UIM backend database. The DBA should also check for memory pressure every few months and increase the memory dedicated to the UIM DB instance accordingly.
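One quick way for the DBA to spot-check memory pressure on a SQL Server backend is the sys.dm_os_sys_memory DMV; a sketch:

```
-- Sketch: quick memory-pressure check on SQL Server.
-- A system_memory_state_desc other than 'Available physical memory is high'
-- may warrant a closer look at the instance's memory allocation.
SELECT total_physical_memory_kb / 1024     AS total_mb,
       available_physical_memory_kb / 1024 AS available_mb,
       system_memory_state_desc
FROM   sys.dm_os_sys_memory;
```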