data_engine queuing millions of messages and DB growing fast post-upgrade to 20.3

Products

DX Unified Infrastructure Management (Nimsoft / UIM) Unified Infrastructure Management for Mainframe CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

After upgrading to UIM from 20.3.2 to 20.3.3, customer reports a strange behaviour. He had to extend the database from 750GB to 1TB in one week. The data_engine queue seems to be queuing a lot of data and it is unable to insert it into the DB. The queues are backed up.

Parameters had been set to:

Delete raw data older than: 60
Delete historic data older than: 60
Delete daily average data older than: 90

And customer changed to:
Delete raw data older than: 25
Delete historic data older than: 25
Delete daily average data older than: 30

But the issue persists.

Today in one single day the database had already used 24GB and keeps growing.

Customer is unable to keep extending the disk space.

Environment

Release: 20.3
Component: UIM - DATA_ENGINE
data_engine

Cause

multiple causes

Resolution

Steps taken:

Set hub_bulk_size in data_engine to 1750 (from 2000).
Set thread_count_insert to 24 from 5. There are 24 virtual processors on the Primary hub.
Cold started data_engine (Deactivate-Activate)
data_engine cleared in about 20 minutes. ~43 million messages.

UIM Database-> 1614 tables. MOST of the tables have data from 3/5/2021 up until today and a few back to 3/6/2021.

Set raw and historical retention from its original of 60 and 60 to 25 and 25 (customer-set), but then I reset it to 35, 180 respectively for raw and historical.

Ran select top (1) sampletime from RN_QOS_DATA_0001 and other RN tables to gauge the oldest data which was only 30/31 days so it was safe to set the data_engine retention values lower without ending up with long running hung jobs.

SQL Server had a hung job which finally finished. This was the job.

delete from ssrv2audittrail where id in ( select top(@P0)  st.id  from ssrv2audittrail st where st.timestamp < @P1 order by st.timestamp asc )                

Some data_engine log entries showed LONG INSERT times for QOS_INTERFACE_* data, e.g., 3331177 = 55 minutes to insert that data for QOS_INTERFACE_BYTESOUT and other QOS_INTERFACE_* tables.

Note that via the data_engine GUI, you can double-click on a given QOS object so specific QOS can be pruned at a lower, more conservative setting, e.g., 5 days for raw, 15 for historical to keep the size of the tables down. Most of the Interface tables were very large (21 Tables). Tables may be highly fragmented.

During the time period in which the data_engine was highly queued up, the udm_manager probe turned red

udm_manager probe was constantly restarting, unable to connect so we adjusted the udm_manager.cfg settings (see below) and reinstalled it as well and it recovered.

    memory_index_max = 516m
    memory_index_threshold = 64m
    object_cache_max = 256m
    heartbeat_interval_msec = 30000

Truncated super large MCS audit tables:

TRUNCATE TABLE SSRV2AuditTrailModification;
ALTER TABLE SSRV2AuditTrailModification DROP CONSTRAINT FK_SSRV2AuditTrailModification_SSRV2AuditTrail;
TRUNCATE TABLE SSRV2AuditTrail;
ALTER TABLE SSRV2AuditTrailModification WITH NOCHECK ADD  CONSTRAINT FK_SSRV2AuditTrailModification_SSRV2AuditTrail FOREIGN KEY(auditrecord)
REFERENCES SSRV2AuditTrail(id);
ALTER TABLE SSRV2AuditTrailModification NOCHECK CONSTRAINT FK_SSRV2AuditTrailModification_SSRV2AuditTrail;

Other recommendations provided:

64 GB RAM is dedicated to SQL Server but usage is 57 out of 64GB so recommended setting it to 85 GB RAM to be dedicated to the UIM DB.
Advised to keep track of the size (number of rows) of the SSRV2AuditTrail and SSRV2AuditTrailModification tables to make sure they are now being pruned every 30 minutes.
Keep track of the UIM database .MDF file growth periodically, and if it's still growing too fast or growing too large unexpectedly, ask DBA to check the file to see if they can determine the source of the file growth and open a support case.