Summary
Fine-tuning of the data_engine, a discussion of DB best practices, improved data_engine throughput, performance optimization, and more.
hub
Changed the hub postroute_reply_timeout from the default of 180 to 300.
This value, in seconds, determines how long the hub waits for a reply from the remote hub after sending a bulk of messages on a queue before deciding the bulk did not go through and resending it.
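As a sketch, the change lands in the <hub> section of hub.cfg (the key is normally set via Raw Configure; exact placement may vary by hub version):

```
<hub>
   postroute_reply_timeout = 300
</hub>
```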
data_engine
Checked and improved data_engine throughput
Changed hub_bulk_size from the existing setting of 1000 to 1750.
Waited up to 20 minutes; after repeated inspection during the session, we saw that average throughput increased by more than 20,000 messages, from approximately 41k up to roughly 65k.
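The observed improvement can be sanity-checked with quick arithmetic (the 41k and 65k figures are session averages read from the data_engine GUI):

```python
# Approximate message throughput observed before and after raising
# hub_bulk_size from 1000 to 1750 (figures from this session).
before = 41_000
after = 65_000

gain = after - before
pct = gain / before * 100

print(f"gain: {gain} messages (+{pct:.0f}%)")  # gain: 24000 messages (+59%)
```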
Checked thread_count
Set the data_engine thread_count to 24. The value of 24 was found to be optimal in lab testing; any higher value hit the law of diminishing returns. You can check the throughput of QOS messages via the data_engine GUI before and after changing the setting, waiting at least 15 minutes to see whether the number of messages processed per minute increases.
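The two throughput-related keys above are set via Raw Configure in the <data_engine> section of data_engine.cfg; a sketch of the resulting fragment (placement assumed per standard raw configuration layout):

```
<data_engine>
   hub_bulk_size = 1750
   thread_count = 24
</data_engine>
```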
- Verified that data retention settings were being honored.
- Note that partitioning is supported when enabled via the Admin Console, and only for Microsoft SQL Server Enterprise edition, not Standard edition.
Set data_engine data_management_timeout to 65536.
This raw configuration parameter controls the time limit (in seconds) for data management jobs. The default is 7200 (2 hours); 65536 is roughly 18.2 hours. Setting it to 65536 ensures that maintenance tasks such as indexing complete successfully without timing out, and maintenance jobs should not need to run longer than 18 hours.
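The seconds-to-hours conversion behind the 18.2-hour figure:

```python
# data_management_timeout is expressed in seconds.
default_timeout = 7_200   # shipped default: 2 hours
new_timeout = 65_536      # value set during this session

print(default_timeout / 3600)  # 2.0 hours
print(new_timeout / 3600)      # ~18.2 hours
```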
Identified Large Tables
Checked large tables via the large table query; the largest was approximately 323M rows. There were approx. 50 tables over 10M rows. Tables in the 50M-to-100M+ row range may adversely affect database queries, OC performance, or data_engine maintenance jobs. DBAs can help purge data from large tables that have grown out of control and need to be reduced.
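One common form of a large-table query on SQL Server sums row counts from the partition-stats DMV. The DMV and catalog view names below are standard SQL Server; the exact query used in the session is not recorded in these notes, so this is a representative sketch:

```python
# T-SQL that lists user tables by total row count, largest first.
# sys.dm_db_partition_stats and sys.tables are standard SQL Server objects.
LARGE_TABLE_QUERY = """
SELECT t.name AS table_name,
       SUM(ps.row_count) AS row_count
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables AS t ON t.object_id = ps.object_id
WHERE ps.index_id IN (0, 1)   -- heap or clustered index only, avoids double counting
GROUP BY t.name
ORDER BY row_count DESC;
"""

# Run with any SQL Server client, e.g. (hypothetical cursor object):
#   cursor.execute(LARGE_TABLE_QUERY)
#   for name, rows in cursor.fetchall():
#       print(name, rows)
print("query defined:", "sys.dm_db_partition_stats" in LARGE_TABLE_QUERY)
```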
Clicking the data_engine GUI Status tab opens a view of the amount of raw QOS data being saved for a specific QOS object in a given table, e.g., QOS_PROCESS_MEMORY.
Note that the data_engine settings can be overridden with a more conservative value to save data for a smaller number of days, especially if the monitoring interval is aggressive but necessary.
S_QOS_DATA (a.k.a. the QOS Object 'Index' or Data catalogue/QOS Metadata)
5M rows. There are no known issues related to this size, but it is duly noted; at some point the customer may want to remove any orphaned data, and we have a process to accomplish that. In some customer environments, more than 50M rows may slow down queries related to the Operator Console, Reports, and Dashboards.
Overall health check for data_engine and DATA management/administration completed.
===================================================================