How does data_engine maintenance and retention work?
Article ID: 33704

Products

- DX Unified Infrastructure Management (Nimsoft / UIM)
- CA Unified Infrastructure Management SaaS (Nimsoft / UIM)
- Unified Infrastructure Management for Mainframe

Issue/Introduction

"Delete historic data older than" should be set to a higher number than raw data, and the way it works is that after the number of days set for Raw Data are reached, it rolls it into historic tables. In other words, if you have Raw Data set to 30 days, and Historic Data set to 60 days, then from day 1-30 you will have Raw Data and on the 31st day, the oldest raw data gets put into day one of historic data.

This article assumes that you have the option "Compress data before delete" enabled, which is the default setting. With this option set, data_engine retention is broken down into three categories: raw, hourly, and daily. Infrastructure Manager can only change the raw and hourly settings; Admin Console or Raw Configure must be used to update the daily retention setting.

Environment

- UIM (any version)

Resolution

[Screenshot: data_engine configuration UI, for reference]

Data maintenance is a two-step process:

1. For raw data, we look at the retention time, and for data that is older than the retention days we run a summarization query that stores hourly summaries in the corresponding hourly (HN) table. Something similar is done to store daily (DN) summaries from the HN table. The process runs on the schedule defined by the customer, and it works the same way whether or not the table is partitioned. The retention days determine how old, counted back from midnight this morning, the rows in the RN or HN table must be in order to be considered for summarization. The summarization query is very expensive, but there is no other way to do it. If the process is running smoothly, the query only summarizes one day of data for each QoS. If the process has stalled or failed for any reason, the query becomes significantly more expensive because more days need to be summarized. Database transactions exacerbate this, but that is how the backend RDBMS works. Partitioning can have a VERY positive effect on this summarization query, since summarization and insertion occur against different partitions and there is no contention.

2. We again look at the retention time and determine which data, already summarized in step 1, needs to be deleted. DN data is not summarized any further; it is simply aged out based on the number of days since midnight this morning. This is where the process differs between partitioned and un-partitioned tables:

- For partitioned tables, we simply DROP the partitions for old data on each table. Each partition holds data for 1 day. Dropping a partition is a very fast and cheap operation. It is almost instantaneous. This is why we overwhelmingly recommend partitioning databases. For UIM v8.4, we changed the default installation to enable partitioning for the database if it is supported by the backend database.

- For un-partitioned tables, the purging/pruning process is very expensive. Pruning is driven by a temporal DELETE statement of the form: delete from <table> where sampletime < <retention_time> (see the sketch after this list). A DELETE of this kind is one of the most expensive queries you can run on a database; it has to scan the index for the affected range and then re-arrange data, indexes, storage, and so on.

- It is more expensive than the summarization queries in step 1.
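
To make the two steps more concrete, here is a minimal SQL Server sketch of what the maintenance cycle conceptually does. The table and column names (RN_QOS_DATA_0001, HN_QOS_DATA_0001, table_id, sampletime, samplevalue) are simplified stand-ins, and the queries data_engine actually generates are more involved; this only illustrates the summarize-then-prune pattern and is not meant to be run against a production UIM database.

-- Conceptual sketch only: simplified names, not the real data_engine queries.

-- Step 1 (summarization): roll raw samples older than the raw retention
-- window up into hourly averages, one row per metric per hour.
-- (The real query anchors its 1-hour buckets to the newest sample, as
-- described later in this article.)
INSERT INTO HN_QOS_DATA_0001 (table_id, sampletime, samplevalue)
SELECT table_id,
       MAX(sampletime)  AS sampletime,   -- bucket is stamped with its newest sample
       AVG(samplevalue) AS samplevalue
FROM   RN_QOS_DATA_0001
WHERE  sampletime < DATEADD(DAY, -35, CAST(GETDATE() AS date))   -- raw retention, e.g. 35 days
GROUP  BY table_id, DATEDIFF(HOUR, 0, sampletime);               -- one group per hour

-- Step 2, partitioned tables: the summarized day is removed by dropping
-- (or switching out) that day's partition, which is nearly instantaneous.

-- Step 2, un-partitioned tables: the summarized rows are removed with a
-- temporal DELETE, which is far more expensive.
DELETE FROM RN_QOS_DATA_0001
WHERE  sampletime < DATEADD(DAY, -35, CAST(GETDATE() AS date));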

In terms of how the aggregation itself works:

To begin with, we need to define a couple of things regarding the interface.  The interface describes the raw and historic options as "delete data older than", however it is better to imagine the options as "summarize data older than".  Additionally, the interface describes the second option as historic, but it's better to remember it as "hourly."

With that in mind, if we set raw data to 31 days, and remember it as "summarize raw data older than 31 days", we see that data older than 31 days will be aggregated into hourly averages.  The next option will be read as  "summarize hourly data older than 180 days".  This means that we take the hourly data and roll it up into daily averages. 

By default, data_engine tends to be rather aggressive with the amount of data that it tries to retain.  Bear in mind that the retention values are additive.  With the default values, we store 35 days of raw samples, 180 days of hourly samples and 744 days of daily samples for a total of 959 days.  It is not uncommon for an installation to have 500,000 metrics or more.

In an environment monitoring 500,000 metrics, tuning the data retention settings down from these defaults can mean the difference between storing 35,000,000,000 records and 7,000,000,000. This has a direct impact on reporting and maintenance performance.
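
As a rough sanity check of your own settings, the stored row count can be estimated as metrics x raw samples per day x raw days, plus metrics x 24 x hourly days, plus metrics x daily days. A T-SQL sketch of that arithmetic follows; the variable names are ours and the sample interval is an assumption, not a data_engine setting:

-- Back-of-the-envelope row-count estimate; purely illustrative.
DECLARE @metrics         bigint = 500000;   -- number of active QoS metrics
DECLARE @samples_per_day bigint = 288;      -- assumes one raw sample every 5 minutes
DECLARE @raw_days        int    = 35;
DECLARE @hourly_days     int    = 180;
DECLARE @daily_days      int    = 744;

SELECT @metrics * @samples_per_day * @raw_days        AS raw_rows,
       @metrics * 24 * @hourly_days                   AS hourly_rows,
       @metrics * @daily_days                         AS daily_rows,
       @metrics * (@samples_per_day * @raw_days
                   + 24 * @hourly_days + @daily_days) AS total_rows;

With the default retention values and a 5-minute sample interval, this works out to roughly 7.5 billion rows, with the raw samples dominating the total; shorter sample intervals or longer raw retention push the total into the tens of billions, which is why trimming the raw retention window has by far the biggest effect.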

Additional Information

Some additional details on the "sampletime" column for each table type are included below.

The RN tables contain the actual samples, as collected and submitted by the monitoring probes; depending on the configuration, metrics are typically sent at intervals of 1, 5, or 15 minutes.

The exact time at which the sample is submitted is dependent on the exact time the probe/robot was last started or restarted; so that if a robot was started at 8:03, and configured for 5-minute intervals, we might see data points collected at 8:08, 8:13, 8:18, etc.  Another robot might be submitting the same metric, but was started a minute later, so the samples for that robot might be at 8:09, 8:14, 8:19 and so on.

During the maintenance/compression process detailed above, the HN tables are populated with aggregated values that go hour-by-hour; in other words, the data_engine looks at the data in the RN tables and divides it into "buckets" containing 1 hour of data each; the average value for each hour-long bucket is then calculated and stored in the HN table.

The timestamp given to this aggregated value is determined based on the maximum sampletime -- in other words, the most recently collected sample in the RN table.  The data_engine uses the sampletime of the most recent sample to determine from which point to "work backwards" creating the 1-hour buckets,  so the sampletime recorded in the HN table is based on the most recent sample available (and then each hourly bucket is calculated by subtracting an hour from that sampletime.)

So let's suppose we have values in the RN table with sampletimes like this, and that June 15 represents the last day of "raw data" which is about to be compressed into aggregated/hourly data:

...
6/15/22  11:53:03 pm
6/15/22  11:58:03 pm
6/16/22   12:03:03 am
...

When it is time for data_engine to aggregate the values from 6/15/22 from RN into HN, it will create buckets like this, working in hourly increments backward from the most recent sample:

6/15/22  10:58:03pm -  11:58:03 pm
6/15/22    9:58:03pm -  10:58:03 pm
6/15/22    8:58:03pm -    9:58:03 pm
and so on.

Then it will calculate the average value of each bucket and record it in the HN table with a timestamp matching the latest sampletime in that bucket.

The DN tables work the same way, but the data within the HN tables is averaged by day, and the daily average value is then stored in the DN_QOS_DATA tables based on the same methodology as described above.
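
A minimal T-SQL sketch of this backward-bucketing idea is shown below. The table and column names (RN_QOS_DATA_0001, sampletime, samplevalue) are simplified stand-ins and this is not the aggregation query data_engine actually runs; it only illustrates how anchoring 1-hour buckets to the most recent sampletime produces timestamps like those in the example above.

-- Illustrative only: simplified names, not the real data_engine aggregation.
-- Work backwards from the newest raw sample in whole 1-hour buckets.
DECLARE @max_time datetime =
        (SELECT MAX(sampletime)
         FROM   RN_QOS_DATA_0001
         WHERE  sampletime < '20220616');            -- e.g. 6/15/22 11:58:03 pm

SELECT DATEADD(HOUR, -b.hours_back, @max_time) AS bucket_sampletime,  -- stamped with the bucket's newest time
       AVG(b.samplevalue)                      AS avg_value
FROM  (SELECT samplevalue,
              DATEDIFF(SECOND, sampletime, @max_time) / 3600 AS hours_back
       FROM   RN_QOS_DATA_0001
       WHERE  sampletime <= @max_time) AS b
GROUP BY b.hours_back
ORDER BY bucket_sampletime DESC;

Run against the example data above, the newest bucket would be stamped 6/15/22 11:58:03 pm and cover the preceding hour, the next one 10:58:03 pm, and so on.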

Large Environments

In large installations, this data purging may not finish in a 24-hour period. This results in data maintenance not finishing, which causes the summarization in step 1 to get progressively more expensive. You quickly get into a hung cycle, which can only be resolved by partitioning the DB and potentially manually deleting old data from the RN tables. The DN and HN tables are generally pretty small, so they are not an issue.

If, on top of that, you have inadequate resources, poorly configured databases, storage shared with other applications, or not enough memory dedicated to SQL Server, things quickly get out of hand and degrade badly in large environments. Partitioning a large database (e.g., > 1 TB) with a significant number of large tables (100+ million rows) on an already overloaded/struggling DB server is like doubling the resource requirements until partitioning is completed.
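
On a SQL Server backend, a quick read-only catalog query can confirm whether you are in this situation. This is a general-purpose check rather than a data_engine tool, and it assumes the standard RN_QOS_DATA_nnnn table naming:

-- Check size and partition status of the RN tables (read-only).
SELECT t.name                              AS table_name,
       COUNT(DISTINCT p.partition_number)  AS partition_count,   -- 1 = not partitioned
       SUM(p.rows)                         AS row_count
FROM   sys.tables t
JOIN   sys.partitions p
       ON p.object_id = t.object_id
      AND p.index_id IN (0, 1)             -- heap or clustered index only
WHERE  t.name LIKE 'RN_QOS_DATA%'
GROUP  BY t.name
ORDER  BY SUM(p.rows) DESC;

A partition_count of 1 on RN tables with row counts in the hundreds of millions, combined with maintenance runs that never finish, is the pattern described above and the point at which partitioning (and possibly manual cleanup of old RN data) becomes necessary.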

You need to have enough space available on the primary filegroup to enable partitioning.