Performance Manager Event raised for GroupsETLJob Batch process job exceeding 1 minute interval

Products

CA Infrastructure Management, CA Performance Management - Usage and Administration, DX NetOps

Issue/Introduction

Administrative Events appear in Performance Management indicating that the GroupsETLJob Batch process job is exceeding its allotted time frame to run.

Along with these time frame Events, there may also be Events indicating that the GroupsETLJob Batch process job failed.

The time frame exceeded Event Description states:

Batch process job GroupsETLJob required more than 80% of the currently configured execution interval of 1 minutes.

The job failed Event Description states:

Batch process job GroupsETLJob failed.

There are no errors in the Performance Center or Data Aggregator logs that relate to the Events raised.

ETL related Self Monitoring metrics show fluctuating ETL processing times at the same time the Events are raised. These processing times would normally be steadier and more consistent. Why are they so inconsistent?

Environment

All supported Performance Management releases

Cause

Over-utilized Data Repository server memory resources.

Resolution

The inconsistent ETL metric times, combined with the absence of log errors that align with the problem Events, point to a performance or resource related problem.

The product uses an ETL pool that is allowed 5% of the total allocated memory to complete the job. Natural growth of the Performance Management environment may have made this too little for the environment. As a result, the job consumes all 5% of the allowed memory and still needs more to do its work. Without enough memory, its queries cannot run any faster.

While it is possible to increase the memory for the pool, doing so is not the recommended way to resolve this. It only takes resources away from something else, resulting in other problems.

For the 80% of execution time Events, the good news is that while the job is approaching its limit, it is still completing within the 1 minute limit without failure. The product raises the Event as a warning that the job has used more than 80% of the interval. It does indicate the job could go over 100%, which then triggers a failure.
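The arithmetic behind these Events can be sketched as follows. This is only an illustration of the 80% and 100% limits described above, not the product's own logic; the function names and the exact comparison rules are assumptions made for the example.

```python
# Illustrative sketch only: the names and comparison rules here are
# hypothetical and are not taken from the product.

def warning_threshold_seconds(interval_minutes, warn_percent=80):
    """Seconds a run may take before it exceeds warn_percent of the interval."""
    return interval_minutes * 60 * warn_percent / 100


def classify_run(duration_seconds, interval_minutes, warn_percent=80):
    """Label a run as 'ok', 'warning' (over 80%), or 'over_interval' (over 100%)."""
    interval_seconds = interval_minutes * 60
    if duration_seconds > interval_seconds:
        return "over_interval"
    if duration_seconds > warning_threshold_seconds(interval_minutes, warn_percent):
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Compare the same 50 second run against 1, 3, and 5 minute intervals.
    for minutes in (1, 3, 5):
        limit = warning_threshold_seconds(minutes)
        print(minutes, "minute interval: warning above", limit, "seconds;",
              "a 50 second run is", classify_run(50, minutes))
```

With a 1 minute interval the warning point is 48 seconds, so a run that takes 50 seconds raises the warning Event, while the same run sits well inside a 3 minute interval.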

For the failure related Events, the job should run again successfully on a later run. As long as the failure Events are not showing up frequently, nothing is being lost.

In some situations, running this group ETL every 1 minute can be too frequent. In these situations we suggest backing off to an interval where the Events no longer appear or appear far less frequently.

What is the possible impact of increasing the interval beyond 1 minute? Group based Scorecard report Views may not show new Group changes synchronized for use in under 5 minutes.

We recommend starting with a change from 1 minute to 3 minutes as a first step. After setting it, review incoming Events. Does it alleviate most of the 80% of execution time Events? Does it resolve both those and the failure Events?

If not, and the Events are less frequent but still too noisy, raise the interval from 3 to 5 minutes.

How do we change the interval? Follow these instructions, which use a support lab as an example.

1. In a REST client run a GET against the URL:

http://DA:8581/rest/batch/groups/config

2. Output should look like:

</GroupsBatchConfiguration>

</GroupsBatchConfigurationList>

3. Set the REST client to use a PUT.

4. Set the URL to include the ID from the GET request. Sample from above info would be: http://DA:8581/rest/batch/groups/config/468

5. In the BODY of the PUT request enter the following to change Interval value from 1 to 3:

</GroupsBatchConfiguration>

6. Hit Send. If a 200 success message is received, run the GET request again to confirm the change.
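For those who prefer a script to a REST client, the GET and PUT steps above could be sketched with the Python standard library as shown here. This is a minimal sketch under stated assumptions: the function names are hypothetical, the `Content-Type: application/xml` header is an assumption, no authentication is shown, and the XML body is deliberately not included. Take the body from your own GET output and edit the interval value as described in step 5. The host `DA`, port 8581, and ID 468 come from the support lab example.

```python
import urllib.request

# URL from the steps above; replace DA with your Data Aggregator host.
BASE_URL = "http://DA:8581/rest/batch/groups/config"


def build_get_request(base_url=BASE_URL):
    """Build the GET request that returns the current configuration (step 1)."""
    return urllib.request.Request(base_url, method="GET")


def build_put_request(config_id, xml_body, base_url=BASE_URL):
    """Build the PUT request for one configuration ID (steps 3 to 5).

    xml_body must be supplied by the caller, based on the GET output.
    The Content-Type header is an assumption, not confirmed by the article.
    """
    return urllib.request.Request(
        f"{base_url}/{config_id}",
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="PUT",
    )


def send(request, timeout=30):
    """Send a request and return (HTTP status, response text) (step 6)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


# Example usage (not executed here; requires a reachable Data Aggregator):
#   status, current_xml = send(build_get_request())
#   status, _ = send(build_put_request(468, edited_xml))
#   status, confirmed_xml = send(build_get_request())
```

Building the requests separately from sending them makes it possible to inspect the URL and method before anything is changed on the server.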