AIOps - ElasticSearch disk Full - How to reduce Elastic data retention?

Article ID: 207161

Products

DX Operational Intelligence, CA App Experience Analytics, DX Application Performance Management

Issue/Introduction

Symptoms:

- No alarms or metrics from integration products
- No tickets being created in ServiceNow
- Cannot access Kubernetes / OpenShift / DX Console / OI Console
- Pods crashing
- NFS filesystem full

 

 

Environment

DX Platform 2x

Cause

The following is a high-level list of techniques and suggestions for reducing Elastic data retention:


A) Check Elastic Stats
B) Change data retention for all tenants
C) Change data retention for a specific tenant
D) Change data retention for specific Elastic indices
E) Disable or reduce Elastic snapshots
F) How to delete specific old indices immediately?

 

Figure: AIOps Data Stores and Flow Interactions

Resolution

A) Check Elastic Stats

NOTE: Replace http(s)://{es_endpoint} with your own Elasticsearch endpoint.

a) Check Elastic indices by size:

http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds

b) Check Elastic health:

http(s)://{es_endpoint}/_cluster/health?pretty&human

Recommendations:

- Increase disk space on the NFS server
- Reduce data retention as described in the following sections
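
The two checks in this section can also be scripted. The following is a minimal Python sketch using only the standard library. The function names and the `ES_ENDPOINT` value are illustrative assumptions, and the sketch presumes that the endpoint is reachable without authentication. It builds the same `_cat/indices` and `_cluster/health` URLs, adding `format=json` to the first one so that the output is easier to process.

```python
import json
import urllib.parse
import urllib.request

# Assumption: replace with your own Elasticsearch endpoint.
ES_ENDPOINT = "http://es.example.com:9200"

# Columns requested from _cat/indices (same columns as in step a).
COLUMNS = "health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds"


def cat_indices_url(endpoint, pattern="", sort="ss:desc"):
    """Build the _cat/indices URL, optionally limited to an index pattern."""
    query = urllib.parse.urlencode(
        {"s": sort, "h": COLUMNS, "format": "json"}, safe=":,"
    )
    return f"{endpoint.rstrip('/')}/_cat/indices/{pattern}?{query}"


def cluster_health_url(endpoint):
    """Build the cluster health URL used in step b."""
    return f"{endpoint.rstrip('/')}/_cluster/health?pretty&human"


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(cat_indices_url(ES_ENDPOINT))` returns one dictionary per index, largest first, and `fetch_json(cluster_health_url(ES_ENDPOINT))["status"]` reports the cluster status (green, yellow, or red).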



B) Change data retention for all tenants

The default retention period is 45 days.

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/digital-operational-intelligence/20-2/configuring/configure-data-retention.html 

"In the OpenShift Web Console, go to the Digital Operational Intelligence project.
Go to Applications, Deployments, doireadserver.
Select Environment to view the environment variables.
Set the value of JARVIS_TENANT_RETENTION_PERIOD as needed.
Click Save."



C) Change data retention to a specific tenant

a) Obtain the tenant_id from Settings > Connector Parameters > Cohort ID

b) Go to Jarvis API onboarding

http(s)://<jarvis-api-endpoint>

c) Change the data retention, for example, reduce it from the default 45 days to 10 days

Execute: PATCH /onboarding/tenants

Body:

{
  "product_id":"ao",
  "retention_period":<# of days>,
  "tenant_id":"<tenant_id>"
}

Click Execute, expected Code Result = 204 

Verify the change, execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')

product_id = ao

Enter the tenant id

Click Execute, expected Code Result = 200
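
If you prefer to script the change instead of using the onboarding page, the request can be sketched in Python as shown below. This is a minimal sketch, not a definitive client: the function names and `JARVIS_ENDPOINT` are illustrative, and the `Content-Type` header and unauthenticated access are assumptions that may not match your environment. Only the path, the body fields, and the GET address come from the steps above.

```python
import json
import urllib.request

# Assumption: replace with your own Jarvis API endpoint.
JARVIS_ENDPOINT = "http://jarvis-api.example.com"


def tenant_retention_body(tenant_id, retention_days, product_id="ao"):
    """Build the PATCH body that sets the tenant-level retention period."""
    if not isinstance(retention_days, int) or retention_days < 1:
        raise ValueError("retention_days must be a positive integer")
    return {
        "product_id": product_id,
        "retention_period": retention_days,
        "tenant_id": tenant_id,
    }


def patch_tenant_request(endpoint, body):
    """Prepare (but do not send) the PATCH /onboarding/tenants request."""
    return urllib.request.Request(
        f"{endpoint.rstrip('/')}/onboarding/tenants",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        # Assumption: the API expects a JSON content type.
        headers={"Content-Type": "application/json"},
    )


def tenant_url(endpoint, tenant_id, product_id="ao"):
    """Build the GET address used to verify the change."""
    return (
        f"{endpoint.rstrip('/')}/onboarding/tenants"
        f"(product_id='{product_id}',tenant_id='{tenant_id}')"
    )
```

To apply the change, pass the prepared request to `urllib.request.urlopen` and check that the response status is 204, then request `tenant_url(...)` and confirm that the returned retention_period shows the new value.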

 

D) Change data retention for specific Elastic indices, for example: metrics_uim to 2 days and metrics_anomaly to 15 days

1) Identify which integrations or features are causing the high ingestion of data (UIM, spectrum, capm, caapm, log, anomalies, etc) 

To list all indices by creation date:

http(s)://{es_endpoint}/_cat/indices/?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds

To list all indices by size:

http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds

You can narrow your search by filtering only specific indices, for example to list UIM indices only:

http(s)://{es_endpoint}/_cat/indices/*uim*?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds

To list anomaly indices:

http(s)://{es_endpoint}/_cat/indices/*anomaly*?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
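
To see at a glance which integration is consuming the most disk, the index sizes can be totalled per index family. The following Python sketch is one possible way to do that. It assumes that the rows come from `_cat/indices?format=json&bytes=b&h=index,store.size` and that index names end in `_<doc_type_version>_<sequence>`, as in ao_itoa_metrics_uim_1_8. Both the helper names and that naming pattern are assumptions based on the examples in this article.

```python
import re
from collections import defaultdict

# Assumption: index names end in _<doc_type_version>_<sequence>,
# for example ao_itoa_metrics_uim_1_8. Other names are left unchanged.
_SUFFIX = re.compile(r"_\d+_\d+$")


def index_family(index_name):
    """Strip the trailing version and sequence numbers from an index name."""
    return _SUFFIX.sub("", index_name)


def size_by_family(rows):
    """Sum store.size (in bytes) per index family, largest family first.

    rows is the decoded JSON list returned by
    _cat/indices?format=json&bytes=b&h=index,store.size
    """
    totals = defaultdict(int)
    for row in rows:
        size = row.get("store.size")
        if size is None:
            continue  # some indices (for example closed ones) report no size
        totals[index_family(row["index"])] += int(size)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The families at the top of the returned list are the first candidates for a shorter retention period in the next step.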

2) Reduce data retention using PATCH /onboarding/tenants

Body syntax:
{
  "product_id":"ao",
  "retention_period": <retention_days>,
  "tenant_id":"<tenant_id>",
  "tenant_doc_type_details":[
    {
      "doc_type_id":"<doc_type#1>",
      "doc_type_version":"<doc_type_version#1>",
      "retention_period":<doc_type_retention_days>
    },
    {
      "doc_type_id":"<doc_type#2>",
      "doc_type_version":"<doc_type_version#2>",
      "retention_period":<doc_type_retention_days>
    }

    ...
  ]
}

How do you obtain the doc_type and doc_type_version for specific indices?

In this example, we are looking for the doc_type definition of the UIM metric index:

Execute: GET /onboarding/doc_type(product_id='{product_id}')

Click Try it out

product_id = ao

Click Execute

We can use the browser's search function to locate the doc_type definition, in this example "itoa_metrics_uim":

We can now proceed to change the retention at the tenant and doc type level, for example: tenant retention = 20, metrics_uim = 2 and metrics_anomaly = 15

{
  "product_id":"ao",
  "retention_period":20,
  "tenant_id":"<your_tenant_id>",
  "tenant_doc_type_details":[
    {
      "doc_type_id":"itoa_metrics_uim",
      "doc_type_version":"1",
      "retention_period":2
    },
    {
      "doc_type_id":"itoa_metrics_anomaly",
      "doc_type_version":"1",
      "retention_period":15
    }
  ]
}

Click Execute, expected Code Result = 204 

Verify the change, execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')

Expected Code Result = 200
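
When several doc types need different retention periods, building the body by hand is error-prone. The following Python sketch generates it from a small mapping. The function name and the mapping format are illustrative assumptions. The field names are the ones shown in the body syntax above.

```python
import json


def retention_body(tenant_id, tenant_days, doc_type_days, product_id="ao"):
    """Build the PATCH body with tenant-level and doc-type-level retention.

    doc_type_days maps (doc_type_id, doc_type_version) to retention in days.
    """
    for days in [tenant_days, *doc_type_days.values()]:
        if not isinstance(days, int) or days < 1:
            raise ValueError("retention periods must be positive integers")
    details = [
        {
            "doc_type_id": doc_type_id,
            "doc_type_version": version,
            "retention_period": days,
        }
        for (doc_type_id, version), days in doc_type_days.items()
    ]
    return {
        "product_id": product_id,
        "retention_period": tenant_days,
        "tenant_id": tenant_id,
        "tenant_doc_type_details": details,
    }


# Same values as the example above: tenant = 20, metrics_uim = 2, metrics_anomaly = 15.
example = retention_body(
    "<your_tenant_id>",
    20,
    {("itoa_metrics_uim", "1"): 2, ("itoa_metrics_anomaly", "1"): 15},
)
print(json.dumps(example, indent=2))
```

The printed JSON can be pasted into the PATCH /onboarding/tenants body after replacing the tenant_id placeholder.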

 

F) How to delete specific old indices immediately?

If you need to free up space as soon as possible, you can delete one or more of the problematic indices by using:

curl -X DELETE http(s)://{es_endpoint}/<index_name>

In the example below, we have found that the UIM and Anomaly indices are the problematic ones:

http(s)://{es_endpoint}/_cat/indices/*metrics*?s=index,cds&h=index,ss,cds

First, we identify the oldest indices, in this example:

ao_itoa_metrics_anomaly_1_9
ao_itoa_metrics_anomaly_1_10
ao_itoa_metrics_uim_1_8
ao_itoa_metrics_uim_1_9

Then, we execute curl -X DELETE http(s)://{es_endpoint}/<index_name> for each of them, as below:

curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_anomaly_1_9
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_anomaly_1_10
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_uim_1_8
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_uim_1_9

IMPORTANT: Always delete the oldest indices.
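
Choosing the oldest indices can also be done programmatically. The following Python sketch is a cautious illustration. It assumes that the rows come from `_cat/indices/<pattern>?format=json&h=index,creation.date`, where creation.date is the creation time in epoch milliseconds. The helper names are hypothetical. The sketch only prepares the DELETE requests and sends nothing by itself, and it rejects wildcards so that a typing mistake cannot remove more than one index.

```python
import urllib.request


def oldest_indices(rows, keep):
    """Return the names of all but the `keep` newest indices, oldest first.

    rows is the decoded JSON list returned by
    _cat/indices/<pattern>?format=json&h=index,creation.date
    """
    if keep < 0:
        raise ValueError("keep must not be negative")
    ordered = sorted(rows, key=lambda row: int(row["creation.date"]))
    cutoff = max(len(ordered) - keep, 0)
    return [row["index"] for row in ordered[:cutoff]]


def delete_request(endpoint, index_name):
    """Prepare (but do not send) a DELETE request for exactly one index."""
    if not index_name or index_name == "_all" or any(c in index_name for c in "*,"):
        raise ValueError(f"refusing to delete ambiguous target: {index_name!r}")
    return urllib.request.Request(
        f"{endpoint.rstrip('/')}/{index_name}", method="DELETE"
    )
```

Review the list returned by `oldest_indices` before sending anything. Each prepared request can then be sent with `urllib.request.urlopen`, which is equivalent to the curl commands above.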

Additional Information

https://knowledge.broadcom.com/external/article/190815/aiops-troubleshooting-common-issues-and.html