Symptoms:
- No alarms or metrics from integration products
- No tickets being created in ServiceNow
- Cannot access Kubernetes / Openshift / DX Console / OI Console.
- Pods crashing
- NFS filesystem full
The following is a high-level list of techniques and suggestions to reduce data retention for Elastic:
A) Check Elastic Stats
B) Change data retention for all tenants
C) Change data retention for a specific tenant
D) Change data retention for specific Elastic indices
E) Disable or reduce Elastic snapshots
F) How to delete specific old indices immediately?
AIOps Data Stores and Flow Interactions
DX Platform 20.x (DX OI, DX APM, DX AXA)
A) Check Elastic Stats
NOTE: replace http(s)://{es_endpoint} with your own Elastic endpoint
a) Check Elastic indices by size:
http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
For example:
b) Check Elastic health:
http(s)://{es_endpoint}/_cluster/health?pretty&human
For example:
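The same checks can be run from any host that can reach Elasticsearch using curl. This is a minimal sketch; replace http with https if TLS is enabled on your Elastic endpoint, and quote the URLs so the shell does not interpret the ? and & characters:
curl -s "http://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds"
curl -s "http://{es_endpoint}/_cluster/health?pretty&human"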
Recommendations:
- Increase disk space on the NFS server
- Reduce data retention as documented in the following sections
B) Change data retention for all tenants
The default retention period is 45 days.
"In the OpenShift Web Console, go to the Digital Operational Intelligence project.
Go to Applications, Deployments, doireadserver.
Select Environment to view the environment variables.
Set the value of JARVIS_TENANT_RETENTION_PERIOD as needed.
Click Save."
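If you prefer the CLI over the web console, the same environment variable can usually be set with oc set env. This is a sketch only; it assumes doireadserver is exposed as a DeploymentConfig in your DOI project (use deployment/doireadserver and your actual project name if your install differs), and the change typically triggers a redeployment of the pod:
oc project <doi-project>
oc set env dc/doireadserver JARVIS_TENANT_RETENTION_PERIOD=<# of days>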
C) Change data retention for a specific tenant
a) Obtain the tenant_id
NOTE: replace http(s)://{es_endpoint} with your own Elastic endpoint
http(s)://{es_endpoint}/ao_tenants_1_1/_search?size=200&pretty
For example
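A hedged command-line equivalent of the tenant lookup above (replace http with https if TLS is enabled); the tenant_id values appear in the returned tenant documents:
curl -s "http://{es_endpoint}/ao_tenants_1_1/_search?size=200&pretty"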
b) Go to Jarvis API onboarding
http(s)://<jarvis-api-endpoint>
c) Change data retention; in this example, we reduce data retention from the default 45 days to 25 days
Execute: PATCH /onboarding/tenants
Body:
{
"product_id":"ao",
"retention_period":<# of days>,
"tenant_id":"<tenant_id>"
}
Example:
{
"product_id":"ao",
"retention_period":25,
"tenant_id":"66C5014F-4D40-4D2B-9882-1CD57DA67D47"
}
Click Execute
Expected Code Result = 204
Verify the change:
Execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')
Expected Code Result = 200, in this example:
NOTE: If the Jarvis API endpoint is not available, connect to your OpenShift/Kubernetes master node and execute the above operations using curl, as below:
curl -v -X PATCH -H "Content-Type: application/json" -H "Cache-Control: no-cache" -d '{"product_id" : "ao","tenant_id": "<TENANT-ID>","retention_period": 25}' http://<jarvis-api-endpoint>/onboarding/tenants
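To verify the change from the command line as well, a GET against the same endpoint should return 200 and show the new retention_period. A sketch only; the parentheses and quotes may need to be URL-encoded in some environments:
curl -v "http://<jarvis-api-endpoint>/onboarding/tenants(product_id='ao',tenant_id='<TENANT-ID>')"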
D) Change data retention for specific Elastic indices, for example:
metrics_uim to 2 days
metrics_anomaly to 30 days
NOTE: replace http(s)://{es_endpoint} with your own Elastic endpoint
1) Identify which integrations or features are causing the high ingestion of data (UIM, Spectrum, CA Performance Management, CA APM, logs, anomalies, etc.)
To list all indices by creation date:
http(s)://{es_endpoint}/_cat/indices/?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
To list all indices by size:
http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
You can narrow the search by filtering on specific indices. In this example, the UIM integration and the Anomaly DSP feature were found to be causing the high data ingestion behind the disk space issue:
To list UIM indices only:
http(s)://{es_endpoint}/_cat/indices/*uim*?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
To list anomaly indices:
http(s)://{es_endpoint}/_cat/indices/*anomaly*?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
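If you run these checks with curl instead of a browser, quote the URL so the shell does not expand the * wildcard or interpret ? and &. A minimal sketch for the UIM filter (replace http with https if TLS is enabled):
curl -s "http://{es_endpoint}/_cat/indices/*uim*?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds"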
2) Reduce data retention using Jarvis REST API
a) Obtain the tenant_id
http(s)://{es_endpoint}/ao_tenants_1_1/_search?size=200&pretty
For example
b) Go to Jarvis API onboarding
http(s)://<jarvis-api-endpoint>
c) Display tenant configuration
Execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')
Click Try it out
product_id = ao
tenant_id = <your_tenant_id>
Click Execute
In this example, we can see that the tenant is using the default data retention of 45 days.
d) In this example, we need to reduce data retention for the UIM and anomaly indices as below:
metrics_uim to 2 days
metrics_anomaly to 15 days
Execute: PATCH /onboarding/tenants
Body syntax:
{
"product_id":"ao",
"retention_period": <retention_days>,
"tenant_id":"<tenant_id>",
"tenant_doc_type_details":[
{
"doc_type_id":"<doc_type#1>",
"doc_type_version":"<doc_type_version#1>",
"retention_period":<doc_type_rention_days>
},
{
"doc_type_id":"<doc_type#2>",
"doc_type_version":"<doc_type_version#2>",
"retention_period":<doc_type_rention_days>
}
...
]
}
How do you obtain the doc_type and doc_type_version for specific indices?
In this example, we are looking for the doc_type definition of the UIM metric index:
Execute: GET /onboarding/doc_type(product_id='{product_id}')
Click Try it out
product_id = ao
Click Execute
We can use the browser search to locate the doc_type definition, in this example "itoa_metrics_uim":
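The same lookup can be done with curl if the Swagger UI is not convenient; a sketch only (the parentheses and quotes may need to be URL-encoded in some environments). Search the returned output for "itoa_metrics_uim", or pipe it through a JSON formatter if one is available:
curl -s "http://<jarvis-api-endpoint>/onboarding/doc_type(product_id='ao')"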
We can now proceed to change the retention at the tenant and doc type level, for example: tenant retention = 20 days, metrics_uim = 2 days, and metrics_anomaly = 15 days:
{ "product_id":"ao",
"retention_period":20,
"tenant_id":"66C5014F-4D40-4D2B-9882-1CD57DA67D47",
"tenant_doc_type_details":[
{
"doc_type_id":"itoa_metrics_uim",
"doc_type_version":"1",
"retention_period":2
},
{
"doc_type_id":"itoa_metrics_anomaly",
"doc_type_version":"1",
"retention_period":15
}
]
}
Click Execute
Expected Code Result = 204
Verify the change:
Execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')
Expected Code Result = 200, in this example:
NOTE: If the Jarvis API endpoint is not available, connect to your OpenShift/Kubernetes master node and execute the above operations using curl, as below:
curl -v -X PATCH -H "Content-Type: application/json" -H "Cache-Control: no-cache" -d '{"product_id": "ao","tenant_id": "<TENANT-ID>","retention_period": 25,"tenant_doc_type_details": [{"doc_type_id":"<doc_type_a>","doc_type_version":"1","retention_period": <retention_period_in_days>}, {"doc_type_id":"<doc_type_b>","doc_type_version":"1","retention_period": <retention_period_in_days>}]}' http://<jarvis-api-endpoint>/onboarding/tenants
e) Check when the Purge task will be executed
If Openshift: oc describe po <jarvis-esutils-pod> | grep PURGE
If Kubernetes: kubectl describe po <jarvis-esutils-pod> | grep PURGE
Example:
oc get pods | grep esutils
jarvis-esutils-5c6c695cc5-qr64c 1/1 Running 0 23h
oc describe po jarvis-esutils-5c6c695cc5-qr64c | grep PURGE
BATCH_PURGE_CRON: 0 0 3 * * ?
PURGE_CRON: 0 0 21 * * ?
The above means that the purge task starts at 9 PM (the hours field of PURGE_CRON is 21). Once the purge has completed, we can verify the results by looking at:
- The list of indices: you will notice that old indices have been deleted
http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
- NFS and Elastic server disk space; run:
df -h
- Check the esutils logs
If Openshift:
oc logs -f <jarvis-esutils-pod>
If kubernetes:
kubectl logs -f <jarvis-esutils-pod>
Or we can access the logs directly from the NFS server: <NFS-folder>/jarvis/esutils/logs/<jarvis-esutils-pod>/
for example:
oc get pods | grep jarvis-esutils
jarvis-esutils-754768fb68-j55lk 1/1 Running 0 12d
cd /nfs/ca/dxi/jarvis/esutils/logs/jarvis-esutils-754768fb68-j55lk
Below is the list of logs that help you verify that all jarvis-esutils tasks completed correctly:
- jarvis-es-utils.log
- jarvis-es-utils-Rollover.log
- jarvis-es-utils-Purge.log
For example:
tail -f jarvis-es-utils-Purge.log
2021-01-23 21:00:07 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:67 - Staring the purge process...
2021-01-23 21:00:07 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:74 - Product found: ao
2021-01-23 21:00:07 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:78 - Starting purge for product: ao, cluster jarvis_main_es
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_axa_users_by_week_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_axa_users_by_week_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_logs_log4j_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_logs_log4j_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_metrics_anomaly_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_metrics_anomaly_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_inventory_servicenow_ci_sa_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_inventory_servicenow_ci_sa_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_logs_apache_error_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_logs_apache_error_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_metrics_agg_level2_temp_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_alarms_all_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_alarms_all_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_logs_zos_syslog_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_logs_zos_syslog_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_metrics_prediction_sa_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_metrics_predic
...
If you find an error or exception, open a support case with Broadcom Support and attach the above logs.
E) Disable or reduce Elastic snapshots
a) Identify the ES backup folder:
http(s)://{es_endpoint}/_snapshot/_all?pretty
For example:
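If the console is not handy, the same information can be retrieved with curl; a sketch (replace http with https if TLS is enabled). The repository "location" setting in the response identifies the backup folder on disk, which in this environment is /opt/ca/elasticsearch/backup:
curl -s "http://{es_endpoint}/_snapshot/_all?pretty"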
b) Find out the total space used:
Go to the jarvis-elasticsearch pod > Terminal
Or exec into the pod:
oc get pods | grep elastic
jarvis-elasticsearch-7dcf58587f-zcpqz 1/1 Running 0 12d
oc exec -ti jarvis-elasticsearch-7dcf58587f-zcpqz sh
cd /opt/ca/elasticsearch/backup
du -sh
c) Delete the contents of the /opt/ca/elasticsearch/backup folder and all subdirectories. This will NOT impact active Elastic data.
cd /opt/ca/elasticsearch/backup
rm -rf *
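Because rm -rf * is destructive, it is worth confirming the working directory and noting its size before deleting anything; a minimal sketch:
cd /opt/ca/elasticsearch/backup
pwd       # confirm you are inside the backup folder, not the live data path
du -sh .  # note how much space will be reclaimed
rm -rf ./*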
d) Reconfigure the snapshot settings so that this feature backs up data for specific indices only:
Go to the "jarvis-esutils" deployment, Environment
Adjust the "EXCLUDE_INDICES" setting as needed
Here is an example when using OpenShift:
IMPORTANT: This setting uses Java-based regex, so ".*" is not the same as "*". If the dot (.) is missing from your regular expression, snapshots will keep happening.
e) Disable snapshots by setting the backup to start in a future year, for example: "SNAPSHOT_CRON" = 0 0 23 * * ? 2035
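For reference, both settings can also be changed from the CLI with oc set env; this is a sketch only, assuming jarvis-esutils is exposed as a deployment (use dc/jarvis-esutils if it is a DeploymentConfig), and the EXCLUDE_INDICES value shown is just a placeholder for your own Java regex:
oc set env deployment/jarvis-esutils EXCLUDE_INDICES='<your_java_regex>' SNAPSHOT_CRON='0 0 23 * * ? 2035'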
F) How to delete specific old indices immediately?
If you need space available as soon as possible, you can delete one or more of the problematic indices using:
curl -X DELETE http(s)://{es_endpoint}/<index_name>
In the example below, we found that UIM and Anomaly are the problematic indices:
http(s)://{es_endpoint}/_cat/indices/*metrics*?s=index,cds&h=index,ss,cds
First, we identify the oldest indices; in this example:
ao_itoa_metrics_anomaly_1_9
ao_itoa_metrics_anomaly_1_10
ao_itoa_metrics_uim_1_8
ao_itoa_metrics_uim_1_9
Then, we execute curl -X DELETE http(s)://{es_endpoint}/<index_name> for each index, as below:
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_anomaly_1_9
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_anomaly_1_10
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_uim_1_8
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_uim_1_9
IMPORTANT: always delete the oldest indices.
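Once the DELETE requests return, the reclaimed space can be confirmed by re-listing the indices and re-checking disk usage on the NFS/Elastic nodes, for example:
curl -s "http://{es_endpoint}/_cat/indices/*metrics*?s=index,cds&h=index,ss,cds"
df -h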