DX Platform - ElasticSearch Disk Full - How to reduce Elastic data retention?


Article ID: 207161


Products

DX Operational Intelligence, CA App Experience Analytics, DX Application Performance Management

Issue/Introduction

Symptoms:

- No alarms or metrics from integration products
- No tickets being created in ServiceNow
- Cannot access Kubernetes / OpenShift / DX Console / OI Console
- Pods crashing
- NFS filesystem full

 

Cause

The following is a high-level list of techniques and suggestions to reduce data retention for Elastic:


A) Check Elastic Stats
B) Change data retention for all tenants
C) Change data retention for a specific tenant
D) Change data retention for specific Elastic indices
E) Disable or reduce Elastic snapshots
F) How to delete specific old indices immediately?

 

AIOps Data Stores and Flow Interactions

Environment

DX Platform 20.x (DX OI, DX APM, DX AXA)

Resolution

A) Check Elastic Stats

NOTE: replace http(s)://{es_endpoint} with your own Elastic endpoint

a) Check Elastic indices by size:

http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds


b) Check Elastic health:

http(s)://{es_endpoint}/_cluster/health?pretty&human
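Both endpoints can also be queried from a shell with curl; a minimal sketch (quote the URL so the shell does not interpret the & characters, and adjust the endpoint to your environment):

curl -s 'http://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri,rep,docs.count,index,cds'
curl -s 'http://{es_endpoint}/_cluster/health?pretty&human'

In the health output, check the "status" field (green/yellow/red) and "unassigned_shards"; a red status on a full disk typically means shards can no longer be allocated.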


Recommendations:

- Increase disk space on the NFS server
- Reduce data retention as documented in the following sections



B) Change data retention for all tenants

The default retention period is 45 days.

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/digital-operational-intelligence/20-2/configuring/configure-data-retention.html 

"In the OpenShift Web Console, go to the Digital Operational Intelligence project.
Go to Applications, Deployments, doireadserver.
Select Environment to view the environment variables.
Set the value of JARVIS_TENANT_RETENTION_PERIOD as needed.
Click Save."



C) Change data retention for a specific tenant

a) Obtain the tenant_id 

NOTE: replace http(s)://{es_endpoint} with your own Elastic endpoint

http(s)://{es_endpoint}/ao_tenants_1_1/_search?size=200&pretty 

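The same lookup can be done with curl; a sketch that filters the response down to the id lines (the exact field names in the tenant documents may differ, so inspect the full pretty-printed output if grep returns nothing):

curl -s 'http://{es_endpoint}/ao_tenants_1_1/_search?size=200&pretty' | grep -i tenant_id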

b) Go to Jarvis API onboarding

http(s)://<jarvis-api-endpoint>


c) Change data retention; in this example, reduce it from the default 45 days to 25 days

Execute: PATCH /onboarding/tenants

Body:

{
  "product_id":"ao",
  "retention_period":<# of days>,
  "tenant_id":"<tenant_id>"
}

Example:

{
  "product_id":"ao",
  "retention_period":25,
  "tenant_id":"66C5014F-4D40-4D2B-9882-1CD57DA67D47"
}

Click Execute

Expected Code Result = 204 

 

Verify the change:

Execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')

Expected Code Result = 200

 

NOTE: If the Jarvis API endpoint is not available, connect to your OpenShift/Kubernetes master node and execute the above operations using curl, as below:

curl -v -X PATCH -H "Content-Type: application/json" -H "Cache-Control: no-cache" -d '{"product_id" : "ao","tenant_id": "<TENANT-ID>","retention_period": 25}' http://<jarvis-api-endpoint>/onboarding/tenants
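The change can be verified from the same shell; note the double quotes around the URL, because the parentheses would otherwise be interpreted by the shell:

curl -s "http://<jarvis-api-endpoint>/onboarding/tenants(product_id='ao',tenant_id='<TENANT-ID>')"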



D) Change data retention for specific Elastic indices, for example:
metrics_uim to 2 days
metrics_anomaly to 15 days

https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/digital-operational-intelligence/20-2/configuring/configure-data-retention.html 

 

NOTE: replace http(s)://{es_endpoint} with your own Elastic endpoint

1) Identify which integrations or features are causing the high data ingestion (UIM, Spectrum, CAPM, CA APM, logs, anomalies, etc.)

To list all indices by creation date:

http(s)://{es_endpoint}/_cat/indices/?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds

To list all indices by size:

http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds

 

We can narrow the search by filtering on specific indices. In this example, the UIM integration and the Anomaly DSP feature were found to be ingesting large volumes of data, causing the disk space issue:

To list UIM indices only:

http(s)://{es_endpoint}/_cat/indices/*uim*?v&s=cd:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds

To list anomaly indices:

http(s)://{es_endpoint}/_cat/indices/*anomaly*?v&s=cd:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds

2) Reduce data retention using Jarvis REST API 

a) Obtain the tenant_id 

http(s)://{es_endpoint}/ao_tenants_1_1/_search?size=200&pretty 


b) Go to Jarvis API onboarding

http(s)://<jarvis-api-endpoint>

c) Display tenant configuration

Execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')

Click Try it out

product_id = ao
tenant_id = <your_tenant_id>

Click Execute:

In this example, we can see that the tenant is using the default data retention of 45 days

d) In this example, we need to reduce data retention for uim and anomaly indices as below:

metrics_uim to 2 days
metrics_anomaly to 15 days


Execute: PATCH /onboarding/tenants

Body syntax:
{
  "product_id":"ao",
  "retention_period": <retention_days>,
  "tenant_id":"<tenant_id>",
  "tenant_doc_type_details":[
    {
      "doc_type_id":"<doc_type#1>",
      "doc_type_version":"<doc_type_version#1>",
      "retention_period":<doc_type_rention_days>
    },
    {
      "doc_type_id":"<doc_type#2>",
      "doc_type_version":"<doc_type_version#2>",
      "retention_period":<doc_type_rention_days>
    }

    ...
  ]
}

 

How do you obtain the doc_type and doc_type_version for specific indices?

In this example, we are looking for the doc_type definition of the UIM metric index:

Execute: GET /onboarding/doc_type(product_id='{product_id}')

Click Try it out

product_id = ao

Click Execute

 

We can use the browser search to locate the doc_type definition, in this example "itoa_metrics_uim":
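If the Jarvis UI is not available, the same lookup can be done from a shell; a sketch using curl and grep (the grep pattern assumes the response contains "doc_type_id" keys; inspect the raw response if it does not match):

curl -s "http://<jarvis-api-endpoint>/onboarding/doc_type(product_id='ao')" | grep -o '"doc_type_id":"[^"]*"'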

We can now proceed to change the retention at the tenant and doc type level, for example: tenant retention = 20, metrics_uim = 2, and metrics_anomaly = 15

 { "product_id":"ao",
  "retention_period":20,
  "tenant_id":"66C5014F-4D40-4D2B-9882-1CD57DA67D47",
  "tenant_doc_type_details":[
    {
      "doc_type_id":"itoa_metrics_uim",
      "doc_type_version":"1",
      "retention_period":2
    },
    {
      "doc_type_id":"itoa_metrics_anomaly",
      "doc_type_version":"1",
      "retention_period":15
    }
  ]
}

Click Execute

Expected Code Result = 204 

 

Verify the change:

Execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')

Expected Code Result = 200

 

NOTE: If the Jarvis API endpoint is not available, connect to your OpenShift/Kubernetes master node and execute the above operations using curl, as below:

curl -v -X PATCH -H "Content-Type: application/json" -H "Cache-Control: no-cache" -d '{"product_id":"ao","tenant_id":"<TENANT-ID>","retention_period":25,"tenant_doc_type_details":[{"doc_type_id":"<doc_type_a>","doc_type_version":"1","retention_period":<retention_period_in_days>},{"doc_type_id":"<doc_type_b>","doc_type_version":"1","retention_period":<retention_period_in_days>}]}' http://<jarvis-api-endpoint>/onboarding/tenants

 

e) Check when the Purge task will be executed

If Openshift: oc describe po <jarvis-esutils-pod> | grep PURGE

If Kubernetes: kubectl describe po <jarvis-esutils-pod> | grep PURGE

Example:

oc get pods | grep esutils
jarvis-esutils-5c6c695cc5-qr64c                        1/1       Running     0          23h

oc describe po jarvis-esutils-5c6c695cc5-qr64c | grep PURGE
      BATCH_PURGE_CRON:          0 0 3 * * ?
      PURGE_CRON:                0 0 21 * * ?
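
These values appear to be Quartz-style cron expressions: seconds, minutes, hours, day-of-month, month, day-of-week, optional year, where "?" means no specific value. Read this way:

BATCH_PURGE_CRON: 0 0 3 * * ?    -> every day at 03:00 (3 AM)
PURGE_CRON:       0 0 21 * * ?   -> every day at 21:00 (9 PM)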

The above means that the purge task starts at 9 PM. Once the maintenance is completed, we can verify the results by checking:

- List of indices: you will notice that the old indices have been deleted

http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds

- NFS and Elastic servers disk space; run:

df -h

- Check the esutils logs

If Openshift:

oc logs -f <jarvis-esutils-pod>

If kubernetes: 

kubectl logs -f <jarvis-esutils-pod>


Alternatively, we can access the logs directly from the NFS server: <NFS-folder>/jarvis/esutils/logs/<jarvis-esutils-pod>/

For example:

oc get pods | grep jarvis-esutils
jarvis-esutils-754768fb68-j55lk                        1/1       Running     0          12d

cd /nfs/ca/dxi/jarvis/esutils/logs/jarvis-esutils-754768fb68-j55lk


Below is the list of logs that help you verify that all Jarvis esutils tasks completed correctly:

- jarvis-es-utils.log
- jarvis-es-utils-Rollover.log
- jarvis-es-utils-Purge.log

For example:

tail -f jarvis-es-utils-Purge.log

2021-01-23 21:00:07 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:67 - Staring the purge process...
2021-01-23 21:00:07 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:74 - Product found: ao
2021-01-23 21:00:07 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:78 - Starting purge for product: ao, cluster jarvis_main_es
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_axa_users_by_week_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_axa_users_by_week_1_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_logs_log4j_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_logs_log4j_1_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_metrics_anomaly_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_metrics_anomaly_1_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_inventory_servicenow_ci_sa_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_inventory_servicenow_ci_sa_1_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_logs_apache_error_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_logs_apache_error_1_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_metrics_agg_level2_temp_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_alarms_all_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_alarms_all_1_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_logs_zos_syslog_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_logs_zos_syslog_1_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_metrics_prediction_sa_1
2021-01-23 21:00:08 INFO  [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_metrics_predic

...

If you find an error or exception, open a support case with Broadcom Support and attach the above logs.

 

E) Disable or reduce Elastic snapshots

a) Identify the ES backup folder:

http(s)://{es_endpoint}/_snapshot/_all?pretty
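This endpoint can also be queried with curl; for a snapshot repository of type "fs", the backup folder appears under settings.location in the response:

curl -s 'http://{es_endpoint}/_snapshot/_all?pretty'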




b) Find out total space used:

Go to jarvis-elasticsearch pod > Terminal

Or exec into the pod:

oc get pods | grep elastic
jarvis-elasticsearch-7dcf58587f-zcpqz                  1/1       Running     0          12d

oc exec -ti jarvis-elasticsearch-7dcf58587f-zcpqz sh


cd /opt/ca/elasticsearch/backup
du -sh


c) Delete the content of /opt/ca/elasticsearch/backup folder and all subdirectories. This will NOT impact active Elastic data

cd /opt/ca/elasticsearch/backup
rm -rf *

d) Reconfigure the snapshot settings so that this feature backs up data for specific indices only:

Go to the "jarvis-esutils" deployment > Environment

Adjust the "EXCLUDE_INDICES" setting as needed.
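On OpenShift this can also be done from the command line with oc; a sketch, assuming the jarvis-esutils deployment named above (the pattern is illustrative only and would exclude the UIM metric indices from snapshots):

oc set env deployment/jarvis-esutils EXCLUDE_INDICES='.*metrics_uim.*'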

 

IMPORTANT: We use Java-based regex, so ".*" is not the same as "*". If the dot (.) is missing from your regular expression, snapshots will keep happening.

e) Disable snapshots by setting the backup to start in a future year, for example: "SNAPSHOT_CRON" = 0 0 23 * * ? 2035
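
The same approach works for the cron variable; a sketch (quote the value so the shell preserves the spaces and the "?"):

oc set env deployment/jarvis-esutils SNAPSHOT_CRON='0 0 23 * * ? 2035'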




F) How to delete specific old indices immediately?

If you need space available as soon as possible, you can delete one or more of the problematic indices by using:

curl -X DELETE http(s)://{es_endpoint}/<index_name>

In the example below, we have found that UIM and Anomaly are the problematic indices:

http(s)://{es_endpoint}/_cat/indices/*metrics*?s=index,cds&h=index,ss,cds

First, we identify the oldest indices, in this example:

ao_itoa_metrics_anomaly_1_9
ao_itoa_metrics_anomaly_1_10
ao_itoa_metrics_uim_1_8
ao_itoa_metrics_uim_1_9

Then, we execute curl -X DELETE http(s)://{es_endpoint}/<index_name> as below:

curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_anomaly_1_9
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_anomaly_1_10
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_uim_1_8
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_uim_1_9
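
When there are many indices to remove, the deletes can be scripted; a sketch, assuming a reachable endpoint and a hand-verified list of index names (double-check each name first: deletes are immediate and irreversible):

for idx in ao_itoa_metrics_anomaly_1_9 ao_itoa_metrics_anomaly_1_10 ao_itoa_metrics_uim_1_8 ao_itoa_metrics_uim_1_9; do
  curl -X DELETE "http://{es_endpoint}/${idx}"
done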

IMPORTANT: always delete the oldest indices first

 

Additional Information

NFS or Elastic Nodes disk full - How to reduce Elastic data retention in DOI 1.3.2
