Symptoms:
- No alarms or metrics from integration products
- No tickets being created in ServiceNow
- Cannot access Kubernetes / Openshift / DX Console / OI Console.
- Pods crashing
- NFS filesystem full
The following is a high-level list of techniques and suggestions to reduce data retention for Elastic:
A) Check Elastic Stats
B) Change data retention for all tenants
C) Change data retention for a specific tenant
D) Change data retention for specific Elastic indices
E) Disable or reduce Elastic snapshots
F) How to delete specific old indices immediately?
AIOps Data Stores and Flow Interactions
DX Platform 20.x (DX OI, DX APM, DX AXA)
A) Check Elastic Stats
NOTE: replace http(s)://{es_endpoint} with your own Elastic endpoint
a) Check Elastic indices by size:
http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
For example:
b) Check Elastic health:
http(s)://{es_endpoint}/_cluster/health?pretty&human
For example:
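The same checks can be run from any host that can reach Elasticsearch using curl. This is a minimal sketch; replace http with https if TLS is enabled on your Elastic endpoint, and quote the URLs so the shell does not interpret the ? and & characters:
curl -s "http://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds"
curl -s "http://{es_endpoint}/_cluster/health?pretty&human"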
Recommendations:
- Increase disk space on the NFS server
- Reduce data retention as documented in the following sections
B) Change data retention for all tenants
The default retention period is 45 days.
"In the OpenShift Web Console, go to the Digital Operational Intelligence project.
Go to Applications, Deployments, doireadserver.
Select Environment to view the environment variables.
Set the value of JARVIS_TENANT_RETENTION_PERIOD as needed.
Click Save."
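If you prefer the CLI over the web console, the same environment variable can usually be set with oc set env. This is a sketch only; it assumes doireadserver is exposed as a DeploymentConfig in your DOI project (use deployment/doireadserver and your actual project name if your install differs), and the change typically triggers a redeployment of the pod:
oc project <doi-project>
oc set env dc/doireadserver JARVIS_TENANT_RETENTION_PERIOD=<# of days>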
C) Change data retention for a specific tenant
a) Obtain the tenant_id
NOTE: replace http(s)://{es_endpoint} with your own Elastic endpoint
http(s)://{es_endpoint}/ao_tenants_1_1/_search?size=200&pretty
For example
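A hedged command-line equivalent of the tenant lookup above (replace http with https if TLS is enabled); the tenant_id values appear in the returned tenant documents:
curl -s "http://{es_endpoint}/ao_tenants_1_1/_search?size=200&pretty"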
b) Go to Jarvis API onboarding
http(s)://<jarvis-api-endpoint>
c) Change data retention; in this example, we reduce data retention from the default 45 days to 25 days
Execute: PATCH /onboarding/tenants
Body:
{
"product_id":"ao",
"retention_period":<# of days>,
"tenant_id":"<tenant_id>"
}
Example:
{
"product_id":"ao",
"retention_period":25,
"tenant_id":"66C5014F-4D40-4D2B-9882-1CD57DA67D47"
}
Click Execute
Expected Code Result = 204
Verify the change:
Execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')
Expected Code Result = 200, in this example:
NOTE: If the Jarvis API endpoint is not available, connect to your OpenShift/Kubernetes master node and execute the above operations using curl, as below:
curl -v -X PATCH -H "Content-Type: application/json" -H "Cache-Control: no-cache" -d '{"product_id" : "ao","tenant_id": "<TENANT-ID>","retention_period": 25}' http://<jarvis-api-endpoint>/onboarding/tenants
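To verify the change from the command line as well, a GET against the same endpoint should return 200 and show the new retention_period. A sketch only; the parentheses and quotes may need to be URL-encoded in some environments:
curl -v "http://<jarvis-api-endpoint>/onboarding/tenants(product_id='ao',tenant_id='<TENANT-ID>')"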
D) Change data retention for specific Elastic indices, for example:
metrics_uim to 2 days
metrics_anomaly to 30 days
NOTE: replace http(s)://{es_endpoint} with your own Elastic endpoint
1) Identify which integrations or features are causing the high ingestion of data (UIM, Spectrum, CA Performance Management, CA APM, logs, anomalies, etc.)
To list all indices by creation date:
http(s)://{es_endpoint}/_cat/indices/?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
To list all indices by size:
http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
You can narrow the search by filtering on specific indices. In this example, the UIM integration and the Anomaly DSP feature were found to be causing the high data ingestion behind the disk space issue:
To list UIM indices only:
http(s)://{es_endpoint}/_cat/indices/*uim*?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
To list anomaly indices:
http(s)://{es_endpoint}/_cat/indices/*anomaly*?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
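If you run these checks with curl instead of a browser, quote the URL so the shell does not expand the * wildcard or interpret ? and &. A minimal sketch for the UIM filter (replace http with https if TLS is enabled):
curl -s "http://{es_endpoint}/_cat/indices/*uim*?v&s=cds:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds"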
2) Reduce data retention using Jarvis REST API
a) Obtain the tenant_id
http(s)://{es_endpoint}/ao_tenants_1_1/_search?size=200&pretty
For example
b) Go to Jarvis API onboarding
http(s)://<jarvis-api-endpoint>
c) Display tenant configuration
Execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')
Click Try it out
product_id = ao
tenant_id = <your_tenant_id>
Click Execute
In this example, we can see that the tenant is using the default data retention of 45 days.
d) In this example, we need to reduce data retention for the UIM and anomaly indices as below:
metrics_uim to 2 days
metrics_anomaly to 15 days
Execute: PATCH /onboarding/tenants
Body syntax:
{
"product_id":"ao",
"retention_period": <retention_days>,
"tenant_id":"<tenant_id>",
"tenant_doc_type_details":[
{
"doc_type_id":"<doc_type#1>",
"doc_type_version":"<doc_type_version#1>",
"retention_period":<doc_type_rention_days>
},
{
"doc_type_id":"<doc_type#2>",
"doc_type_version":"<doc_type_version#2>",
"retention_period":<doc_type_rention_days>
}
...
]
}
How do you obtain the doc_type and doc_type_version for specific indices?
In this example, we are looking for the doc_type definition of the UIM metric index:
Execute: GET /onboarding/doc_type(product_id='{product_id}')
Click Try it out
product_id = ao
Click Execute
We can use the browser search to locate the doc_type definition, in this example "itoa_metrics_uim":
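The same lookup can be done with curl if the Swagger UI is not convenient; a sketch only (the parentheses and quotes may need to be URL-encoded in some environments). Search the returned output for "itoa_metrics_uim", or pipe it through a JSON formatter if one is available:
curl -s "http://<jarvis-api-endpoint>/onboarding/doc_type(product_id='ao')"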
We can now proceed to change the retention at the tenant and doc type level, for example: tenant retention = 20 days, metrics_uim = 2 days, and metrics_anomaly = 15 days:
{ "product_id":"ao",
"retention_period":20,
"tenant_id":"66C5014F-4D40-4D2B-9882-1CD57DA67D47",
"tenant_doc_type_details":[
{
"doc_type_id":"itoa_metrics_uim",
"doc_type_version":"1",
"retention_period":2
},
{
"doc_type_id":"itoa_metrics_anomaly",
"doc_type_version":"1",
"retention_period":15
}
]
}
Click Execute
Expected Code Result = 204
Verify the change:
Execute: GET /onboarding/tenants(product_id='{product_id}',tenant_id='{tenant_id}')
Expected Code Result = 200, in this example:
NOTE: If the Jarvis API endpoint is not available, connect to your OpenShift/Kubernetes master node and execute the above operations using curl, as below:
curl -v -X PATCH -H "Content-Type: application/json" -H "Cache-Control: no-cache" -d '{"product_id": "ao","tenant_id": "<TENANT-ID>","retention_period": 25,"tenant_doc_type_details": [{"doc_type_id":"<doc_type_a>","doc_type_version":"1","retention_period": <retention_period_in_days>}, {"doc_type_id":"<doc_type_b>","doc_type_version":"1","retention_period": <retention_period_in_days>}]}' http://<jarvis-api-endpoint>/onboarding/tenants
e) Check when the Purge task will be executed
If Openshift: oc describe po <jarvis-esutils-pod> | grep PURGE
If Kubernetes: kubectl describe po <jarvis-esutils-pod> | grep PURGE
Example:
oc get pods | grep esutils
jarvis-esutils-5c6c695cc5-qr64c 1/1 Running 0 23h
oc describe po jarvis-esutils-5c6c695cc5-qr64c | grep PURGE
BATCH_PURGE_CRON: 0 0 3 * * ?
PURGE_CRON: 0 0 21 * * ?
The above means that the purge task starts at 9 PM (the hours field of PURGE_CRON is 21). Once the purge has completed, we can verify the results by looking at:
- The list of indices: you will notice that old indices have been deleted
http(s)://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
- NFS and Elastic server disk space; run:
df -h
- Check the esutils logs
If Openshift:
oc logs -f <jarvis-esutils-pod>
If kubernetes:
kubectl logs -f <jarvis-esutils-pod>
Or we can access the logs directly from the NFS server: <NFS-folder>/jarvis/esutils/logs/<jarvis-esutils-pod>/
for example:
oc get pods | grep jarvis-esutils
jarvis-esutils-754768fb68-j55lk 1/1 Running 0 12d
cd /nfs/ca/dxi/jarvis/esutils/logs/jarvis-esutils-754768fb68-j55lk
Below is the list of logs that help you verify that all jarvis-esutils tasks completed correctly:
- jarvis-es-utils.log
- jarvis-es-utils-Rollover.log
- jarvis-es-utils-Purge.log
For example:
tail -f jarvis-es-utils-Purge.log
2021-01-23 21:00:07 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:67 - Staring the purge process...
2021-01-23 21:00:07 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:74 - Product found: ao
2021-01-23 21:00:07 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:78 - Starting purge for product: ao, cluster jarvis_main_es
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_axa_users_by_week_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_axa_users_by_week_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_logs_log4j_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_logs_log4j_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_metrics_anomaly_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_metrics_anomaly_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_inventory_servicenow_ci_sa_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_inventory_servicenow_ci_sa_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_logs_apache_error_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_logs_apache_error_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_metrics_agg_level2_temp_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_alarms_all_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_alarms_all_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_logs_zos_syslog_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_logs_zos_syslog_1_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] PurgeServiceInvoker:116 - Read alias formed: read_ao_itoa_metrics_prediction_sa_1
2021-01-23 21:00:08 INFO [es_data_purge-thread1611435607731] DataPurge:165 - Skipping the current ingestion index: ao_itoa_metrics_predic
...
If you find an error or exception, open a support case with Broadcom Support and attach the above logs.
E) Disable or reduce Elastic snapshots
a) Identify the ES backup folder:
http(s)://{es_endpoint}/_snapshot/_all?pretty
For example:
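If the console is not handy, the same information can be retrieved with curl; a sketch (replace http with https if TLS is enabled). The repository "location" setting in the response identifies the backup folder on disk, which in this environment is /opt/ca/elasticsearch/backup:
curl -s "http://{es_endpoint}/_snapshot/_all?pretty"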
b) Find out the total space used:
Go to the jarvis-elasticsearch pod > Terminal
Or exec into the pod:
oc get pods | grep elastic
jarvis-elasticsearch-7dcf58587f-zcpqz 1/1 Running 0 12d
oc exec -ti jarvis-elasticsearch-7dcf58587f-zcpqz sh
cd /opt/ca/elasticsearch/backup
du -sh
c) Delete the contents of the /opt/ca/elasticsearch/backup folder and all subdirectories. This will NOT impact active Elastic data.
cd /opt/ca/elasticsearch/backup
rm -rf *
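Because rm -rf * is destructive, it is worth confirming the working directory and noting its size before deleting anything; a minimal sketch:
cd /opt/ca/elasticsearch/backup
pwd       # confirm you are inside the backup folder, not the live data path
du -sh .  # note how much space will be reclaimed
rm -rf ./*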
d) Reconfigure the snapshot settings so that this feature backs up data for specific indices only:
Go to the "jarvis-esutils" deployment, Environment
Adjust the "EXCLUDE_INDICES" setting as needed
Here is an example when using OpenShift:
IMPORTANT: This setting uses Java-based regex, so ".*" is not the same as "*". If the dot (.) is missing from your regular expression, snapshots will keep happening.
e) Disable snapshots by setting the backup to start in a future year, for example: "SNAPSHOT_CRON" = 0 0 23 * * ? 2035
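For reference, both settings can also be changed from the CLI with oc set env; this is a sketch only, assuming jarvis-esutils is exposed as a deployment (use dc/jarvis-esutils if it is a DeploymentConfig), and the EXCLUDE_INDICES value shown is just a placeholder for your own Java regex:
oc set env deployment/jarvis-esutils EXCLUDE_INDICES='<your_java_regex>' SNAPSHOT_CRON='0 0 23 * * ? 2035'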
F) How to delete specific old indices immediately?
If you need space available as soon as possible, you can delete one or more of the problematic indices using:
curl -X DELETE http(s)://{es_endpoint}/<index_name>
In the example below, we found that UIM and Anomaly are the problematic indices:
http(s)://{es_endpoint}/_cat/indices/*metrics*?s=index,cds&h=index,ss,cds
First, we identify the oldest indices; in this example:
ao_itoa_metrics_anomaly_1_9
ao_itoa_metrics_anomaly_1_10
ao_itoa_metrics_uim_1_8
ao_itoa_metrics_uim_1_9
Then, we execute curl -X DELETE http(s)://{es_endpoint}/<index_name> for each index, as below:
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_anomaly_1_9
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_anomaly_1_10
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_uim_1_8
curl -X DELETE http(s)://{es_endpoint}/ao_itoa_metrics_uim_1_9
IMPORTANT: always delete the oldest indices.
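Once the DELETE requests return, the reclaimed space can be confirmed by re-listing the indices and re-checking disk usage on the NFS/Elastic nodes, for example:
curl -s "http://{es_endpoint}/_cat/indices/*metrics*?s=index,cds&h=index,ss,cds"
df -h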