Symptoms:
- No alarms or metrics from integration products
- No tickets being created in ServiceNow
- Cannot access the OpenShift/DOI console
- Pods crashing
- Possible data corruption
- NFS file system full
Questions:
1) How to identify if the disk space usage issue is related to Elastic Data or HDFS?
2) How to reduce global data retention?
3) How to reduce data retention for only specific indices, for example:
metrics_uim to 2 days
metrics_anomaly to 30 days?
4) How to verify Elastic snapshot size and how to disable or reduce Elastic snapshots?
5) How to delete specific indices for immediate action?
CA DIGITAL OPERATIONAL INTELLIGENCE - 1.3.x
1) How to identify if the disk space usage issue is related to Elastic Data or HDFS?
For ELASTIC:
You can use the ES queries below for your analysis.
a) Check Elastic allocated disk space and availability:
{es_endpoint}/_cluster/stats?pretty&human&filter_path=**.fs
b) Check the cluster health; this query will clearly tell you if ES is out of disk space:
{es_endpoint}/_cluster/health?pretty&human
c) List the indices sorted by size to see which ones consume the most space:
{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
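As a sketch, assuming the Elastic endpoint is reachable over plain HTTP without authentication (replace {es_endpoint} with your actual Elasticsearch host:port), the same checks can be run from a terminal with curl:
# allocated disk space and availability
curl -s "http://{es_endpoint}/_cluster/stats?pretty&human&filter_path=**.fs"
# cluster health; a non-green status can indicate ES is out of disk space
curl -s "http://{es_endpoint}/_cluster/health?pretty&human"
# indices sorted by size, largest first
curl -s "http://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds"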
Actions to take:
- Increase disk space on the NFS server
- Reduce data retention as documented in the next points
For HDFS:
Hadoop is used to store CPA, PI, and level-1 and level-2 aggregation data; this is the data that the aggregation jobs run on.
First identify the heaviest tenant folders under /opt/ca/jarvis/ao (for example, the top 5), then run the command below for each of those tenants:
"hdfs dfs -du /opt/ca/jarvis/ao/<some folder> | sort -n -r -k 1"
This command will show which doc types are the heaviest.
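As a sketch, assuming the tenant data lives directly under /opt/ca/jarvis/ao as described below, you could first rank the tenant folders by size and then drill into the heaviest ones:
# list tenant folders sorted by size (bytes), keep the top 5
hdfs dfs -du /opt/ca/jarvis/ao | sort -n -r -k 1 | head -5
# then, for each of those tenants, list its doc types, heaviest first
hdfs dfs -du /opt/ca/jarvis/ao/<Tenant_id> | sort -n -r -k 1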
Actions to take:
It is not possible to say whether this amount of data is normal; it depends entirely on how much data is being ingested.
a) Ask the respective team to set the right retention_period for the indices.
b) Delete data manually in HDFS. The format of the data is:
/opt/ca/jarvis/ao/<Tenant_id>/<doc_id>/<doc_version>/<year>/<month>/<date>/<actual data>
- You can delete the data as per your requirements; in the HDFS file browser there is a delete button next to each folder.
- You can delete a whole month of data, for example the 2020/03 folder to remove all of March 2020.
- Otherwise you can go inside each month folder and delete the data per date (a command-line alternative is sketched below).
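If you prefer the command line over the delete button, here is a sketch of an equivalent deletion (assuming, as an example, that you want to remove March 2020 for one doc type; adjust the tenant, doc type, and date parts of the path to your case):
# permanently remove one month of data for a given tenant and doc type
hdfs dfs -rm -r -skipTrash /opt/ca/jarvis/ao/<Tenant_id>/<doc_id>/<doc_version>/2020/03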
2) How to reduce global data retention?
From the product documentation:
"In the OpenShift Web Console,
go to the Digital Operational Intelligence project.
Go to Applications, Deployments, doireadserver.
Select Environment to view the environment variables.
Set the value of JARVIS_TENANT_RETENTION_PERIOD as needed.
Click Save."
3) How to reduce data retention for only specific indices, for example:
metrics_uim to 2 days
metrics_anomaly to 30 days?
1) Verify the number of indices per product and their size; in this example, uim and anomaly:
{es_endpoint}/_cat/indices/*uim*?v&s=cd:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
{es_endpoint}/_cat/indices/*anomaly*?v&s=cd:desc&h=health,store.size,pri.store.size,pri,rep,docs.count,docs.deleted,index,cds
2) Change the retention:
a) Go to Jarvis API onboarding
b) List all tenants to identify the tenant details
GET /onboarding/tenants(product_id='{product_id}')
c) Display the tenant configuration, for example:
Use GET /onboarding/tenants(product_id='{product_id}', tenant_id='{tenant_id}')
We can see the global retention across all indices for the tenant is 45 days
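As a sketch, assuming the Jarvis onboarding API is reachable at a {jarvis_endpoint} URL over plain HTTP (replace it with your actual onboarding endpoint), the two GET calls above can also be issued with curl:
# list all tenants for the ao product
curl -s "http://{jarvis_endpoint}/onboarding/tenants(product_id='ao')"
# display the configuration of one tenant
curl -s "http://{jarvis_endpoint}/onboarding/tenants(product_id='ao',tenant_id='<yourtenant>-USERSTORE')"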
d) Update the tenant definition as below to align with the requirements in this example:
metrics_uim to 2 days
metrics_anomaly to 30 days
PATCH /onboarding/tenants (updateTenant)
{
  "tenant_id": "<yourtenant>-USERSTORE",
  "product_id": "ao",
  "tenant_doc_type_details": [
    {
      "doc_type_id": "itoa_metrics_uim",
      "doc_type_version": "1",
      "retention_period": 2
    },
    {
      "doc_type_id": "itoa_metrics_anomaly",
      "doc_type_version": "1",
      "retention_period": 30
    }
  ],
  "retention_period": 45
}
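A sketch of how this update could be submitted with curl, assuming the same {jarvis_endpoint} as above and that the JSON body above is saved to a file named retention.json (both names are assumptions):
# send the PATCH and print only the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" -X PATCH "http://{jarvis_endpoint}/onboarding/tenants" -H "Content-Type: application/json" -d @retention.json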
Result: an HTTP 204 response indicates success.
The purge task is configured in the esutils deployment; in this example, it is configured to start at 9 PM.
Once maintenance is completed, you can verify the results by looking at:
- Elastic queries against the indices as illustrated above
- NFS directory: /var/nfs/doi/elasticsearch-data-1/indices
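As a quick check of that NFS directory, a sketch to run on the NFS server itself (assuming the export path shown above):
# total size of the Elastic data directory on the NFS server
du -sh /var/nfs/doi/elasticsearch-data-1/indices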
4) How to verify Elastic snapshot size and how to disable or reduce Elastic snapshots?
a) Identify the ES backup folder:
{es_endpoint}/_snapshot/_all?pretty
b) Find out the total space used. Go to the Elasticsearch pod and run:
cd /opt/ca/elasticsearch/backup
du -sh
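If you prefer not to open a shell inside the pod, a sketch of the same check with the oc client (the pod name is a placeholder, replace it with your actual Elasticsearch pod):
# report the total size of the backup folder from outside the pod
oc exec <elasticsearch-pod> -- du -sh /opt/ca/elasticsearch/backup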
c) To free up some space, delete the contents of the /opt/ca/elasticsearch/backup folder and all of its subdirectories. This will not impact active Elastic data:
cd /opt/ca/elasticsearch/backup
rm -rf *
VERY IMPORTANT: only delete data by going into the pod and the /opt/ca/elasticsearch/backup directory; no other directory should be touched.
d) To reduce the amount of data included in snapshots:
Go to the esutils deployment and select Environment.
Set "EXCLUDE_INDICES" = ao_.*
This will greatly reduce the amount of space used for backups.
IMPORTANT: If the dot (.) is omitted, snapshots will keep happening. The value is a Java-based regex, so ".*" is required; using only "*" will not work.
e) To disable snapshots, change the schedule to some future year, for example:
Set "SNAPSHOT_CRON" = 0 0 23 * * ? 2035
NOTE: You may find that some indices appear not to be deleted as expected. For example, if the cleanup took place on April 4, you might notice that the anomaly_1_6 index was not deleted. Why?
Explanation: the CDS column shows the index creation date, but by design the purge task checks when the last document was inserted. Because the next rollover (to the _7 index) happened on April 4, the last document inserted into the _6 index was also on April 4, so that index was not deleted.
5) How to delete specific indices for immediate action?
If for some reason you cannot wait for the purge maintenance and need to delete indices immediately to free up some space, execute:
curl -X DELETE http(s)://{es_endpoint}/<index_name>
For example, you have the below indices, listed with {es_endpoint}/_cat/indices/*metrics*?s=index,cds&h=index,ss,cds
and you want to delete the oldest ones, in this example anomaly_1_9, anomaly_1_10, uim_1_8, and uim_1_9.
Execute curl -X DELETE http(s)://{es_endpoint}/<index_name> for each of the below indices:
ao_itoa_metrics_anomaly_1_9
ao_itoa_metrics_anomaly_1_10
ao_itoa_metrics_uim_1_8
ao_itoa_metrics_uim_1_9
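A sketch of deleting all four in one loop, assuming the Elastic endpoint is reachable over plain HTTP without authentication:
for idx in ao_itoa_metrics_anomaly_1_9 ao_itoa_metrics_anomaly_1_10 ao_itoa_metrics_uim_1_8 ao_itoa_metrics_uim_1_9; do
  # delete one index and print Elastic's JSON response
  curl -s -X DELETE "http://{es_endpoint}/$idx"
  echo
done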