DX OI - How to reduce Elastic data retention - NFS or Elastic Drive Full

Article ID: 188786

Updated On:

Products

DX Operational Intelligence

Issue/Introduction


Symptoms:

- No alarms or metrics from integration products
- No tickets being created in ServiceNow
- Cannot access the OpenShift/DOI Console
- Pods crashing
- Possible data corruption
- NFS file system full


Questions:

1) How to identify if the disk space is related to Elastic Data or HDFS?

2) How to reduce global data retention?

3) How to reduce data retention for only specific indices, for example:
metrics_uim to 2 days
metrics_anomaly to 30 days?

4) How to verify Elastic snapshot size and how to disable or reduce Elastic snapshots?

5) How to delete specific indices for immediate action?

Environment

CA DIGITAL OPERATIONAL INTELLIGENCE - 1.3.x

Resolution

1) How to identify if the disk space usage issue is related to Elastic Data or HDFS?

For ELASTIC:

You can use the ES queries below for your analysis.

a) Check Elastic allocated disk space and availability:

{es_endpoint}/_cluster/stats?pretty&human&filter_path=**.fs

For example, this query will clearly tell you whether ES is out of disk space:


b) Check Elastic Status:

{es_endpoint}/_cluster/health?pretty&human

For example:


c) Check indices by size:

{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,pri,rep,store.size,pri.store.size,docs.count,docs.deleted,index,cds

For example:
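The three queries in a), b), and c) can also be run with curl from any host that can reach Elasticsearch; a minimal sketch, where {es_endpoint} is a placeholder for your Elasticsearch host:port:

curl -s "http://{es_endpoint}/_cluster/stats?pretty&human&filter_path=**.fs"     # disk allocation and free space
curl -s "http://{es_endpoint}/_cluster/health?pretty&human"                      # cluster status (green/yellow/red)
curl -s "http://{es_endpoint}/_cat/indices/?v&s=ss:desc&h=health,pri,rep,store.size,pri.store.size,docs.count,docs.deleted,index,cds"    # indices sorted by size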

 

Actions to take:

- Increase disk space on the NFS server
- Reduce data retention as documented in the next points

 

For HDFS:

Hadoop is used to store CPA, PI, and level-1 and level-2 aggregation data; this is the data that the aggregation jobs run against.

1. Go to the namenode pod terminal
2. cd /opt/ca/hadoop-2.9.1/bin
3. Execute the command
"hdfs dfs -du /opt/ca/jarvis/ao | sort -n -r -k 1"
and check which tenant is consuming the most space

For example:



Take the top 5 tenants, then run the command below for each of them:

"hdfs dfs -du /opt/ca/jarvis/ao/<some folder> | sort -n -r -k 1"

This command shows which doc types are the heaviest:
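As a sketch, the two commands can be chained to show only the largest entries (the tenant folder name is a placeholder; run from /opt/ca/hadoop-2.9.1/bin in the namenode pod):

hdfs dfs -du /opt/ca/jarvis/ao | sort -n -r -k 1 | head -5                     # 5 largest tenants
hdfs dfs -du /opt/ca/jarvis/ao/<tenant_folder> | sort -n -r -k 1 | head -5     # heaviest doc types for one tenant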

Actions to take:

It is not possible to say whether the amount of data is normal; it depends entirely on how much data is being ingested.

a) Ask the respective team to set the right retention_period for the indices
b) Manually delete unwanted data: once you identify the data to clean up as described above, go to the Hadoop admin UI to delete the oldest Hadoop data.
You cannot directly use rm -rf because these are HDFS folders:

Go to the namenode route; if it doesn't exist, follow the steps below (a CLI sketch follows the list):

- Go to the OpenShift console > Applications > Services > namenode
- Click Actions > Create Route
- Set hostname = namenode.<routerIP>

You might need to update your DNS or /etc/hosts file with the new route.
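If you prefer the command line, a hedged oc equivalent of the console steps above (project name and router IP are placeholders):

oc project <doi-project>                                    # switch to the DOI project
oc expose service namenode --hostname=namenode.<routerIP>   # creates a route for the namenode service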


Access the Hadoop namenode endpoint


The format of data is:

/opt/ca/jarvis/ao/<Tenant_id>/<doc_id>/<doc_version>/<year>/<month>/<date>/<actual data>

For example:

- You can delete the data as per your requirements; there is a delete button next to each folder
- You can delete a whole month of data, i.e. to delete March 2020
- Otherwise you can go inside each month and delete the data per date



2) How to reduce global data retention?

From:

https://techdocs.broadcom.com/content/broadcom/techdocs/us/en/ca-enterprise-software/it-operations-management/digital-operational-intelligence/1-3-2/configuring/configure-data-retention.html#concept.dita_839dd06d505c4ff53073e2aa839ba96183dbd896_ConfigureDataRetentionSpecific

"In the OpenShift Web Console,

go to the Digital Operational Intelligence project.

Go to Applications, Deployments, doireadserver.

Select Environment to view the environment variables.

Set the value of JARVIS_TENANT_RETENTION_PERIOD as needed.

Click Save."
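As a hedged CLI alternative to the console steps above (the deployment type may be dc or deployment depending on your install; 30 days is only an example value):

oc set env dc/doireadserver JARVIS_TENANT_RETENTION_PERIOD=30 -n <doi-project>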

 

3) How to reduce data retention for only specific indices, for example:
metrics_uim to 2 days
metrics_anomaly to 30 days?

 

1) Verify the number and size of indices per product; in this example, uim and anomaly:

{es_endpoint}/_cat/indices/*uim*?v&s=cd:desc&h=health,pri,rep,store.size,pri.store.size,docs.count,docs.deleted,index,cds

{es_endpoint}/_cat/indices/*anomaly*?v&s=cd:desc&h=health,pri,rep,store.size,pri.store.size,docs.count,docs.deleted,index,cds

2) Change the retention:

a) Go to Jarvis API onboarding


b) List all tenants to identify the tenant details

GET /onboarding/tenants(product_id='product_id')


c) Display tenant configuration, for example:

Use GET /onboarding/tenants(product_id='product_id', tenant_id='{tenant-id}')

We can see that the global retention across all indices for this tenant is 45 days.
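The same lookups can be done with curl against the Jarvis API; a sketch, where the Jarvis API host is a placeholder and the product_id/tenant_id values follow the example used below:

curl -s "http://<jarvis-api-endpoint>/onboarding/tenants(product_id='ao')"                                      # list tenants for product ao
curl -s "http://<jarvis-api-endpoint>/onboarding/tenants(product_id='ao',tenant_id='<yourtenant>-USERSTORE')"   # show one tenant's configuration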

 

d) Update the tenant definition as below to align with the requirements in this example:

metrics_uim to 2 days
metrics_anomaly to 30 days

PATCH /onboarding/tenants (updateTenant)


{
  "tenant_id": "<yourtenant>-USERSTORE",
  "product_id": "ao",
  "tenant_doc_type_details": [
    {
      "doc_type_id": "itoa_metrics_uim",
      "doc_type_version": "1",
      "retention_period": 2
    },
    {
      "doc_type_id": "itoa_metrics_anomaly",
      "doc_type_version": "1",
      "retention_period": 30
    }
  ],
  "retention_period": 45
}
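If you prefer curl over the onboarding UI, a sketch of the same PATCH, assuming the JSON above has been saved as tenant_update.json and the Jarvis API host is a placeholder:

curl -X PATCH "http://<jarvis-api-endpoint>/onboarding/tenants" -H "Content-Type: application/json" -d @tenant_update.json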

 

Result: HTTP 204 indicates success.

e) Verify results:

The purge task is configured in the esutils deployment; in this example, it is configured to start at 9 PM.

Once the maintenance has completed, you can verify the results by looking at:

- Elastic queries against the indices as illustrated above
- NFS directory: /var/nfs/doi/elasticsearch-data-1/indices
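For example, the space consumed by the index data can be checked directly on the NFS server; a minimal sketch using the directory above:

du -sh /var/nfs/doi/elasticsearch-data-1/indices      # total size of the Elastic index data
du -sh /var/nfs/doi/elasticsearch-data-1/indices/*    # size per index directory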

 

4) How to verify Elastic snapshot size and how to disable or reduce Elastic snapshots?

a) Identify the ES backup folder:

{es_endpoint}/_snapshot/_all?pretty

For example:



b) Find out the total space used:

Go to the Elasticsearch pod
cd /opt/ca/elasticsearch/backup
du -sh
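A hedged sketch of the same check from a workstation with the oc client; the grep pattern for the pod name is an assumption, adjust it to your environment:

ES_POD=$(oc get pods -o name | grep elasticsearch | head -1 | cut -d/ -f2)   # pick an Elasticsearch pod
oc exec "$ES_POD" -- du -sh /opt/ca/elasticsearch/backup                     # size of the snapshot folder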


c) To free up some space, delete the contents of the /opt/ca/elasticsearch/backup folder and all its subdirectories. This will not impact active Elastic data:

cd /opt/ca/elasticsearch/backup
rm -rf *

VERY IMPORTANT: only delete data from inside the pod, in the /opt/ca/elasticsearch/backup directory; no other directory should be touched.

d) To reduce the snapshot scope:

Go to the esutils deployment, Environment variables
Set "EXCLUDE_INDICES" = ao_.*

This will greatly reduce the amount of space used for backups.

IMPORTANT: If the dot (.) is missed, snapshots will keep happening. The setting uses Java-based regex, so ".*" is required; using only "*" will not work.

e) To disable snapshots, change the schedule to some future year, for example:
Set "SNAPSHOT_CRON" = 0 0 23 * * ? 2035
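A hedged oc equivalent of steps d) and e); single quotes protect the regex and the cron expression, and the deployment type may be dc or deployment in your install:

oc set env dc/esutils EXCLUDE_INDICES='ao_.*' SNAPSHOT_CRON='0 0 23 * * ? 2035' -n <doi-project>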


NOTE: You might find that some indices appear not to be deleted as expected. In the example below, cleaning took place on April 4, but anomaly_1_6 was not deleted. Why?



Explanation: the CDS column is the creation date, but by design the purge task checks when the last document was inserted into an index. Because the next rollover (to index _7) happened on April 4, the last document inserted into the _6 index was on April 4, so that index was not deleted.


5) How to delete specific indices for immediate action?

If for some reason you cannot wait for the purge maintenance and need to delete indices immediately to free up some space, execute:

curl -X DELETE http(s)://{es_endpoint}/<index_name>

For example, you have the indices below (listed with <ES_endpoint>/_cat/indices/*metrics*?s=index,cds&h=index,ss,cds):


and you want to delete the oldest indices, in this example anomaly_1_9, anomaly_1_10, uim_1_8, and uim_1_9.

You will execute: curl -X DELETE http(s)://{es_endpoint}/<index_name>

for each of the indices below:

ao_itoa_metrics_anomaly_1_9
ao_itoa_metrics_anomaly_1_10
ao_itoa_metrics_uim_1_8
ao_itoa_metrics_uim_1_9
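A short loop that issues the DELETE for each of these indices (substitute your actual Elasticsearch endpoint):

for idx in ao_itoa_metrics_anomaly_1_9 ao_itoa_metrics_anomaly_1_10 ao_itoa_metrics_uim_1_8 ao_itoa_metrics_uim_1_9; do
  curl -X DELETE "http://{es_endpoint}/$idx"
done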

Additional Information


KB 189001
DX OI - Unable to PATCH tenant using Jarvis API - "Update of Tenant in LDDS failed, please contact administrator"
https://knowledge.broadcom.com/external/article?articleId=189001
