Aria Operations for Logs /storage/core partition is 100% full on nodes, causing the Cassandra DB to fail to start

Article ID: 301461


Updated On:

Products

VMware Aria Suite

Issue/Introduction

  • Aria Operations for Logs deletes old buckets when the available space on the /storage/core partition drops below 3%. Deletion follows a FIFO model (oldest buckets first). This partition should never reach 100% because Aria Operations for Logs manages it.
  • If /storage/core has already reached 100%, it has the potential to take the node offline.
    • Upon checking the file system free space, you might see output similar to the following:

      Filesystem             Size  Used  Avail Use% Mounted on
      /dev/sda3               16G  2.4G    13G  16% /
      udev                   7.9G  112K   7.9G   1% /dev
      tmpfs                  7.9G  648K   7.9G   1% /dev/shm
      /dev/sda1              128M   38M    84M  31% /boot
      /dev/mapper/data-var    20G  7.3G    12G  39% /storage/var
      /dev/mapper/data-core  483G  483G      0 100% /storage/core
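To confirm the current usage on a node, you can check the /storage/core mount point directly. The commands below are a minimal sketch; the du path is the standard Aria Operations for Logs bucket store directory referenced later in this article.

      # Show usage of the core partition only
      df -h /storage/core
      # Show the size of the bucket store (read-only inspection)
      du -sh /storage/core/loginsight/cidata/store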



Environment

VMware Aria Operations for Logs 8.x
VMware vRealize Log Insight 4.x

Cause

The /storage/core partition reaching 100% is often caused by issues with the NFS archive configuration or excessively high log ingestion rates.

While the current NFS setup may appear correct, it does not guarantee uninterrupted archiving. Prolonged NFS unavailability (over 10 minutes) or a full archive store can interrupt the archiving process, leading to degraded Log Insight performance or a complete service outage, as documented. To prevent such issues, regular and proactive NFS maintenance is crucial.
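As a quick health check, verify from each node that the NFS archive target is mounted and has free space. The sketch below uses a placeholder mount point; substitute the archive path configured for your deployment.

      # List NFS mounts currently visible to the node
      mount | grep -i nfs
      # Check free space on the archive target (replace with your configured archive mount)
      df -h /path/to/archive/mount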

Resolution

NOTE: Make sure to take a snapshot of the Aria Operations for Logs nodes before proceeding.
 
Important: Before starting with the steps below, validate the stored buckets.

Steps:
  1. Run the command below on both affected nodes.

Note: The command might take a while to finish running. The results will be saved in the /tmp/validate.txt file:

cd /usr/lib/loginsight/application/sbin
./validate-bucket --validate > /tmp/validate.txt

      2. Once the validation is complete, review /tmp/validate.txt and make sure there are no corrupted buckets. If corrupted buckets are found, follow Phase 1 of the action plan; if there are none, follow Phase 2.
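To scan the results quickly, you can search the output file for likely failure keywords. This is a sketch; the exact wording validate-bucket uses to report a bad bucket is an assumption, so adjust the pattern to match what appears in your file.

      # Flag lines that suggest a bucket failed validation (pattern is an assumption)
      grep -iE 'corrupt|fail|error' /tmp/validate.txt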


Phase 1:

1. Stop Log Insight:
     /etc/init.d/loginsight stop
2. List the buckets and identify the GUIDs of the oldest buckets by timestamp:
    /usr/lib/loginsight/application/sbin/bucket-index show
3. Run the following command to delete a bucket:
    /usr/lib/loginsight/application/sbin/bucket-index delete [BUCKET-ID]
4. Start Log Insight:
   /etc/init.d/loginsight start

The "./validate-bucket --validate" would reveal if there are any corrupted buckets remove those instead of looking for the oldest buckets.

Phase 2:

1. Deploy a new Aria Operations for Logs node following the steps outlined in this document.

2. Once you have the new Aria Operations for Logs node deployed, the next step is to import the buckets from the old node into the new one without losing any data. The steps are below; a consolidated sketch of the copy-and-import commands follows this list.

    a. SSH to the Log Insight node you want to import events from and go to “/storage/core/loginsight/cidata/store” – this is where the data buckets live.
    b. Run “service loginsight stop” to stop Log Insight. Ensure the service has stopped by running “service loginsight status”.
    c. Copy the buckets you want to import to the target Log Insight node; the destination directory must be the same, i.e. “/storage/core/loginsight/cidata/store”.
    d. SSH to the Log Insight node that will be importing the events and stop the service by running “service loginsight stop”. Ensure the service has stopped by running “service loginsight status”.
    e. Run “/usr/lib/loginsight/application/sbin/bucket-index add <bucket_id>”.
    f. Repeat the step above for all the copied buckets.

3. Ensure you complete these steps for both current worker nodes, importing their buckets into the new nodes.
4. Once the above is complete, join the two new nodes to the existing deployment by following this document.
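The commands below are a rough sketch of steps 2c–2f, assuming SSH access between nodes and that each bucket is a directory under the store path. NEW_NODE_IP and BUCKET-ID are placeholders.

      # On the source node (service already stopped): copy a bucket to the same path on the new node
      scp -r /storage/core/loginsight/cidata/store/BUCKET-ID root@NEW_NODE_IP:/storage/core/loginsight/cidata/store/
      # On the new node (service stopped): register each copied bucket
      /usr/lib/loginsight/application/sbin/bucket-index add BUCKET-ID
      # Start the service once all buckets are registered
      service loginsight start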

Additional Information

Impact/Risks:
If /storage/core has already reached 100%, it can take the node offline.