Aria Operations for Logs /storage/core partition is 100% full on nodes, causing the Cassandra DB to fail to start

Article ID: 301461


Updated On:

Products

VMware Aria Suite

Issue/Introduction

  • Aria Operations for Logs deletes old buckets when the available space on the /storage/core partition drops below 3%. Deletion follows a FIFO model (oldest buckets first). This partition should never reach 100% because Aria Operations for Logs manages it.
  • If /storage/core has already reached 100%, it has the potential to take the node offline.
    • Upon checking the file system free space, you might see output similar to the following:

      Filesystem             Size  Used  Avail Use% Mounted on
      /dev/sda3               16G  2.4G    13G  16% /
      udev                   7.9G  112K   7.9G   1% /dev
      tmpfs                  7.9G  648K   7.9G   1% /dev/shm
      /dev/sda1              128M   38M    84M  31% /boot
      /dev/mapper/data-var    20G  7.3G    12G  39% /storage/var
      /dev/mapper/data-core  483G  483G      0 100% /storage/core
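To confirm the current usage on a node, you can check the /storage/core mount point directly. The commands below are a minimal sketch; the du path is the standard Aria Operations for Logs bucket store directory referenced later in this article.

      # Show usage of the core partition only
      df -h /storage/core
      # Show the size of the bucket store (read-only inspection)
      du -sh /storage/core/loginsight/cidata/store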



Environment

VMware Aria Operations for Logs 8.x
VMware vRealize Log Insight 4.x

Cause

The /storage/core partition reaching 100% is often caused by issues with the NFS archive configuration or excessively high log ingestion rates.

While the current NFS setup may appear correct, it does not guarantee uninterrupted archiving. Prolonged NFS unavailability (over 10 minutes) or a full archive store can interrupt the archiving process, leading to degraded Log Insight performance or a complete service outage, as documented. To prevent such issues, regular and proactive NFS maintenance is crucial.
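As a quick health check, verify from each node that the NFS archive target is mounted and has free space. The sketch below uses a placeholder mount point; substitute the archive path configured for your deployment.

      # List NFS mounts currently visible to the node
      mount | grep -i nfs
      # Check free space on the archive target (replace with your configured archive mount)
      df -h /path/to/archive/mount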

Resolution

NOTE: Make sure to take a snapshot of the Aria Operations for Logs nodes before proceeding.
 
Important: Before starting with the steps below, validate the stored buckets.

Steps:
  1. Run the command below on both affected nodes.

Note: The command might take a while to finish running. The results will be saved in the /tmp/validate.txt file:

cd /usr/lib/loginsight/application/sbin
./validate-bucket --validate > /tmp/validate.txt

      2. Once the validation is complete, review /tmp/validate.txt and make sure there are no corrupted buckets. If corrupted buckets are found, follow Phase 1 of the action plan; if there are none, follow Phase 2.
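To scan the results quickly, you can search the output file for likely failure keywords. This is a sketch; the exact wording validate-bucket uses to report a bad bucket is an assumption, so adjust the pattern to match what appears in your file.

      # Flag lines that suggest a bucket failed validation (pattern is an assumption)
      grep -iE 'corrupt|fail|error' /tmp/validate.txt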


Phase 1:

1. Stop Log Insight:
     /etc/init.d/loginsight stop
2. List the buckets and identify the GUIDs of the oldest buckets by timestamp:
    /usr/lib/loginsight/application/sbin/bucket-index show
3. Run the following command to delete a bucket:
    /usr/lib/loginsight/application/sbin/bucket-index delete [BUCKET-ID]
4. Start Log Insight:
   /etc/init.d/loginsight start

The "./validate-bucket --validate" would reveal if there are any corrupted buckets remove those instead of looking for the oldest buckets.

Phase 2:

1. Deploy a new Aria Operations for Logs node following the steps outlined in this document.

2. Once you have the new Aria Operations for Logs node deployed, the next step is to import the buckets from the old node into the new one without losing any data. The steps are below; a consolidated sketch of the copy-and-import commands follows this list.

    a. SSH to the Log Insight node you want to import events from and go to “/storage/core/loginsight/cidata/store” – this is where the data buckets live.
    b. Run “service loginsight stop” to stop Log Insight. Ensure the service has stopped by running “service loginsight status”.
    c. Copy the buckets you want to import to the target Log Insight node; the destination directory must be the same, i.e. “/storage/core/loginsight/cidata/store”.
    d. SSH to the Log Insight node that will be importing the events and stop the service by running “service loginsight stop”. Ensure the service has stopped by running “service loginsight status”.
    e. Run “/usr/lib/loginsight/application/sbin/bucket-index add <bucket_id>”.
    f. Repeat the step above for all the copied buckets.

3. Ensure you complete these steps for both current worker nodes, importing their buckets into the new nodes.
4. Once the above is complete, join the two new nodes to the existing deployment by following this document.
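The commands below are a rough sketch of steps 2c–2f, assuming SSH access between nodes and that each bucket is a directory under the store path. NEW_NODE_IP and BUCKET-ID are placeholders.

      # On the source node (service already stopped): copy a bucket to the same path on the new node
      scp -r /storage/core/loginsight/cidata/store/BUCKET-ID root@NEW_NODE_IP:/storage/core/loginsight/cidata/store/
      # On the new node (service stopped): register each copied bucket
      /usr/lib/loginsight/application/sbin/bucket-index add BUCKET-ID
      # Start the service once all buckets are registered
      service loginsight start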

Additional Information

Impact/Risks:
If /storage/core has already reached 100%, it can take the node offline.