Aria Operations for Logs shows disconnected and has 100% utilization on root partition

Article ID: 312243

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

This article describes the symptoms, cause, impact, and resolution of an issue where Aria Operations for Logs nodes become disconnected because a large .hprof file fills the root partition.

Symptoms:

  • One or more nodes show as disconnected in the UI.
  • Login to the UI fails with a message that suggests a password problem, such as "Error authenticating user".
  • Resetting the admin password from a root SSH session fails with an error similar to: FAILED: Unable to get user data. Possible cassandra is down
  • The UI becomes inaccessible after a reboot.
  • Aria Operations for Logs is unreachable.
  • When you browse to the login page, no page is displayed and you may see the message:

    Site can’t be reached – ERR_CONNECTION_REFUSED

  • The output of "df -h" on the affected nodes shows that the root partition is full.
  • The file filling the partition is a ".hprof" Java heap dump that is very large and still growing.
  • The journalctl logs contain entries about NTP time drift.
  • Attempts to reset the admin or root password fail with the error below:

    passwd: Authentication token manipulation error
    passwd: password unchanged

    /dev/sda5 is at 100%
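
The root partition check described in the symptoms can be scripted. The following is a minimal sketch, assuming a POSIX shell with df and awk available. The fs_usage_pct helper and the 95% threshold are hypothetical, chosen for illustration only, and are not part of the product.

```shell
# Print the usage percentage (digits only) of the filesystem holding a path.
# Defaults to "/". The -P option forces the one-line-per-filesystem POSIX format.
fs_usage_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical alert threshold for illustration; not taken from the article.
THRESHOLD=95

usage=$(fs_usage_pct /)
if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "root partition at ${usage}%: at or above ${THRESHOLD}%"
else
    echo "root partition at ${usage}%"
fi
```

Running this on each node gives a quick yes/no answer before looking for the file that is consuming the space.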

Environment

VMware Aria Operations for Logs 8.14.x and later

Cause

  • The heap dump is created when the Java process encounters a serious problem while running. In this case, the node suffers from time drift, which causes the loginsight service to keep crashing and write the heap dump.

  • File(s) in the log directory occupy an unusually large amount of disk space. This could be due to issues with log file rotation.
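
Because time drift is named as the trigger, it can help to compare clocks between nodes. The following is a rough sketch, assuming each node's time is read as an epoch timestamp with `date +%s`. The helper names and the 2-second default are illustrative assumptions (one reading of "a couple of seconds"), not a documented limit.

```shell
# Print the absolute difference, in seconds, between two epoch timestamps.
drift_seconds() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then d=$(( 0 - d )); fi
    echo "$d"
}

# Succeed when the drift is at most max seconds (third argument, default 2).
# The default is an illustrative assumption, not a documented limit.
within_tolerance() {
    [ "$(drift_seconds "$1" "$2")" -le "${3:-2}" ]
}

# Worked example with fixed timestamps; on real nodes each value
# would come from running `date +%s` on that node.
drift_seconds 1700000000 1700000042    # prints 42
```

On a cluster, the first argument could be the local `date +%s` and the second the same command run on a peer node over SSH.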

Resolution

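  • Log in as root to each Aria Operations for Logs appliance via SSH. If you cannot log in as root because the password has expired and cannot be updated while the root partition is at 100%, see the related KB "How to reset the root password in VMware Aria Operations for Logs (formerly VMware vRealize Log Insight)".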

  • Find the .hprof file(s) using the command below:

    • find / -name \*.hprof -exec ls -lah {} \;

      NOTE: The command may appear to hang for a couple of minutes or more, but it should eventually complete and list any results found. The .hprof files are normally in the /usr/lib/loginsight directory.
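
As a variation on the search above, the sketch below limits the scan to one directory tree and sorts the results by size. It assumes a find that supports -xdev and a du that supports -k. The list_hprof helper is a hypothetical name introduced here.

```shell
# List *.hprof files under a directory, largest first, as "<size-KiB><TAB><path>".
# -xdev keeps find on the starting filesystem, so a scan from / skips other mounts.
list_hprof() {
    find "${1:-/}" -xdev -type f -name '*.hprof' -exec du -k {} + 2>/dev/null | sort -rn
}

# Start with the directory the note above names as the usual location.
list_hprof /usr/lib/loginsight
```

Starting in /usr/lib/loginsight usually returns faster than scanning from /. Calling list_hprof with no argument falls back to the root filesystem.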

Optional: You can copy .hprof files to your local machine before deleting them.

  • Then delete the .hprof file(s), if any, and restart the loginsight service.

  • Validate that you can now access the Aria Operations for Logs application.
  • To prevent the issue from recurring, ensure that the NTP server is reachable from each node and that the clocks of all nodes in the cluster differ by no more than a couple of seconds.
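
The delete-and-restart step might look like the sketch below. The remove_hprof helper is hypothetical. The `systemctl restart loginsight` line is an assumption based on the service name used in this article, since the article does not show the exact restart command. Copy off any heap dumps you want to keep before running anything like this.

```shell
# Delete every *.hprof file under a directory and print each removed path.
# -xdev keeps find on the starting filesystem.
remove_hprof() {
    find "${1:?directory required}" -xdev -type f -name '*.hprof' -print -exec rm -f {} +
}

# Example run against the usual heap dump location (left commented out on purpose):
#   remove_hprof /usr/lib/loginsight
#   systemctl restart loginsight    # assumed restart command; not shown in the article
#   df -h /                         # confirm that space was freed
```

The destructive calls are left as comments so that nothing is removed until the file list has been reviewed.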

If there are no heap dump files, or if the issue persists after removing them, follow the steps below:

Go to the root directory (/) and run du -sh * to see which directory is using the most space, then check whether anything in that directory can be removed.

Example: In one case, the var directory showed the largest size, and within /var/log the messages file was unusually large.

Running "> messages" to truncate the file freed up space on the root partition.
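
The drill-down and truncation can be sketched as two small helpers. Both names are hypothetical. The sketch assumes a du that supports -x, -s, and -k, and it sorts numerically in KiB because human-readable sizes from -h do not sort reliably.

```shell
# Show the ten largest entries directly under a directory, in KiB, largest first.
# -x keeps du on one filesystem, so separately mounted partitions are not counted.
# "${1%/}" strips a trailing slash, so "/" (or no argument) expands to /*.
largest_entries() {
    du -xsk "${1%/}"/* 2>/dev/null | sort -rn | head -n 10
}

# Empty a file in place without removing it, so a process that holds
# the file open keeps writing to the same file.
truncate_file() {
    : > "${1:?file required}"
}

# Example drill-down matching the case described above (left commented out):
#   largest_entries /
#   largest_entries /var/log
#   truncate_file /var/log/messages
```

Truncating rather than deleting is the approach used in the example above. Review the contents first if the file may be needed for troubleshooting.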

Additional Information

Impact/Risks:
  • Aria Operations for Logs nodes become unavailable, which affects the collection and analysis of logs.
  • High disk utilization can lead to performance degradation and potential data loss.

Related KB(s):
The Aria Operations for Logs root partition is full

How to reset the root password in VMware Aria Operations for Logs (formerly VMware vRealize Log Insight)