This information describes the symptoms, cause, impact, and resolution of the issue where Aria Operations for Logs nodes become disconnected due to high root partition utilization caused by a large .hprof file.
Symptoms:
Error authenticating user
"FAILED: Unable to get user data. Possible cassandra is down"
Site can’t be reached – ERR_CONNECTION_REFUSED
When running "df -h" on the respective nodes you see that the root partition is full.
There is a ".hprof" Java heap dump with a very large file size and growing.
In the journalctl logs you see log entries about NTP time drift.
Changing the admin or root password gives the below error:
passwd: Authentication token manipulation error
passwd: password unchanged
/dev/sda5 is at 100%
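The full-root-partition symptom can be confirmed with a quick check. The snippet below is a minimal sketch, not part of the product: the function names usage_percent and check_root_usage are hypothetical, and the default threshold of 90 is an arbitrary example value.

```shell
# Print the usage percentage (without the % sign) of the filesystem
# that holds the given path; defaults to the root partition.
usage_percent() {
    # -P forces the portable output format: one line per filesystem
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report whether the root partition has reached a threshold.
# The default of 90 is an example value, not a product recommendation.
check_root_usage() {
    pct=$(usage_percent /)
    if [ "$pct" -ge "${1:-90}" ]; then
        echo "WARNING: root partition is ${pct}% full"
    else
        echo "OK: root partition is ${pct}% full"
    fi
}
```

After defining the functions in a shell on the node, run check_root_usage to see the current state of the root partition.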
Environment:
VMware Aria Operations for Logs 8.14.x and later
Cause:
A heap dump is created when a Java process encounters a problem while running. In this case, the loginsight service keeps crashing because a node suffers from time drift.
Files in the log directory can also occupy an unusually large amount of disk space. This could be due to issues with log file rotation.
Resolution:
If you cannot log in as root due to an expired password that can't be updated because the root partition is at 100%, you may need to work from single user mode.
Find the hprof file using the below command:
find / -name \*.hprof -exec ls -lah {} \;
NOTE: The command may appear to hang for a couple of minutes or more, but it should eventually complete and list any results found. The hprof files will normally be in the /usr/lib/loginsight directory.
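Because the heap dumps are normally in /usr/lib/loginsight, the search can be narrowed to that directory to avoid the long filesystem-wide scan. The following is a minimal sketch; the function name list_hprof is hypothetical.

```shell
# List .hprof files under a directory (default: /usr/lib/loginsight),
# largest first, with sizes in KiB.
# -xdev keeps find on a single filesystem; errors are suppressed.
list_hprof() {
    find "${1:-/usr/lib/loginsight}" -xdev -type f -name '*.hprof' \
        -exec du -k {} + 2>/dev/null | sort -rn
}
```

Running list_hprof with no argument searches the default directory; passing / searches the whole root filesystem, as the find command above does.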
Optional: You can copy .hprof files to your local machine before deleting them.
Then delete the .hprof file(s), if any, and restart the loginsight service by running:
service loginsight restart
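The delete step can be sketched as a small helper that removes every .hprof file under one directory and prints each path it removes. This is an illustrative sketch only; the function name remove_hprof is hypothetical, and the directory argument is deliberately required so that nothing is deleted by accident.

```shell
# Remove .hprof files under the given directory, printing each path removed.
# The directory argument is required to avoid an accidental wide delete.
remove_hprof() {
    [ -n "$1" ] || { echo "usage: remove_hprof <directory>" >&2; return 1; }
    find "$1" -xdev -type f -name '*.hprof' -print -exec rm -f -- {} \;
}

# Example (run on the affected node, then restart the service):
#   remove_hprof /usr/lib/loginsight
#   service loginsight restart
```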
NOTE: If you are in single user mode at this point due to an expired root password, continue with the remaining steps in the KB article How to reset the root password in VMware Aria Operations for Logs (formerly VMware vRealize Log Insight).
To prevent this issue from recurring, ensure that the NTP server is reachable from the node and that the clocks on all nodes in the cluster differ by no more than a couple of seconds.
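One rough way to compare clocks between nodes is to read the epoch time on each node and take the difference. The sketch below assumes SSH access between nodes. The function name clock_diff_seconds is hypothetical and node2.example.com is a placeholder hostname. The SSH round trip adds some latency, so treat the result as approximate.

```shell
# Absolute difference, in seconds, between two epoch timestamps.
clock_diff_seconds() {
    diff=$(( $1 - $2 ))
    [ "$diff" -lt 0 ] && diff=$(( -diff ))
    echo "$diff"
}

# Example: compare this node with a peer (placeholder hostname):
#   local_ts=$(date -u +%s)
#   peer_ts=$(ssh root@node2.example.com date -u +%s)
#   echo "drift: $(clock_diff_seconds "$local_ts" "$peer_ts")s"
```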
If there are no heap dump files, or if the issue is not resolved even after removing them, follow the steps below:
Go to the root directory and run the du -sh * command to see which directory is using the most space, then check whether anything in that directory can be removed.
Example: In one case, the var directory showed the largest size; in /var/log, the messages file was unusually large.
Running the "> messages" command truncated the file, after which space on the root partition was freed.
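The directory-by-directory search can be made quicker by sorting the du output so that the largest entries appear first. This is a minimal sketch; the function name largest_entries is hypothetical, and the default of 10 results is an arbitrary choice.

```shell
# Show the largest entries directly under a directory (default: /),
# sizes in KiB, largest first. Hidden entries are not matched by the glob.
largest_entries() {
    dir=${1:-/}
    dir=${dir%/}   # strip a trailing slash so "/" expands to /*
    # -s summarizes each entry, -k reports KiB, -x stays on one filesystem
    du -xsk "$dir"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example: drill down from the root, then into the largest directory:
#   largest_entries /
#   largest_entries /var/log 5
```

Repeating the call on the largest directory at each level leads to the file that is consuming the space, such as the messages file in the example above.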
Related KB(s):
The Aria Operations for Logs root partition is full