1. You see random nodes in the Aria Logs cluster spiking CPU and becoming disconnected.
2. You have rebooted the nodes and the issue persists.
3. The df -h output in the node's SSH session shows that /storage/var is 100% full.
4. Running the following command in the /storage/var directory shows that li_heapdump.hprof is consuming the most space on the partition:
du -hscx * 2>/dev/null | sort -h
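The checks in steps 3 and 4 can be combined into a small script. This is only a sketch: the function names usage_pct and list_heapdumps and the TARGET variable are hypothetical helpers introduced here, and the script assumes GNU coreutils (sort -h), as the command above already does.

```shell
#!/bin/sh
# Sketch only: helper names below are illustrative, not part of the product.

# Print the usage percentage (without the % sign) of the partition holding a path.
usage_pct() {
    # -P forces POSIX output so each filesystem stays on a single line.
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# List heap dump files under a directory with their sizes, largest last.
list_heapdumps() {
    find "$1" -xdev -type f -name '*.hprof' -exec du -h {} + 2>/dev/null | sort -h
}

# Assumed default path, taken from the symptoms above; override with TARGET=...
TARGET="${TARGET:-/storage/var}"

if [ -d "$TARGET" ]; then
    echo "Usage of partition holding $TARGET: $(usage_pct "$TARGET")%"
    echo "Heap dump files found:"
    list_heapdumps "$TARGET"
fi
```

Run it on each affected node. A usage value of 100 together with a large li_heapdump.hprof entry matches the symptoms described above.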
Aria Operations for Logs 8.18.3
When a large number of requests queue up in the cluster, indexing cannot keep up with the ingestion requests. The in-memory queue fills up, which causes a large heap dump file to be generated. This eventually fills the /storage/var partition.
When one or more partitions on the node are 100% full, services no longer have enough space to operate and eventually crash, which causes the node to lose its connection to the cluster.