Random nodes are spiking CPU and becoming unresponsive

Article ID: 400255

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

1. Random nodes in the Aria Operations for Logs cluster spike in CPU usage and become disconnected.

2. The issue persists after the nodes are rebooted.

3. Running df -h in an SSH session on an affected node shows that /storage/var is 100% full.

4. Running the following command in the /storage/var directory shows that li_heapdump.hprof consumes most of the space on the partition:

    du -hscx * 2>/dev/null | sort -h
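
The du command above ranks everything in the directory. To list only heap dump files, a minimal sketch along the following lines may help. It is not part of the original procedure: the function name find_heapdumps is hypothetical, and the default path is simply the partition named in this article.

```shell
#!/bin/sh
# Hypothetical helper (not a product tool): list heap dump files under a
# directory with their sizes in KiB, smallest first and largest last.
# Defaults to /storage/var, the partition described in this article.
find_heapdumps() {
    dir="${1:-/storage/var}"
    # -xdev keeps find on one filesystem, mirroring the -x flag of du above.
    find "$dir" -xdev -type f -name '*.hprof' -exec du -k {} + 2>/dev/null | sort -n
}
```

For example, running find_heapdumps /storage/var on an affected node would be expected to show li_heapdump.hprof as the last (largest) entry if it is the file filling the partition.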

Environment

Aria Operations for Logs 8.18.3

Cause

When a large number of requests are queued in the cluster, indexing cannot keep up with the ingestion requests. The in-memory queue fills up, which causes a large heap dump file to be generated. This file eventually fills the /storage/var partition.

When one or more partitions on a node are 100% full, services do not have enough space to operate and eventually crash, which causes the node to lose its connection to the cluster.
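
Because the node fails once a partition is completely full, it can be useful to flag partitions that are approaching that point. The following is a rough sketch only, not a supported tool: the function name full_partitions and the default threshold of 90 are assumptions chosen for illustration. It reads the POSIX-format output of df -P and assumes mount points contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print each mount
# point whose usage is at or above a threshold percentage (default 90).
full_partitions() {
    threshold="${1:-90}"
    # In df -P output, column 5 is the capacity (e.g. "100%") and column 6 is
    # the mount point. NR > 1 skips the header line.
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= t + 0) print $6 ": " use "%"
    }'
}
```

A possible invocation is df -P | full_partitions 95, which would report /storage/var on a node in the state described above.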

Resolution