Random nodes are spiking CPU and becoming unresponsive



Article ID: 400255


Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

1. Random nodes in the Aria Operations for Logs cluster spike in CPU usage and become disconnected.

2. Rebooting the nodes does not resolve the issue.

3. In an SSH session on an affected node, the output of df -h shows that /storage/var is 100% full.

4. Running the following command from the /storage/var directory shows that li_heapdump.hprof consumes the most space on the partition:

du -hscx * 2>/dev/null | sort -h
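
As an illustration only, the checks in steps 3 and 4 could be combined into a small shell sketch. The function names usage_pct and largest_entries are hypothetical helpers, not part of the product, and the sketch assumes GNU coreutils (for sort -h) is available on the node.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the product) wrapping the
# df and du checks described in steps 3 and 4.

# Print the usage percentage (digits only) of the filesystem holding $1.
usage_pct() {
    # -P forces POSIX single-line output; column 5 is the capacity, e.g. "100%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# List the entries directly under $1 sorted by size, smallest first,
# using the same du/sort pipeline as in step 4.
largest_entries() {
    ( cd "$1" && du -hscx * 2>/dev/null | sort -h )
}

# Example usage on an affected node:
#   usage_pct /storage/var
#   largest_entries /storage/var
```

If li_heapdump.hprof appears near the bottom of the largest_entries output, just above the total line, the node matches the symptom described in step 4.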

Environment

Aria Operations for Logs 8.18.3

Cause

When a large number of requests queue up in the cluster, indexing cannot keep up with the ingestion requests and the in-memory queue fills up. This causes a large heap dump file (li_heapdump.hprof) to be generated, which eventually fills the /storage/var partition.

When one or more partitions on a node are 100% full, services do not have enough space to operate and eventually crash, which causes the node to lose its connection to the cluster.
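
Because the failure begins when a partition reaches 100%, it may help to notice a partition that is filling up before services crash. The following is a minimal sketch, not a supported tool: the function names and the default limit of 90% are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the 90% default limit is an assumed value,
# not a documented product threshold.

# Succeed (exit 0) when usage percentage $1 is at or above limit $2.
at_or_above() {
    [ "$1" -ge "$2" ]
}

# Print a warning and return 1 when the filesystem holding $1 has reached
# $2 percent usage (default 90). Return 2 if usage cannot be determined.
warn_if_nearly_full() {
    dir=$1
    limit=${2:-90}
    # -P forces POSIX single-line output; column 5 is the capacity, e.g. "97%".
    pct=$(df -P "$dir" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    if [ -z "$pct" ]; then
        echo "UNKNOWN: could not read usage for $dir"
        return 2
    fi
    if at_or_above "$pct" "$limit"; then
        echo "WARNING: $dir is ${pct}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $dir is ${pct}% full"
}

# Example usage on a node:
#   warn_if_nearly_full /storage/var 90
```

A check like this only reports the condition. The underlying cause still has to be addressed through the steps in the Resolution section.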

Resolution

Increase the amount of live storage in the cluster:
- Increase the Storage Capacity of the VMware Aria Operations for Logs Virtual Appliance
- Add a Worker Node to a VMware Aria Operations for Logs Cluster

Reduce the number of ingested logs being stored in live storage:
- Add a Log Filter Configuration

Enable data archiving in order to restore logs removed from live storage:
- Data Archiving
