Aria Operations for Logs nodes are rebooting intermittently

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Aria Operations for Logs primary or worker nodes intermittently show as disconnected, with no evidence of Cassandra corruption and no evidence of too many open files in the Cassandra logs
Slowness is experienced in the environment
You may see the error 'Failed to load queries: java.net.SocketTimeoutException: Read timed out' when navigating to the System Monitor page
Aria Operations for Logs is correctly sized according to the sizing guidelines: Sizing the VMware Aria Operations for Logs Virtual Appliance
Errors in /storage/core/loginsight/var/cassandra.log on the affected nodes are similar to:
- INFO [GossipStage:1] YYYY-MM-DDThh:mm:ss,467 Gossiper.java:1382 - InetAddress /xxx.xx.xx.xx:7000 is now DOWN
  INFO [Messaging-EventLoop-3-4] YYYY-MM-DDThh:mm:ss,460 NoSpamLogger.java:105 - /xxx.xx.xx.xx:7000->/xxx.xx.xx.xx:7000-URGENT_MESSAGES-[no-channel] failed to connect
  io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /xxx.xx.xx.xx:7000
  Caused by: java.net.ConnectException: Connection refused
- ERROR [main] YYYY-MM-DDThh:mm:ss,866 LogReplicaSet.java:194 Found too many lines for, giving up
  INFO [main] YYYY-MM-DDThh:mm:ss,863 LogTransaction.java:544 - Verifying logfile transaction [nb_txn_stream_uuid.log in /storage/core/loginsight/cidata/cassandra/data/machine_learning/spock_cluster_counts-
  ERROR [main] 2025-04-25T08:36:41,866 LogFile.java:164 Failed to read records for transaction log [nb_txn_stream_3uuid.log in /storage/core/loginsight/cidata/cassandra/data/machine_learning/spock_cluster_c
  ERROR [main] 2025-04-25T08:36:41,867 LogTransaction.java:559 - Unexpected disk state: failed to read transaction log [nb_txn_stream_uuid.log in /storage/core/loginsight/cidata/cassandra/data/machine learni
  Files and contents follow: /storage/core/loginsight/cidata/cassandra/data/machine_learning/spock_cluster_counts-uuid/nb_txn_stream_uuid.log
  ABORT: [,0,0][737437348]
  ***This record should have been the last one in all replicas
  ADD: [/storage/core/loginsight/cidata/cassandra/data/machine_learning/spock_cluster_counts-uuid/nb-617707-big-,0,8][674825114]
  ERROR [main] YYYY-MM-DDThh:mm:ss,869 CassandraDaemon.java:900 - Cannot remove temporary or obsoleted files for machine_learning.spock_cluster_counts due to a problem with transaction log files. Please check records with problems in the 1 (END)
- WARN [ReadStage-2] YYYY-MM-DDThh:mm:ss,950 ReadCommand.java:605 - Read 0 live rows and 8011 tombstone cells for query SELECT * FROM machine_learning.spock_global_queries_v2 WHERE bucket = 0 LIMIT 5000 ALLOW FILTERING; token -xxxxxxxxxxxxxxxxx (see tombstone_warn_threshold)
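
To check quickly whether a node's cassandra.log shows the symptoms listed above, a small script can count matching lines. The following is a minimal sketch, not a supported tool: the regular expressions are derived only from the sample log lines in this article, the label names are illustrative, and the exact message wording may differ between versions.

```python
import re
from collections import Counter

# Patterns derived from the sample log lines in this article.
# The label names are illustrative, not product terminology.
SYMPTOM_PATTERNS = {
    "node_marked_down": re.compile(r"Gossiper\.java:\d+ - InetAddress \S+ is now DOWN"),
    "connection_refused": re.compile(r"Connection refused"),
    "txn_log_unreadable": re.compile(r"failed to read transaction log", re.IGNORECASE),
    "tombstone_warning": re.compile(r"Read \d+ live rows and \d+ tombstone cells"),
}


def count_symptoms(lines):
    """Return a Counter of how many lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in SYMPTOM_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def scan_log(path="/storage/core/loginsight/var/cassandra.log"):
    """Scan a log file on disk; the default path is the one named in this article."""
    with open(path, errors="replace") as handle:
        return count_symptoms(handle)
```

Running `scan_log()` on each affected node and comparing the counts gives a rough picture of which symptoms are present. The counts are per matching line, so a single stack trace can contribute more than one "connection_refused" hit.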

Environment

Aria Operations for Logs 8.x

Cause

Although Aria Operations for Logs is sized correctly, it is ingesting logs from multiple sources, resulting in a high variety of event types. This leads to excessive write activity on the Cassandra database, causing the Cassandra services to go down.

Resolution

To resolve this issue:

Take snapshots of all nodes in the cluster (without memory and without quiesce)
Log in to Aria Operations for Logs at https://<AriaOperationsforLogs_Hostname_Or_IpAddress>/internal/config and enable the "show all settings" checkbox
Locate the following entries:
- <leo-threshold value="0.6" />
- <leo-max-leaders value="75000" />
Update the values to:
- <leo-threshold value="0.7" />
- <leo-max-leaders value="20000" />
Save changes
Monitor the environment. If the issue persists, increase the sizing of the cluster; see Sizing the VMware Aria Operations for Logs Virtual Appliance
If the problem continues even after resizing, contact Broadcom Support and include a reference to this KB article.
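
The value change in the steps above can be previewed on a local copy of the configuration text before it is made in the /internal/config page. This is a minimal sketch under stated assumptions: the `<config>` wrapper element and the sample text are hypothetical, since this article shows only the two entries and not the surrounding structure. The actual change should still be made through the UI as described.

```python
import xml.etree.ElementTree as ET

# Values from the steps above: element name -> (current value, new value).
CHANGES = {
    "leo-threshold": ("0.6", "0.7"),
    "leo-max-leaders": ("75000", "20000"),
}


def apply_changes(xml_text):
    """Return (updated XML text, dict of elements that were changed).

    An element is updated only if it currently holds the value this
    article expects, so unexpected settings are left untouched.
    """
    root = ET.fromstring(xml_text)
    changed = {}
    for name, (old, new) in CHANGES.items():
        for element in root.iter(name):
            if element.get("value") == old:
                element.set("value", new)
                changed[name] = new
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical fragment: only the two inner entries come from this article.
sample = '<config><leo-threshold value="0.6" /><leo-max-leaders value="75000" /></config>'
updated, changed = apply_changes(sample)
print(changed)
```

If `changed` is missing one of the two names, the copied configuration did not contain that entry with the expected starting value, and it is worth confirming the current setting before editing it.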

Article ID: 382001