Aria Operations for Logs nodes are rebooting intermittently

Article ID: 382001

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • Aria Operations for Logs primary or worker nodes intermittently show as disconnected, with no evidence of Cassandra corruption and no evidence of too many open files in the Cassandra logs
  • Slowness is experienced in the environment
  • You may see the error 'Failed to load queries: java.net.SocketTimeoutException: Read timed out' when navigating to the System Monitor page
  • Aria Operations for Logs is correctly sized according to the sizing guidelines: Sizing the VMware Aria Operations for Logs Virtual Appliance
  • /storage/core/loginsight/var/cassandra.log on the affected nodes contains errors similar to:
    • INFO  [GossipStage:1] YYYY-MM-DDThh:mm:ss,467 Gossiper.java:1382 - InetAddress /xxx.xx.xx.xx:7000 is now DOWN
      INFO  [Messaging-EventLoop-3-4] YYYY-MM-DDThh:mm:ss,460 NoSpamLogger.java:105 - /xxx.xx.xx.xx:7000->/xxx.xx.xx.xx:7000-URGENT_MESSAGES-[no-channel] failed to connect
      io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /xxx.xx.xx.xx:7000
      Caused by: java.net.ConnectException: Connection refused

    • ERROR [main] YYYY-MM-DDThh:mm:ss,866 LogReplicaSet.java:194 Found too many lines for, giving up
      INFO [main] YYYY-MM-DDThh:mm:ss,863 LogTransaction.java:544 - Verifying logfile transaction [nb_txn_stream_uuid.log in /storage/core/loginsight/cidata/cassandra/data/machine_learning/spock_cluster_counts-
      ERROR [main] 2025-04-25T08:36:41,866 LogFile.java:164 Failed to read records for transaction log [nb_txn_stream_3uuid.log in /storage/core/loginsight/cidata/cassandra/data/machine_learning/spock_cluster_c
      ERROR [main] 2025-04-25T08:36:41,867 LogTransaction.java:559 - Unexpected disk state: failed to read transaction log [nb_txn_stream_uuid.log in /storage/core/loginsight/cidata/cassandra/data/machine learni
      Files and contents follow:
      /storage/core/loginsight/cidata/cassandra/data/machine_learning/spock_cluster_counts-uuid/nb_txn_stream_uuid.log
      ABORT: [,0,0][737437348]
      ***This record should have been the last one in all replicas
      ADD: [/storage/core/loginsight/cidata/cassandra/data/machine_learning/spock_cluster_counts-uuid/nb-617707-big-,0,8][674825114]
      ERROR [main] YYYY-MM-DDThh:mm:ss,869 CassandraDaemon.java:900 - Cannot remove temporary or obsoleted files for machine_learning.spock_cluster_counts due to a problem with transaction log files. Please check records with problems in the 1 (END)

    • WARN  [ReadStage-2] YYYY-MM-DDThh:mm:ss,950 ReadCommand.java:605 - Read 0 live rows and 8011 tombstone cells for query SELECT * FROM machine_learning.spock_global_queries_v2 WHERE bucket = 0 LIMIT 5000 ALLOW FILTERING; token -xxxxxxxxxxxxxxxxx (see tombstone_warn_threshold)
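
To check how often these signatures occur on a node, the log can be scanned with a short script. The following is a minimal sketch in Python, not a supported tool. The regular expressions are derived only from the excerpts above, and the labels (`node_down`, `connection_refused`, and so on) and function names are illustrative assumptions.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this article.
# The label names are illustrative only.
SIGNATURES = {
    "node_down": re.compile(r"Gossiper\.java:\d+ - InetAddress \S+ is now DOWN"),
    "connection_refused": re.compile(r"Connection refused"),
    "txn_log_unreadable": re.compile(
        r"Unexpected disk state: failed to read transaction log"
    ),
    "tombstone_warning": re.compile(r"Read \d+ live rows and \d+ tombstone cells"),
}


def count_signatures(lines):
    """Count how many lines match each known signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def scan_log(path="/storage/core/loginsight/var/cassandra.log"):
    """Scan a Cassandra log file; the default path is the one named above."""
    with open(path, errors="replace") as handle:
        return count_signatures(handle)
```

On an affected node, `scan_log()` returns a `Counter` keyed by label. A steadily growing `node_down` count alongside tombstone warnings is consistent with the symptoms described here, but it does not confirm the cause on its own.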

Environment

Aria Operations for Logs 8.x

 

Cause

Although Aria Operations for Logs is sized correctly, it ingests logs from multiple sources, which produces a high variety of event types. The resulting excessive write activity on the Cassandra database causes the Cassandra services to go down.

 

Resolution

To resolve this issue:

  1. Take snapshots of all nodes in the cluster (without memory and without quiescing)
  2. Log in to Aria Operations for Logs at https://<AriaOperationsforLogs_Hostname_Or_IpAddress>/internal/config and enable the "show all settings" checkbox
  3. Locate the following entries:
    • <leo-threshold value="0.6" /> 
    • <leo-max-leaders value="75000" />
  4. Update the values to:
    • <leo-threshold value="0.7" />
    • <leo-max-leaders value="20000" />
  5. Save changes
  6. Monitor the environment. If the issue persists, increase the sizing of the cluster as described in Sizing the VMware Aria Operations for Logs Virtual Appliance
  7. If the problem continues even after resizing, contact Broadcom Support and include a reference to this KB article.
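
The change itself is made in the /internal/config page, as described in the steps above. To preview the edit on a copy of the configuration text first, the substitution in steps 3 and 4 can be sketched in Python as below. This is an illustration only: the function name `apply_leo_settings` is hypothetical, the `<config>` wrapper in the usage example is a placeholder and not the product's real root element, and the sketch assumes the pasted text is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Target values from step 4 of this article.
NEW_VALUES = {"leo-threshold": "0.7", "leo-max-leaders": "20000"}


def apply_leo_settings(xml_text, new_values=NEW_VALUES):
    """Return (updated_xml, changes) for a copy of the configuration text.

    changes maps each updated tag to an (old_value, new_value) pair.
    """
    root = ET.fromstring(xml_text)
    changes = {}
    # Search the whole tree so no assumption is made about the parent element.
    for element in root.iter():
        if element.tag in new_values:
            old_value = element.get("value")
            element.set("value", new_values[element.tag])
            changes[element.tag] = (old_value, new_values[element.tag])
    return ET.tostring(root, encoding="unicode"), changes
```

For example, calling `apply_leo_settings('<config><leo-threshold value="0.6" /><leo-max-leaders value="75000" /></config>')` reports both entries as changed. If a tag is missing from `changes`, the entry was not found in the pasted text, which may mean the "show all settings" checkbox was not enabled.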