Aria Operations for Logs nodes are rebooting intermittently

Article ID: 382001

Updated On:

Products

VMware Aria Suite

Issue/Introduction

  • Aria Operations for Logs primary or worker nodes intermittently show as disconnected, with no evidence of Cassandra corruption and no evidence of too many open files in the Cassandra logs
  • Slowness is experienced in the environment
  • You may see the error 'Failed to load queries: java.net.SocketTimeoutException: Read timed out' when navigating to the System Monitor page
  • Aria Operations for Logs is correctly sized according to the sizing guidelines: Sizing the VMware Aria Operations for Logs Virtual Appliance
  • Errors in /storage/core/loginsight/var/cassandra.log on the affected nodes are similar to:
    • INFO  [GossipStage:1] 2024-09-02T08:59:10,467 Gossiper.java:1382 - InetAddress /xxx.xx.xx.xx:7000 is now DOWN
      INFO  [Messaging-EventLoop-3-4] 2024-09-02T08:59:11,460 NoSpamLogger.java:105 - /xxx.xx.xx.xx:7000->/xxx.xx.xx.xx:7000-URGENT_MESSAGES-[no-channel] failed to connect
      io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /xxx.xx.xx.xx:7000
      Caused by: java.net.ConnectException: Connection refused
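
To gauge how often the gossip failures above occur, the log can be summarized per peer address. The following is a minimal Python sketch, not a Broadcom-provided tool: the function name `summarize_gossip_failures` and both patterns are illustrative assumptions derived only from the sample lines shown in this article, so they may need adjusting if your log format differs.

```python
import re
from collections import Counter

# Patterns derived from the sample log lines in this article (assumption:
# the real log uses the same wording).
DOWN_RE = re.compile(r"InetAddress (\S+) is now DOWN")
REFUSED_RE = re.compile(r"Connection refused: (\S+)")


def summarize_gossip_failures(lines):
    """Count 'is now DOWN' and 'Connection refused: <peer>' events per peer."""
    down = Counter()
    refused = Counter()
    for line in lines:
        match = DOWN_RE.search(line)
        if match:
            down[match.group(1)] += 1
            continue
        match = REFUSED_RE.search(line)
        if match:
            refused[match.group(1)] += 1
    return down, refused
```

To run it on a node, open /storage/core/loginsight/var/cassandra.log and pass the file object as `lines`. Repeated DOWN events for the same peer over a short period are consistent with the symptom described here.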

Environment

Aria Operations for Logs 8.x

 

Cause

Aria Operations for Logs is sized correctly, but it is ingesting logs from multiple sources, which produces many different event types. This causes too many writes to the Cassandra database, and the Cassandra services go down as a result.

 

Resolution

  1. Take a snapshot of all nodes in the cluster, without memory and without quiescing
  2. Sign in to Aria Operations for Logs at https://<AriaOperationsforLogs_Hostname_Or_IpAddress>/internal/config and select the Show all settings checkbox
  3. Locate the entries:
    • <leo-threshold value="0.6" /> 
    • <leo-max-leaders value="75000" />
  4. Change these values to:
    • <leo-threshold value="0.7" />
    • <leo-max-leaders value="20000" />
  5. Save changes
  6. Monitor the environment. If the issue persists, increase the sizing of the cluster; see Sizing the VMware Aria Operations for Logs Virtual Appliance
  7. If the issue continues, contact Broadcom Support and provide a link to this KB article
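
Before and after saving, it can help to confirm that both entries hold the intended values. The following is a minimal Python sketch that works on configuration text copied from the /internal/config page. It does not call any product API, and the names `TARGETS`, `read_setting`, and `pending_changes` are hypothetical helpers introduced for illustration. The sketch assumes the entries appear in the self-closing form shown in steps 3 and 4.

```python
import re

# Target values from step 4 of this article.
TARGETS = {"leo-threshold": "0.7", "leo-max-leaders": "20000"}


def read_setting(config_text, name):
    """Return the value attribute of <name value="..." />, or None if absent."""
    pattern = r"<" + re.escape(name) + r'\s+value="([^"]*)"\s*/>'
    match = re.search(pattern, config_text)
    return match.group(1) if match else None


def pending_changes(config_text):
    """Map each setting that differs from its target to (current, target)."""
    changes = {}
    for name, target in TARGETS.items():
        current = read_setting(config_text, name)
        if current != target:
            changes[name] = (current, target)
    return changes
```

An empty result from `pending_changes` means both entries already match step 4. A current value of `None` means the entry was not found in the pasted text, for example because Show all settings was not selected.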