Performance issues when ingested non-uniform logs generate a large number of event types

Article ID: 381195

Updated On:

Products

VMware Aria Suite

Issue/Introduction

Non-uniform logs ingested on the node or cluster generate a large number of event types within the Machine Learning component.

The following symptoms can be noticed:

  • High CPU usage and/or high memory usage.
  • The loginsight service restarts frequently.
  • The local-event-type-count counter has reached its default maximum of 75000.
    (The 'local-event-type-count' value can be checked at the URL <LI_HOST>/internal/statz.)
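
The counter check above could be scripted roughly as follows. This is only a sketch: the layout of the /internal/statz page is not described in this article, so the code assumes the counter name is followed, within a few non-digit characters, by its numeric value. The helper names `read_counter` and `at_limit` are hypothetical, and the sample text is made up for illustration.

```python
import re

DEFAULT_MAX_EVENT_TYPES = 75000  # default maximum quoted in this article


def read_counter(statz_text, name="local-event-type-count"):
    """Return the first integer that follows `name` in the page text, or None."""
    # Assumption: the value appears within 40 non-digit characters of the name.
    # Adjust the pattern to match the real page layout.
    match = re.search(re.escape(name) + r"\D{0,40}?(\d+)", statz_text)
    return int(match.group(1)) if match else None


def at_limit(count, limit=DEFAULT_MAX_EVENT_TYPES):
    """True when a counter value was found and has reached the limit."""
    return count is not None and count >= limit


# Illustrative input only; paste the real text of <LI_HOST>/internal/statz instead.
sample = "local-event-type-count = 75000"
count = read_counter(sample)
print(count, at_limit(count))
```

Paste the text copied from the statz page into `sample` to see whether the node has hit the limit.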

Environment

Aria Operations for Logs 8.x

Cause

Machine Learning queries consume compute resources.
When Aria Logs ingests varied, non-uniform logs, it has to generate a large number of event types, and the resulting Machine Learning queries can accumulate and degrade performance.

Resolution

Resolution:

Find the system sending the non-uniform logs to Aria Logs and limit or stop these types of events.

Workaround:

Reduce the 'leo-max-leaders' parameter to lower the number of event types that incoming log messages are grouped into. This reduces the load on the cluster in cases where the event type count reaches its limit of 75000. The side effect of this change is fuzzier grouping of events under the "Event Types" tab.

  1. Disable Machine Learning:
    1. Open https://<loginsight>/internal/config, and check the "Show all settings" checkbox.
    2. Go to the end of the XML and add the following sub-section inside the <config></config> section:
      <leo enabled="false" />
      <leo-max-leaders value="20000" />
      <rex enabled="false" />
  2. Save the configuration and close the page.
  3. Reopen the same page and confirm that your configuration is still present (it may have been moved to a different location, so use the browser's 'find' tool to search for it).
  4. Clear the Machine Learning data to help stabilise the cluster:
    1. SSH to the primary node.
    2. Run `cqlsh-no-pass` from the command line.
    3. Run the following commands in the cqlsh console:
      use machine_learning;
      truncate spock_cluster_counts;
      truncate spock_clusters;
      truncate spock_pattern_status;
      truncate spock_cluster_diffs;
      truncate spock_exclusive_tasks;
      truncate spock_patterns_v2;
      truncate spock_cluster_leases;
      truncate spock_global_queries_v2;
      truncate spock_pending_clusters;
    4. Restart the cluster nodes one by one.
  5. Re-enable Machine Learning and update the 'leo-max-leaders' parameter:
    1. Open https://<loginsight>/internal/config, and check the "Show all settings" checkbox.
    2. Go to the end of the XML and update the following sub-section in the <config></config> section:
      <leo enabled="true" />
      <leo-max-leaders value="20000" />
      <rex enabled="true" />
  6. Save the configuration and close the page.
  7. Reopen the same page and confirm that your configuration is still present (it may have been moved to a different location, so use the browser's 'find' tool to search for it).
  8. Restart all the nodes, one at a time.
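
The "make sure your configuration still exists" checks can be done by eye with the browser's find tool. As an alternative, the following sketch reads the three settings out of a copy of the configuration XML. It assumes the XML pasted from the config page is well-formed with a single root element. The function name `ml_settings` is hypothetical, and the sample document is a minimal stand-in for the real configuration.

```python
import xml.etree.ElementTree as ET


def ml_settings(config_xml):
    """Return the leo / leo-max-leaders / rex attribute values found in the XML."""
    root = ET.fromstring(config_xml)
    found = {}
    for tag, attr in (("leo", "enabled"),
                      ("leo-max-leaders", "value"),
                      ("rex", "enabled")):
        # iter() walks the whole tree, so the element is found even if
        # it was moved to a different location when the page was saved.
        element = next(root.iter(tag), None)
        found[tag] = element.get(attr) if element is not None else None
    return found


# Minimal stand-in; paste the XML copied from the config page instead.
sample = """<config>
  <leo enabled="true" />
  <leo-max-leaders value="20000" />
  <rex enabled="true" />
</config>"""
print(ml_settings(sample))
```

A value of `None` for any key means the corresponding element was not found and the setting should be added again.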