vco-pods crash with out of memory errors when workflows generate a high number of audit log events in VMware Aria Automation Orchestrator


Article ID: 322687

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

This issue presents as vco-server-app pods that restart continuously. It can be further qualified by either (or both) of these symptoms:

Symptom 1:

  • The /services-logs/prelude/vco-app/console-logs/vco-server-app.log file contains out of memory errors similar to:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /usr/lib/vco/app-server/../app-server/logs/vco_Datestamp_Timestamp_heap_dump.hprof ...
Heap dump file created [4568732636 bytes in 5.664 secs]
Terminating due to java.lang.OutOfMemoryError: Java heap space.
  • When listing the pods using the kubectl -n prelude get pods command, the vco pods show a high number of restarts.

  • When querying the vco database, the vmo_clusterauditlog table contains a large count of audit logs for individual workflow execution IDs.
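The first two checks above can be scripted. The following is a minimal shell sketch and not part of the original procedure: the function names count_oom_crashes and high_restarts and the default restart threshold of 10 are hypothetical, and the sketch assumes the log wording shown above and the default kubectl column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Count JVM terminations caused by heap exhaustion in a vco-server-app log.
# Each crash writes one "Terminating due to ..." line, so lines == crashes.
count_oom_crashes() {
    grep -F -c 'Terminating due to java.lang.OutOfMemoryError' "$1"
}

# Read `kubectl -n prelude get pods` output on stdin and print the name and
# restart count of every vco pod whose RESTARTS value exceeds the threshold.
# $1 is the threshold (hypothetical default: 10).
high_restarts() {
    awk -v min="${1:-10}" 'NR > 1 && $1 ~ /^vco/ && $4 + 0 > min + 0 { print $1, $4 }'
}

# Usage on an appliance node:
#   count_oom_crashes /services-logs/prelude/vco-app/console-logs/vco-server-app.log
#   kubectl -n prelude get pods | high_restarts 10
```

Both functions only read their input; neither one changes anything on the appliance.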

Symptom 2:

The vco-app pod in a VMware Aria Automation environment that uses the embedded Aria Orchestrator (vRO) consistently restarts or fails its startup probe. The pod keeps restarting, which prevents new workflows from running reliably.

When running the following to review the events of the pod:

kubectl describe pod vco-app-###### -n prelude

You may see the following error in the events:

Startup probe failed: Get "http://###.###.###.###:8280/vco/api/healthstatus?startupProbe=true": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Environment

VMware Aria Orchestrator 8.x

Cause

The issue can occur when workflows generate an abnormally large number of audit log events.

Resolution

To determine whether you are hitting this issue, query the vmo_clusterauditlog table in the vCO database for workflow executions that have generated a large number of audit logs.

Caution: These steps execute SQL commands directly against the internal vRO database. Always ensure a successful backup or snapshot of the Aria Automation appliance is available before proceeding.

Back up your environment:

  1. You must back up all VMware Aria Automation or Orchestrator appliances simultaneously, across all nodes.
  2. If you are making the snapshots manually, you must start the snapshots of the second and the third node not more than 40 seconds after you start the snapshots for the first node.
  3. If the quiesced state was not achieved for all 3 nodes within the ~40 seconds time frame, the following strings will be found in the logs: "Freeze synchronization failed" and "Sync failed, making inconsistent snapshot".
  4. Run the following command from one of the nodes to filter for all vmtoolsd messages: journalctl --identifier=vmtoolsd
  5. When you back up the VMware Aria Automation or Orchestrator appliance, disable in-memory snapshots and enable quiescing. 
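The log check in steps 3 and 4 can be sketched as a small shell filter. This is an illustration under assumptions rather than a supported tool: the function name check_quiesce is hypothetical, and it searches only for the two failure strings quoted above.

```shell
# Read vmtoolsd journal output on stdin and look for the two
# quiesce-failure strings. Matching lines are printed to stdout.
# Returns 1 if a failure string is found, 0 otherwise.
check_quiesce() {
    if grep -E 'Freeze synchronization failed|Sync failed, making inconsistent snapshot'; then
        echo 'inconsistent snapshot detected' >&2
        return 1
    fi
    echo 'no quiesce failures found'
}

# Usage on an appliance node:
#   journalctl --identifier=vmtoolsd | check_quiesce
```

A non-zero return suggests the snapshot set should be retaken before any database changes are made.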

Validate and resolve: 

  1. SSH to the Aria Orchestrator appliance and log in as the root user.
  2. To connect to the vCO database:
    vracli dev psql vco-db
    Type yes when prompted.
  3. To count audit logs per workflow run, execute the following SELECT query:
    SELECT COUNT(eventdata) AS occurrences, eventdata
    FROM vmo_clusterauditlog
    GROUP BY eventdata
    ORDER BY COUNT(eventdata) DESC;
  4. If the count returned for any workflow execution ID is higher than 10,000, consider removing the audit logs for that execution:
    DELETE FROM vmo_clusterauditlog
    WHERE eventdata
    IN ('execution-id-1', 'execution-id-2', ...);

    Replace execution-id-1 and execution-id-2 with the execution IDs identified in step 3.
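When step 3 returns many rows, the DELETE statement for step 4 can be assembled rather than typed by hand. The sketch below is an assumption-laden helper, not part of the documented procedure: the function name build_delete is hypothetical, and it expects the query results as pipe-separated "count|eventdata" rows, which is what psql produces in unaligned, tuples-only mode (\pset format unaligned and \t on). It only prints the statement so that it can be reviewed before being run in the psql session.

```shell
# Read "count|eventdata" rows on stdin and print one DELETE statement
# covering every execution ID whose count exceeds the threshold.
# $1 is the threshold (default 10000, the figure used in step 4).
# Nothing is executed against the database.
build_delete() {
    awk -F'|' -v limit="${1:-10000}" -v q="'" '
        $1 + 0 > limit + 0 {
            id = $2
            gsub(q, q q, id)                       # escape single quotes for SQL
            ids = ids (ids == "" ? "" : ", ") q id q
        }
        END {
            if (ids != "")
                print "DELETE FROM vmo_clusterauditlog WHERE eventdata IN (" ids ");"
        }'
}

# Usage, with the query results saved to a file (for example via \o in psql):
#   build_delete 10000 < audit_counts.txt
```

If no row exceeds the threshold, the function prints nothing.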

If you see out of memory errors but the query in step 3 does not return a high count, consider increasing the default Aria Orchestrator Java heap memory.

 

Additional Information

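Japanese version: VMware Aria Automation Orchestrator において、ワークフローが大量の監査ログイベントを生成すると、vco-pods がメモリ不足エラーでクラッシュする (English: "vco-pods crash with out of memory errors when workflows generate a large number of audit log events in VMware Aria Automation Orchestrator")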