PostgreSQL pod intermittently crashes in Aria Automation 8.18.x

Article ID: 399735

Updated On:

Products

VMware Aria Suite

Issue/Introduction

  • PostgreSQL pods in Aria Automation intermittently crash and restart every few days.
  • High system load and etcd heartbeat failures are observed during these events.
  • Disk usage on /data partition reaches 74%, leaving limited space for stable DB operations.

In the /services-logs/prelude/Postgres logs:

[DEBUG] connection_ping(): result is PGRES_TUPLES_OK
[DEBUG] is_server_available(): ping status for "host=postgres-0.postgres.domainname dbname=repmgr-db user=repmgr-db passfile=/run/repmgr-db.cred connect_timeout=10
keepalives=1" is PQPING_NO_RESPONSE

va#####.srv.axxxxxxxz etcd[######3]: failed to send out heartbeat on time (exceeded the 100ms timeout for 1.940391769s, to cf5f60d36079f947)
server is likely overloaded
I0408 03:13:08.363531 2418766 trace.go:236] Trace[408016628]: "Get" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:8cd3ab6b-19ef-4583-b631-6deb2dc2ba9b,client:127.0.0.1,api-group:,api-version:v1,name:ccs-gateway-tracing-config,subresource:,namespace:prelude,protocol:HTTP/2.0,resource:configmaps,scope:resource,url:/api/v1/namespaces/prelude/configmaps/ccs-gateway-tracing-config,user-agent:kubelet/v1.30.2+vmware.3 (linux/amd64) kubernetes/7f6a4bf,verb:GET (08-Apr-2025 03:13:07.260) (total time: 1081ms):
Trace[408016628]: ---"Writing http response done" 1081ms (03:13:08.341)
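
To confirm these symptoms, the following commands can be run on each appliance. This is a suggested check only: the prelude namespace and the /data mount point are the Aria Automation defaults, and the grep pattern simply matches the etcd message shown above.

# Disk usage on the /data partition
df -h /data

# PostgreSQL pod status and restart counts
kubectl get pods -n prelude | grep postgres

# Memory and load on the node
free -h
uptime

# etcd heartbeat warnings around the time of the crash
journalctl --since "2 days ago" | grep -i "failed to send out heartbeat"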

Environment

VMware Aria Automation 8.18.x

Cause

The node exceeded 90% memory usage, causing the health check to fail (make: *** [/opt/health/Makefile:73: memory-usage] Error 1), which degraded overall performance and delayed PostgreSQL pod responses. It is assumed that the current master PostgreSQL pod could not answer repmgr ping requests within the 10-second connect_timeout because of memory exhaustion, triggering an automatic failover.
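
To confirm memory exhaustion on a node, a basic memory snapshot can be taken and, assuming the health check target can be invoked standalone (the path and target name are taken from the error above), the failing check can be re-run manually:

# Memory snapshot on the node
free -h

# Re-run the health check that reported the failure (assumed to be runnable on its own)
make -C /opt/health memory-usage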

Resolution

To resolve the issue, consider one of the following options:

1. Option 1: Increase the physical RAM on each appliance by the additional amount of RAM that the custom profile requires.

Current memory usage on the appliance (free -h output):

                total        used        free      shared  buff/cache   available
Mem:            94Gi        80Gi       2.1Gi        12Gi        24Gi        13Gi
Swap:             0B          0B          0B

The standard XL profile allocates 15 GB in total for vRO and the Polyglot runner, whereas the customer's custom profile allocates 40 GB in total for them. The recommendation is therefore to increase the physical memory of all VAs to 94 GB + 40 GB = 134 GB.
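
After increasing the memory allocation of each appliance VM and powering it back on, the new total can be verified on each VA; the expected value after the change is roughly 134 GB:

# Confirm the new physical memory allocation
free -h | grep '^Mem:'
grep MemTotal /proc/meminfo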

2. Option 2: Move to external vRO.

NOTE: Custom profile customization is supported only for external vRO, not for embedded vRO, as stated in: How to Scale the Heap Memory Size of the Automation Orchestrator Server