Event Broker Service pods continually crash then restart causing Node(s) to stop and restart

Article ID: 314724

Updated On: 04-28-2025

Products

VMware Aria Suite

Issue/Introduction

Symptoms

Aria Automation requests fail with one of the following errors:

Failed to publish event to topic: Deployment resource action requested

Failed to publish event to topic: Deployment requested

Requests may not proceed past the "INITIALIZATION_IN_PROGRESS" stage, or may fail with:

INITIALIZATION_FAILED

Event Broker Service (ebs-app) pods crash and restart after running for some time.

The RabbitMQ logs located under /var/log/services-logs/prelude/rabbitmq-ha-0/file-logs/rabbitmq-ha.log contain memory resource limit alarms similar to:

2024-04-11 03:31:38.592678+00:00 [info] <0.523.0> vm_memory_high_watermark clear. Memory used:1022529536 allowed:1024000000
2024-04-11 03:31:38.593100+00:00 [warning] <0.521.0> memory resource limit alarm cleared on node 'rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local'
2024-04-11 03:31:38.593177+00:00 [warning] <0.521.0> memory resource limit alarm cleared across the cluster
2024-04-11 03:31:39.594883+00:00 [info] <0.523.0> vm_memory_high_watermark set. Memory used:1034166272 allowed:1024000000
2024-04-11 03:31:39.595221+00:00 [warning] <0.521.0> memory resource limit alarm set on node 'rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local'.

Note: On the other nodes in the Aria Automation cluster, the RabbitMQ logs are kept at:
/var/log/services-logs/prelude/rabbitmq-ha-1/file-logs/rabbitmq-ha.log
/var/log/services-logs/prelude/rabbitmq-ha-2/file-logs/rabbitmq-ha.log
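To gauge how often the alarm fires on a node, the log can be summarized with standard tools. The following is a minimal sketch, not part of the official procedure: the function names count_memory_alarms and peak_memory_used are hypothetical helpers, and the patterns assume log lines in the format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with Aria Automation) for summarizing
# RabbitMQ memory alarms. They assume the log line format shown in this article.

# Print how many times the memory alarm was set on a node in the given log file.
count_memory_alarms() {
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
    grep -c 'memory resource limit alarm set on node' "$1" || true
}

# Print the largest "Memory used:" value (in bytes) reported by
# vm_memory_high_watermark lines in the given log file, or 0 if there are none.
peak_memory_used() {
    awk '/vm_memory_high_watermark/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^used:[0-9]+$/) {
                split($i, kv, ":")
                if (kv[2] + 0 > max) max = kv[2] + 0
            }
        }
    } END { printf "%.0f\n", max }' "$1"
}
```

For example, count_memory_alarms /var/log/services-logs/prelude/rabbitmq-ha-0/file-logs/rabbitmq-ha.log prints the number of alarm events recorded on the first node. A peak value close to or above the "allowed" figure in the log suggests the limit is being reached.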

These anomalies in the RabbitMQ service might result in deployments getting stuck in "Request in Progress".

Environment

VMware Aria Automation 8.x

Cause

The default memory assignment of 1GB for the RabbitMQ component may be insufficient.
RabbitMQ was migrated from mirrored queues to quorum queues because mirrored queues were deprecated. Quorum queues, however, have higher memory requirements, so larger environments are more likely to run into this issue.

Resolution

Prerequisites

Create a snapshot of the Aria Automation nodes using a Day 2 Operation with Aria Suite Lifecycle.

Procedure: Increase VM and RabbitMQ memory allocation

Shut down the Aria Automation cluster using a Day 2 operation from Aria Suite Lifecycle.
Log in to vCenter and increase the memory of each Aria Automation virtual machine appliance by 1GB.
Power on the Aria Automation appliance nodes and SSH into one node in the cluster.
Apply the configuration by running:
vracli cluster exec -- bash -c "base64 -d <<< '/Td6WFoAAATm1rRGAgAhARYAAAB0L+Wj4AOxAXhdABGIBOkJeg/QIyaVI9J6wrAp1rezelhCpStNdFpnEPpn3HE3NIKUz/XPNckpYqB4dmL9sez8SMlRunU1o6W08AHGeZKNB1JZCgj3kL3qZoQ6LQ9wD8BNnQU8nOvkMAVON/QUWCTo//FHADweFOMd9N7vmcgk1L/CdCPO+0P5T7+hMeJggXwOh5Yfr03fCMWLPEUgUW1lAv6eDKrkYqb70lAZrfZISDKxRkYEHp60E9v5ikeGaRY+W89oDIs7hkanCRbfdUKeA4cGxWrJGF0GaRwC74G0xGMxl2DI44zOUoIvZ5cJDfDVV5zg8wc7bPjWkDS5CLFmmowMDIQ+Kp1zCGOsmWLIk/jnJEuUA/TQkliBV2vQqBZasuvKe7JslHwLiCXFY8WEk6Gkip6k774xIkNchkL27WkGCqiu8xTOw5sC3DgxX/PAXRvybkT95Lgzr+tWp95dP39iolMHfLDH7flMQlkjVS3cU8Mdhcb5ryrRSGrhP2b/7QoAOevO445x09sAAZQDsgcAAMxqNrCxxGf7AgAAAAAEWVo=' | xz -d | sh -"
Restart the Aria Automation services:
/opt/scripts/deploy.sh
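Once the services are back up, it can be useful to confirm that the ebs-app pods have stopped restarting. The following is a minimal sketch, not part of the official procedure: restarting_ebs_pods is a hypothetical helper that filters the default table output of kubectl get pods, and the prelude namespace is an assumption inferred from the log paths and hostnames shown earlier in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods` table output on stdin and prints
# "<pod name> <restart count>" for every ebs-app pod with at least one restart.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
restarting_ebs_pods() {
    awk 'NR > 1 && $1 ~ /^ebs-app/ && $4 + 0 > 0 { print $1, $4 }'
}
```

A possible invocation on an appliance node is kubectl -n prelude get pods | restarting_ebs_pods. Empty output means no ebs-app pod has restarted since it was created; running the check again after some time shows whether the restart counts are still increasing.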

Additional Information

This procedure increases RabbitMQ's memory allocation. The change persists through upgrades and restarts.
