Event Broker Service pods continually crash then restart causing Node(s) to stop and restart

Article ID: 314724

Updated On:

Products

VMware Aria Suite

Issue/Introduction

Symptoms

  • Aria Automation requests fail with one of the following errors:

Failed to publish event to topic: Deployment resource action requested

Failed to publish event to topic: Deployment requested

  • Requests may not proceed past the "INITIALIZATION_IN_PROGRESS" stage, or may end in the following state:

INITIALIZATION_FAILED

  • Event Broker Service (ebs-app) pods crash and restart repeatedly after running for some time.
  • The RabbitMQ logs located under /var/log/services-logs/prelude/rabbitmq-ha-0/file-logs/rabbitmq-ha.log contain memory resource limit alarms similar to:
    2024-04-11 03:31:38.592678+00:00 [info] <0.523.0> vm_memory_high_watermark clear. Memory used:1022529536 allowed:1024000000
    2024-04-11 03:31:38.593100+00:00 [warning] <0.521.0> memory resource limit alarm cleared on node 'rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local'
    2024-04-11 03:31:38.593177+00:00 [warning] <0.521.0> memory resource limit alarm cleared across the cluster
    2024-04-11 03:31:39.594883+00:00 [info] <0.523.0> vm_memory_high_watermark set. Memory used:1034166272 allowed:1024000000
    2024-04-11 03:31:39.595221+00:00 [warning] <0.521.0> memory resource limit alarm set on node 'rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local'.

Note: On the other nodes in the Aria Automation cluster, the RabbitMQ logs are kept at:
/var/log/services-logs/prelude/rabbitmq-ha-1/file-logs/rabbitmq-ha.log
/var/log/services-logs/prelude/rabbitmq-ha-2/file-logs/rabbitmq-ha.log
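To gauge how often the alarm fires and by how much the limit is exceeded, the "vm_memory_high_watermark set" lines can be summarized with a small helper. This is a minimal sketch, not part of the official procedure: the function name rabbitmq_memory_alarms is hypothetical, and it assumes the log lines keep the format shown in the sample above.

```shell
# Summarize RabbitMQ memory high-watermark "set" events from one or more log files.
# For each event, print the timestamp, bytes used, bytes allowed, and the overage.
# Assumes the "Memory used:<n> allowed:<n>" format shown in the sample log lines.
rabbitmq_memory_alarms() {
  awk '/vm_memory_high_watermark set/ {
    used = ""; allowed = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^used:/)    { used = $i;    sub(/^used:/, "", used) }
      if ($i ~ /^allowed:/) { allowed = $i; sub(/^allowed:/, "", allowed) }
    }
    printf "%s %s used=%s allowed=%s over=%d\n", $1, $2, used, allowed, used - allowed
  }' "$@"
}

# Example, run on the node that hosts rabbitmq-ha-0:
#   rabbitmq_memory_alarms /var/log/services-logs/prelude/rabbitmq-ha-0/file-logs/rabbitmq-ha.log
```

For the sample "set" line above, the overage is 1034166272 - 1024000000 = 10166272 bytes, which is roughly 10 MB over the 1 GB limit.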

Environment

  • VMware Aria Automation 8.x

Cause

  • The default 1 GB memory allocation for the RabbitMQ component may be insufficient.
  • RabbitMQ was migrated from mirrored queues, which are deprecated, to quorum queues. Quorum queues require more memory, so larger environments are more likely to encounter this issue.

Resolution

Prerequisites

  • Create a snapshot of the Aria Automation nodes using a Day 2 Operation with Aria Suite Lifecycle.

Procedure: Increase VM and RabbitMQ memory allocation

  1. Shut down the Aria Automation cluster using a Day 2 operation from Aria Suite Lifecycle.
  2. Log in to vCenter and increase the memory of each Aria Automation virtual machine appliance by 1 GB.
  3. Power on the Aria Automation appliance nodes and SSH into one node in the cluster.
  4. Apply the configuration by running:
    vracli cluster exec -- bash -c "base64 -d <<< '/Td6WFoAAATm1rRGAgAhARYAAAB0L+Wj4AOxAXhdABGIBOkJeg/QIyaVI9J6wrAp1rezelhCpStNdFpnEPpn3HE3NIKUz/XPNckpYqB4dmL9sez8SMlRunU1o6W08AHGeZKNB1JZCgj3kL3qZoQ6LQ9wD8BNnQU8nOvkMAVON/QUWCTo//FHADweFOMd9N7vmcgk1L/CdCPO+0P5T7+hMeJggXwOh5Yfr03fCMWLPEUgUW1lAv6eDKrkYqb70lAZrfZISDKxRkYEHp60E9v5ikeGaRY+W89oDIs7hkanCRbfdUKeA4cGxWrJGF0GaRwC74G0xGMxl2DI44zOUoIvZ5cJDfDVV5zg8wc7bPjWkDS5CLFmmowMDIQ+Kp1zCGOsmWLIk/jnJEuUA/TQkliBV2vQqBZasuvKe7JslHwLiCXFY8WEk6Gkip6k774xIkNchkL27WkGCqiu8xTOw5sC3DgxX/PAXRvybkT95Lgzr+tWp95dP39iolMHfLDH7flMQlkjVS3cU8Mdhcb5ryrRSGrhP2b/7QoAOevO445x09sAAZQDsgcAAMxqNrCxxGf7AgAAAAAEWVo=' | xz -d | sh -"
  5. Restart the Aria Automation services:
    /opt/scripts/deploy.sh
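The command in step 4 decodes a base64 string, decompresses it with xz, and pipes the result to sh. To review what will run before applying it, the same decode steps can be used without the final "| sh -". The sketch below assumes base64 and xz are available, as they must be for step 4 to work; the function name show_encoded_script is hypothetical.

```shell
# Decode a base64-encoded, xz-compressed script and print it to stdout.
# Reads the encoded text on stdin; nothing is executed.
show_encoded_script() {
  base64 -d | xz -d
}

# Example: paste the quoted string from step 4 in place of the placeholder.
#   printf '%s' '<base64 string from step 4>' | show_encoded_script
```

Because the output is only printed, this can be run safely on any node before the change is applied.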

Additional Information

  • This procedure increases RabbitMQ's memory allocation. The change persists across upgrades and restarts.
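After the services are back up, one way to confirm that the ebs-app pods have stabilized is to watch their restart counts. The following is a rough sketch under assumptions: it presumes kubectl is usable on the appliance and that the pods run in the "prelude" namespace (inferred from the RabbitMQ node name and log paths in the Symptoms section). The function name pods_restarting_more_than is hypothetical.

```shell
# Print the name and restart count of pods whose RESTARTS value exceeds a limit.
# Expects `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
pods_restarting_more_than() {
  awk -v limit="$1" '$4 + 0 > limit + 0 { print $1, $4 }'
}

# Example (the "prelude" namespace is an assumption):
#   kubectl -n prelude get pods --no-headers | pods_restarting_more_than 3
```

If the restart count for ebs-app keeps rising after the change, check the RabbitMQ logs again for new memory resource limit alarms.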