After scaling out Aria Automation or replacing existing nodes, EBS pods restart



Article ID: 375859


Updated On:

Products

VMware Aria Suite

Issue/Introduction

  • After scaling out Aria Automation or replacing existing nodes, anything that depends on publishing or consuming events from the Event Broker Service (EBS) may break. This can impact template deployments, extensibility actions such as ABX actions, and vRO workflows.


Situation:

  • While performing a failover test, the test completes successfully on the newly added nodes but fails on node-1.
  • Redeploying node-1 has no effect, and the EBS pods on node-2 and node-3 restart several times (a way to check the restart counts is sketched after this list).
  • The service restarts from time to time because of delays in the readiness probe. The following error appears in the logs:
    • Mon Aug 19 04:21:35 AM UTC 2024
      {"status":"UP","details":{"readinessStateHealthIndicator":{"status":"UP"},"rabbitHealthContributor":{"status":"UP","details":{"version":"3.11.4"}},"diskSpaceHealthIndicator":{"status":"UP","details":{"total":151051448320,"free":96219418624,"threshold":10485760,"exists":true}},"livenessStateHealthIndicator":{"status":"UP"},"dbHealthContributor":{"status":"UP","details":{"database":"PostgreSQL","validationQuery":"isValid()"}},"ebsHealthIndicator":{"status":"UNKNOWN","details":{"reason":"Timeout","thread":"main-pool-34","elapsed":"PT9.794250824S"}}}}

Environment

VMware Aria Automation 8.16.x and later.

Cause

  • The issue stems from quorum queues, created by the Event Broker Service in RabbitMQ, not automatically replicating across newly added nodes in the cluster.
  • Because the queue replicas remain only on the original nodes, a crash or shutdown of one of those nodes after scaling out can leave the queues without quorum and lead to the failures described above (a way to inspect queue membership is sketched after this list).
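
To confirm the cause, the membership of the quorum queues can be inspected from inside one of the RabbitMQ pods. This is a sketch based on standard RabbitMQ tooling (the members/online info items and the quorum_status command are part of recent RabbitMQ releases, including the 3.11.x version shown in the logs); on an affected system the queues are expected to list only the original nodes as members:

  • run kubectl -n prelude exec -it rabbitmq-ha-0 -- rabbitmqctl list_queues -p / name type members online
  • run kubectl -n prelude exec -it rabbitmq-ha-0 -- rabbitmq-queues quorum_status "<queue-name>" --vhost "/"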

Impact:

  • Any functionality that relies on publishing or consuming events from the Event Broker Service may be disrupted. This includes template deployments, extensibility actions such as ABX actions, and vRO workflows.

Resolution

This is a known issue. There is currently no fix available.

 

Workaround:

The workaround is to manually replicate the queues across the RabbitMQ cluster by executing the following steps:

  • Ensure that all VMware Aria Automation nodes are up and running.
  • SSH to the first node (in the case of a scale-out) or to one of the original nodes (in the case of a node replacement), then:
    • run kubectl -n prelude exec -it rabbitmq-ha-0 -- /bin/bash
    • run rabbitmq-queues grow "rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local" "all" --vhost-pattern "/" --queue-pattern ".*"
    • run rabbitmq-queues grow "rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local" "all" --vhost-pattern "/" --queue-pattern ".*"
    • run rabbitmq-queues grow "rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local" "all" --vhost-pattern "/" --queue-pattern ".*"

These commands ensure that each queue has a replica on every node.
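
To verify that the replication has taken effect, the queue membership can be listed again; each queue should now show all rabbitmq-ha nodes as members. A minimal sketch, using the same tooling as above:

  • run kubectl -n prelude exec -it rabbitmq-ha-0 -- rabbitmqctl list_queues -p / name type members online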

Additional Information

More information about rabbitmq-queues grow can be found here: https://www.rabbitmq.com/docs/man/rabbitmq-queues.8#grow