After scaling out Aria Automation or replacing existing nodes, EBS pods restart
Article ID: 375859
Updated On:
Products
VMware Aria Suite
Issue/Introduction
After scaling out Aria Automation or replacing existing nodes, anything that depends on publishing or consuming events from the Event Broker Service (EBS) breaks. This can impact template deployments, extensibility actions such as ABX actions, and vRO workflows.
Situation:
During a failover test, the test completed successfully on the newly added nodes but failed on node-1.
Redeploying node-1 had no effect, and the EBS pods on node-2 and node-3 restarted a couple of times.
The service restarts from time to time because of delays in the readiness probe. The logs show the following health output, in which ebsHealthIndicator reports UNKNOWN with reason Timeout:
Mon Aug 19 04:21:35 AM UTC 2024
{"status":"UP","details":{"readinessStateHealthIndicator":{"status":"UP"},"rabbitHealthContributor":{"status":"UP","details":{"version":"3.11.4"}},"diskSpaceHealthIndicator":{"status":"UP","details":{"total":151051448320,"free":96219418624,"threshold":10485760,"exists":true}},"livenessStateHealthIndicator":{"status":"UP"},"dbHealthContributor":{"status":"UP","details":{"database":"PostgreSQL","validationQuery":"isValid()"}},"ebsHealthIndicator":{"status":"UNKNOWN","details":{"reason":"Timeout","thread":"main-pool-34","elapsed":"PT9.794250824S"}}}}
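To spot this condition in a captured health payload, a small text filter can pull out the status of ebsHealthIndicator. The following is a minimal sketch, assuming the JSON has already been saved or piped in exactly as shown above; the function name ebs_status is hypothetical, and how the payload is collected is not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): reads a health payload on
# stdin and prints the status reported by ebsHealthIndicator.
# Assumes the compact JSON layout shown in the log excerpt above.
ebs_status() {
  grep -o '"ebsHealthIndicator":{"status":"[A-Za-z_]*"' |
    sed 's/.*"status":"\([^"]*\)"/\1/'
}

# Example with a shortened copy of the payload from the log excerpt.
sample='{"status":"UP","details":{"ebsHealthIndicator":{"status":"UNKNOWN","details":{"reason":"Timeout"}}}}'
printf '%s\n' "$sample" | ebs_status
```

Any value other than UP for this indicator, combined with a Timeout reason, matches the symptom described in this article.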
Environment
VMware Aria Automation 8.16.x and later.
Cause
The issue stems from quorum queues, created by the Event Broker Service in RabbitMQ, not automatically replicating across newly added nodes in the cluster.
This results in a system failure if one of the original nodes crashes or shuts down after scaling out.
Impact:
Any functionality that relies on publishing or consuming events from the Event Broker Service may be disrupted. This includes template deployments, extensibility actions such as ABX actions, and vRO workflows.
Resolution
This is a known issue. There is currently no fix.
Workaround:
The workaround is to manually replicate the queues across the RabbitMQ cluster by executing the following steps:
1. Ensure that all VMware Aria Automation nodes are up and running.
2. SSH to the first node (after a scale-out) or to one of the original nodes (after a node replacement).
3. Open a shell in the first RabbitMQ pod:
   kubectl -n prelude exec -it rabbitmq-ha-0 -- /bin/bash
4. Inside the pod, run the following commands, one for each RabbitMQ node:
   rabbitmq-queues grow "rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local" "all" --vhost-pattern "/" --queue-pattern ".*"
   rabbitmq-queues grow "rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local" "all" --vhost-pattern "/" --queue-pattern ".*"
   rabbitmq-queues grow "rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local" "all" --vhost-pattern "/" --queue-pattern ".*"
These commands ensure that each queue has a replica on every node.
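The three grow commands differ only in the pod index, so they can be generated in a loop. The following is a minimal sketch, assuming a three-node cluster with pods rabbitmq-ha-0 through rabbitmq-ha-2 as in the steps above. The grow_commands function and the DRY_RUN variable are hypothetical names introduced for illustration. By default the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: names below (grow_commands, DRY_RUN) are illustrative assumptions.
NAMESPACE="prelude"
DOMAIN="rabbitmq-ha-discovery.prelude.svc.cluster.local"

# Print one "rabbitmq-queues grow" command per pod index passed as an argument.
grow_commands() {
  for i in "$@"; do
    printf '%s\n' "rabbitmq-queues grow \"rabbit@rabbitmq-ha-$i.$DOMAIN\" \"all\" --vhost-pattern \"/\" --queue-pattern \".*\""
  done
}

if [ "${DRY_RUN:-1}" = "0" ]; then
  # Execute each command inside the first RabbitMQ pod.
  grow_commands 0 1 2 | while IFS= read -r cmd; do
    kubectl -n "$NAMESPACE" exec rabbitmq-ha-0 -- /bin/bash -c "$cmd" </dev/null
  done
else
  # Default: print the commands without running them.
  grow_commands 0 1 2
fi
```

Run the script on the appliance node from step 2. Setting DRY_RUN=0 executes the commands through kubectl instead of printing them; adjust the index list if the cluster has a different number of RabbitMQ pods.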
Additional Information
For more information about rabbitmq-queues grow, see https://www.rabbitmq.com/docs/man/rabbitmq-queues.8#grow