One Aria Automation node from a 3 node cluster is down/unavailable and Provisioning is not functioning

search cancel

book

calendar_today

VMware Aria Suite

Symptoms:

one Aria Automation node is down / unavailable due to Infrastructure issues
Aria Automation portal is accessible
VM provisioning is taking a long time and eventually failing with errors about Event topics e.g.: "Failed to publish event to topic: Deployment requested"
reviewing Aria Automation services using command "kubectl -n prelude get pods -o wide" only 1 pods from one node are down
reviewing RabbitMQ status using below command, only 1 node shows as active node (Ref: Resolve RabbitMQ cluster issues in vRA 8.x deployment)
seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"
API calls to Aria Automation may fail with HTTP status 500 - Internal Server Error

Aria Automation 8.x

A cluster instability may be cause if one of the Aria Automation nodes went down, which may lead to issue in the Messaging Queue, RabbitMQ.

Due to RabbitMQ isolation, another RabbitMQ service was stopped, therefor the last working RabbitMQ service stopped handling any messages.

Restore the Aria Automation node, if Linux booted into Emergency console then please review this article:

To workaround the issue while the node is not working:

Before proceeding please take a Snapshot, including Memory, of the 2 available nodes from vCenter.

Identify current running RabbitMQ nodes:
kubectl -n prelude get pods -o wide | grep -Ei "name|rabbitmq"
Identify which pods are currently running the RabbitMQ application:
seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"

E.g.:

Only "rabbitmq-ha-0" is active, depending to which node is down "rabbitmq-ha-1" or "rabbitmq-ha-2" the opposite has to be started
Try to start the RabbitMQ application on the node which is available but not listed as "Running Nodes" from above command.

E.g.: Node 3 of the Aria Automation cluster has the outage, Node 2 is running but RabbitMQ not reporting
kubectl exec -n prelude rabbitmq-ha-1 -- bash -c "rabbitmqctl start_app"
Validate that now 2 nodes reporting as running using the same command as Step 2:
seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"
Validate provisioning is now proceeding by creating a new Request in Aria Automation portal

Was this article helpful?

thumb_up Yes

thumb_down No