Apply changes fails on pre-stop script for VMware Tanzu RabbitMQ [VMs] on 1.18.5-7

Products

VMware RabbitMQ

Issue/Introduction

When you apply changes to a BOSH-deployed RabbitMQ deployment, the update can fail on the pre-stop script of one of the RabbitMQ servers, as shown below:

Task 690226
Task 690226 | 23:38:57 | Preparing deployment: Preparing deployment (00:00:03)
Task 690226 | 23:39:00 | Preparing deployment: Rendering templates (00:00:06)
Task 690226 | 23:39:06 | Preparing package compilation: Finding packages to compile (00:00:01)
Task 690226 | 23:39:08 | Updating instance rabbitmq-haproxy: rabbitmq-haproxy/850ed5d1-6356-4dbd-9b0c-ff3a65ae456f (0) (canary)
Task 690226 | 23:39:08 | Updating instance rabbitmq-broker: rabbitmq-broker/9d06d068-acff-48bf-9beb-4f5f1e954bdc (0) (canary)
Task 690226 | 23:39:08 | Updating instance rabbitmq-server: rabbitmq-server/a209a20e-da71-406c-b422-b741739c9b64 (0) (canary)
Task 690226 | 23:40:04 | Updating instance rabbitmq-haproxy: rabbitmq-haproxy/850ed5d1-6356-4dbd-9b0c-ff3a65ae456f (0) (canary) (00:00:56)
Task 690226 | 23:40:09 | Updating instance rabbitmq-broker: rabbitmq-broker/9d06d068-acff-48bf-9beb-4f5f1e954bdc (0) (canary) (00:01:01)
Updating deployment:
Expected task '690226' to succeed but state is 'error'
Exit code 1
Task 690226 | 00:41:28 | Updating instance rabbitmq-server: rabbitmq-server/a209a20e-da71-406c-b422-b741739c9b64 (0) (canary) (01:02:20)
L Error: Action Failed get_task: Task 87d05110-2d6b-4d8b-7878-9ea272841560 result: 1 of 1 pre-stop scripts failed. Failed Jobs: rabbitmq-server.
Task 690226 | 00:41:28 | Error: Action Failed get_task: Task 87d05110-2d6b-4d8b-7878-9ea272841560 result: 1 of 1 pre-stop scripts failed. Failed Jobs: rabbitmq-server.
Task 690226 Started Tue Jun 30 23:38:57 UTC 2020
Task 690226 Finished Wed Jul 1 00:41:28 UTC 2020
Task 690226 Duration 01:02:31
Task 690226 error
===== 2020-07-01 00:41:30 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=172.16.25.15 --deployment=p-rabbitmq-4c82edd404a24fb8b45f deploy --no-redact /var/tempest/workspaces/default/deployments/p-rabbitmq-4c82edd404a24fb8b45f.yml"; Duration: 3821s; Exit Status: 1
Exited with 1.
Exited with 1

If you SSH to one of the RabbitMQ servers and run the pre-stop script independently (located at /var/vcap/jobs/rabbitmq-server/bin/pre-stop), you will see errors reporting that mirrors are not synchronized:

2020-07-02T11:20:20Z: Running pre-stop script
2020-07-02T11:20:20Z: Checking if node is quorum queue critical
Checking if node rabbit@b4dad2a41925429dd08f598709300c8a is critical for quorum of any quorum queues ...
Node rabbit@b4dad2a41925429dd08f598709300c8a reported no quorum queues with minimum quorum
2020-07-02T11:20:21Z: Checking if node is mirror queue critical
Checking if node rabbit@b4dad2a41925429dd08f598709300c8a is critical for data safety of any classic mirrored queues ...
queue 'message-filter.broadcast.0.29c6c341-8cc1-4bc2-bd14-0f1506fefd12.queue' in vhost '0699ca8d-e65f-4a2b-b6e3-77dc65688ef4' would lose its only synchronised replica (master) if node rabbit@b4dad2a41925429dd08f598709300c8a is stopped
queue 'message-filter.broadcast.0.d89ec67b-19bc-45c3-b74e-5ca7ffeb2446.queue' in vhost '4bfc9c7f-0eb0-4dad-acca-94b3266ec6bb' would lose its only synchronised replica (master) if node rabbit@b4dad2a41925429dd08f598709300c8a is stopped
queue 'message-filter.broadcast.0.932c9f7c-3c01-4480-b7fc-94a4360e4348.queue' in vhost '00fe0185-7a19-4f16-89f7-ee2d26b90148' would lose its only synchronised replica (master) if node rabbit@b4dad2a41925429dd08f598709300c8a is stopped
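The affected queues can be pulled out of output like the above with a small shell helper. This is a minimal sketch, not part of the product: the function names list_unsynced_queues and sync_listed_queues are hypothetical, it assumes queue and vhost names contain no single quotes, and it assumes rabbitmqctl is on the PATH of the user running it. Note that a manual sync blocks the queue while it runs.

```shell
#!/bin/bash
# Hypothetical helper: read pre-stop output on stdin and print one
# "vhost<TAB>queue" line for every queue reported as losing its only
# synchronised replica.
list_unsynced_queues() {
  awk -F"'" '/would lose its only synchronised replica/ { print $4 "\t" $2 }'
}

# Hypothetical helper: read "vhost<TAB>queue" lines on stdin and trigger a
# manual sync for each queue. Assumes rabbitmqctl is on the PATH.
sync_listed_queues() {
  while IFS="$(printf '\t')" read -r vhost queue; do
    # </dev/null keeps rabbitmqctl from consuming the loop's input.
    rabbitmqctl sync_queue -p "$vhost" "$queue" </dev/null
  done
}
```

For example, if the pre-stop output was saved to a file (pre-stop.log is an assumed name), `list_unsynced_queues < pre-stop.log` lists the affected vhost and queue pairs, and piping that into `sync_listed_queues` requests a sync for each one.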

Environment

Product Version: 1.18

Resolution

In versions 1.18.5, 1.18.6, 1.18.7, 1.19.1, and 1.19.2, the pre-stop script checks whether there are any unsynchronized mirrors in the cluster and waits up to 1 hour for them to sync. If the mirrors don't sync in that time, Apply Changes times out and fails. There are two options to work around the issue:

1. Find the queues in question and synchronize them. You can also have all queues synchronize automatically by adding "ha-sync-mode": "automatic" to your HA policies. This will automatically sync mirrors with their leaders.

2. Alter the pre-stop script on all the servers each time you want to recreate/update/restart the servers. You can change the script by adding "exit 0" as the second line of the script, as follows:

#!/bin/bash
exit 0 # Add this line to the script
set -eo pipefail
[ -z "$DEBUG" ] || set -x

This will bypass the check, allowing Apply Changes to continue. If there are exclusive queues in the cluster, then you must use option 2.
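Both options can be sketched as shell helpers. This is an illustration under stated assumptions rather than a supported procedure: the function names, the POLICY_NAME, QUEUE_PATTERN and APPLY variables, and the "ha-mode": "all" value are hypothetical placeholders, and rabbitmqctl is assumed to be on the PATH. The RabbitMQ policy key for automatic synchronization is ha-sync-mode. Because only one policy applies to a queue at a time, it is usually safer to add that key to the definition of your existing HA policy than to create a competing one.

```shell
#!/bin/bash
# Sketch only. POLICY_NAME, QUEUE_PATTERN and the "ha-mode" value below are
# hypothetical placeholders, not values taken from this article.

# Option 1: print the set_policy command for one vhost, or run it when
# APPLY=yes. Printing by default lets you review the command first.
set_auto_sync_policy() {
  local vhost="$1"
  local name="${POLICY_NAME:-ha-auto-sync}"
  local pattern="${QUEUE_PATTERN:-.*}"
  local definition='{"ha-mode":"all","ha-sync-mode":"automatic"}'
  if [ "${APPLY:-no}" = "yes" ]; then
    rabbitmqctl set_policy -p "$vhost" "$name" "$pattern" "$definition" --apply-to queues
  else
    printf "rabbitmqctl set_policy -p '%s' '%s' '%s' '%s' --apply-to queues\n" \
      "$vhost" "$name" "$pattern" "$definition"
  fi
}

# Option 2: insert the bypass as line 2 of a pre-stop script, keeping the
# original as <script>.bak.
BYPASS_LINE='exit 0 # Add this line to the script'
bypass_pre_stop() {
  local script="$1"
  # Skip if the bypass is already present, so repeated runs are harmless.
  if [ "$(sed -n '2p' "$script")" = "$BYPASS_LINE" ]; then
    return 0
  fi
  cp -p "$script" "$script.bak"
  awk -v line="$BYPASS_LINE" 'NR == 1 { print; print line; next } { print }' \
    "$script.bak" > "$script"
}
```

For option 1, call set_auto_sync_policy once per affected vhost (the vhost names appear in the pre-stop output) and review the printed command before re-running with APPLY=yes. For option 2, run `bypass_pre_stop /var/vcap/jobs/rabbitmq-server/bin/pre-stop` as a user with write access to that file on each RabbitMQ server.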

It is recommended to upgrade to 1.18.8+ or 1.19.3+, which allow operators to opt in to or out of the pre-stop script check in Ops Manager.

If you have any other questions about this issue, please contact Tanzu Support.