RabbitMQ Smoke Test failed during an upgrade.
• Failure [71.625 seconds]
Smoke tests
/var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/tests/smoke_tests_test.go:15
  pushes an app, sends, and reads a message from RabbitMQ: plan 'standard' [It]
  /var/vcap/packages/cf-rabbitmq-smoke-tests/src/rabbitmq-smoke-tests/tests/smoke_tests_test.go:82
  Expected <int>: 500 to be < <int>: 300
We checked the health of the RabbitMQ cluster by logging in to the management dashboard. The first node was using over 7 GB of memory, substantially above the high watermark. Checking the queues, we found 8 in a down state, labeled "NaN (not a number)".
The 8 queues in NaN state were removed with the following rabbitmqctl eval command:
rabbitmqctl eval 'Q = rabbit_misc:r(<<"/">>, queue, <<"queue-name">>), rabbit_amqqueue:internal_delete(Q, <<"cli">>).'
Note: Running this rabbitmqctl eval command deletes the queues together with any messages they hold. Those messages are lost.
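With several queues to remove, the same eval command has to be repeated once per queue. The following is a minimal sketch of how those commands could be generated, not a definitive procedure: the build_delete_expr function name and the queue names queue-a and queue-b are hypothetical placeholders, and the vhost is assumed to be "/" as in the command above. It only prints the commands so they can be reviewed before anything is deleted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the Erlang expression from the eval command
# above for a given vhost and queue name. Names containing quotes or
# backslashes are not escaped here and would need extra care.
build_delete_expr() {
  local vhost="$1" queue="$2"
  printf 'Q = rabbit_misc:r(<<"%s">>, queue, <<"%s">>), rabbit_amqqueue:internal_delete(Q, <<"cli">>).' "$vhost" "$queue"
}

# Placeholder queue names; replace with the queues shown as NaN.
# The loop prints one rabbitmqctl command per queue instead of running it.
for q in "queue-a" "queue-b"; do
  echo "rabbitmqctl eval '$(build_delete_expr "/" "$q")'"
done
```

Printing the commands first keeps the destructive step manual: each line can be checked against the list of affected queues and then run by hand.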
We needed this command because the queues could not be deleted via the RabbitMQ Management UI. In some cases, queues in a NaN state can be deleted via the management UI.
The apps that use the queues recreate them automatically as needed. We then restarted RabbitMQ:
rabbitmqctl stop_app
rabbitmqctl start_app
This flushed the excess memory in use on the primary node.
After performing these steps, you can re-run the operations which had failed due to RabbitMQ being down.
ALWAYS check the health of your RabbitMQ cluster before performing maintenance or upgrade tasks.
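As a rough illustration, such a pre-maintenance check could also be run from the command line instead of the dashboard. The sketch below makes several assumptions: the down_queues helper name is hypothetical, list_queues is assumed to print tab-separated name and state columns, and the affected queues are assumed to report the state "down" there. Verify each of these against your RabbitMQ version before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `rabbitmqctl list_queues name state` output on
# stdin and prints the names of queues whose state column is "down".
# Assumes tab-separated columns; lines without exactly two fields are ignored.
down_queues() {
  awk -F'\t' 'NF == 2 && $2 == "down" { print $1 }'
}

# Only query the broker when rabbitmqctl is available on this host.
if command -v rabbitmqctl >/dev/null 2>&1; then
  rabbitmqctl cluster_status    # cluster membership and running nodes
  rabbitmqctl status            # node status, including memory usage
  rabbitmqctl list_queues name state | down_queues
fi
```

If the last command prints any queue names, or the memory reported by status is above the high watermark, resolve that before starting the upgrade.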