RMQ federation exchange remains down after upgrade or cert rotation (during "Apply Change")

Article ID: 293247



Products

VMware RabbitMQ

Issue/Introduction

The federation exchange queue remains down after the RMQ servers restart during Apply Change, whether for a version upgrade or a leaf certificate rotation. The issue was observed on tile version 1.19 with RMQ 3.8.3, but it may also occur on other versions where a federation exchange queue is in place.

The servers started up properly, but the federation queue was missing, which left federation suspended. To resolve the issue, the exchange had to be deleted manually and then recreated.

In addition, each time this issue occurred, federation was suspended on only one of the two exchanges.

This article will answer the following questions:

1. What is the root cause for federation not resuming automatically after RMQ update?
2. Why is it that the federation in only one exchange remains down?

Environment

Product Version: 1.19

Resolution

This issue is described in the following GitHub issue: https://github.com/rabbitmq/rabbitmq-federation/issues/111


Problem

When the federation needs to move to another node, it uses a different key. This causes a crash. To recover, either recreate the federation or upgrade to RMQ 3.8.6.

This issue persists in 3.8.3.

The related messages inside the logs are below:

2020-08-24 05:36:00.255 [info] <0.713.0> Mirrored queue '3rd.requests.service.exchange-queue' in vhost 'PROD': Adding mirror on node [email protected]: <0.773.0>
 
** When handler state == []
   253: ** Reason == {{badmatch,{error,{{{badmatch,{error,{{unable_to_parse_uri,no_scheme},[138,......
 
2020-08-24 05:42:56.017 [info] <0.773.0> Mirrored queue '3rd.requests.service.exchange-queue' in vhost 'PROD': Slave <[email protected]> saw deaths of mirrors <[email protected]>
 
2020-08-24 05:42:56.019 [info] <0.773.0> Mirrored queue '3rd.requests.service.exchange-queue' in vhost 'PROD': Promoting slave <[email protected]> to master

The badmatch and the promotion appear only for 3rd.requests.service.exchange-queue, not for the second federation exchange queue.
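
To check whether a node's log shows the same pattern, a small script can look for the two markers seen in the excerpts above. The following is a minimal sketch, not an official diagnostic tool. The function name scan_log is hypothetical, and the matched strings are copied from the log lines in this article, so other RabbitMQ versions may word these messages differently.

```python
import re

# Patterns copied from the log excerpts in this article (assumption: the
# wording is the same on the broker version being inspected).
QUEUE_RE = re.compile(r"Mirrored queue '([^']+)' in vhost '([^']+)'")
BADMATCH_MARKER = "unable_to_parse_uri"
PROMOTION_MARKER = "Promoting slave"


def scan_log(lines):
    """Return (promoted, uri_crashes).

    promoted    -- set of (queue, vhost) pairs for which a promotion was logged
    uri_crashes -- number of lines containing the unable_to_parse_uri badmatch
    """
    promoted = set()
    uri_crashes = 0
    for line in lines:
        if BADMATCH_MARKER in line:
            uri_crashes += 1
        if PROMOTION_MARKER in line:
            match = QUEUE_RE.search(line)
            if match:
                promoted.add((match.group(1), match.group(2)))
    return promoted, uri_crashes
```

The function accepts any iterable of lines, so an open log file can be passed to it directly. A queue that appears in the promoted set on a node whose log also contains the badmatch is a likely candidate for the workaround below.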


Workaround

The workaround for the current version is to recreate the affected queues. For a permanent fix, upgrade to the next version that contains RabbitMQ 3.8.6. This RMQ version is not yet bundled into the 1.20 RMQ tile. The bundling will be completed in the next few weeks.
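
Before and after recreating the affected queues, it can help to confirm which federated exchanges actually have a running link. The sketch below assumes the rabbitmq_federation_management plugin is enabled, so that the management HTTP API serves GET /api/federation-links. The helper names, the base URL and credentials passed in, and the response field names used here (exchange, status) are assumptions to verify against your broker's actual response.

```python
import base64
import json
import urllib.request


def fetch_federation_links(base_url, user, password):
    """Fetch federation link status from the management HTTP API.

    base_url is the management endpoint, e.g. "http://host:15672"
    (hypothetical value; use your own).
    """
    request = urllib.request.Request(f"{base_url}/api/federation-links")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def exchanges_without_running_link(links, expected_exchanges):
    """Return expected exchange names that have no link in 'running' state.

    This covers both a link reported in a non-running state and a link
    that is missing from the response altogether.
    """
    running = {
        link.get("exchange")
        for link in links
        if link.get("status") == "running"
    }
    return [name for name in expected_exchanges if name not in running]
```

Comparing against a list of expected exchanges, rather than only filtering on status, matters here because in this issue the federation was missing outright and so would not be listed at all.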


The badmatch first occurred on [email protected], which later promoted itself to master. When the federation needed to move to another node, it used a different key, and this is what caused the crash.