RabbitMQ versions up to 4.x have a known class of problems that comes down to bindings becoming "inconsistent" between nodes, resulting in confusing routing behavior. It usually manifests with transient queues, but the problem lies with bindings, not with queues. It is also much more likely to affect topic exchange bindings, since storing those involves an additional table.
## RabbitMQ versions up to 4.x using the Mnesia database
There is a known issue with topic exchanges in Mnesia. It is resolved in RabbitMQ 4.0.x and later releases, which replace Mnesia with the Khepri database. A whole category of binding inconsistency issues is addressed by the stabilization of Khepri, a new metadata store that uses a tree of nested objects instead of multiple tables.
With Mnesia, the original metadata store, bindings are stored in two tables, one for durable
bindings (between durable exchanges and durable queues or streams) and another for semi-durable
and transient ones (where either the queue is transient or both the queue and the exchange are).
When a node was stopped or failed, all non-replicated transient queues hosted on it were deleted
by the remaining cluster peers. Due to high lock contention around these tables in Mnesia, this
could take a while. If the restarted (or failed) node came back online before all bindings
were removed, or clients began to create new bindings concurrently, the binding table
rows could end up inconsistent, resulting in obscure "binding not found" errors.
Khepri avoids this problem entirely by only supporting durable entities and by using a very different,
tree-based data model that makes binding removal much more efficient and free of lock contention.
Mnesia users can work around this problem by using quorum queues or durable classic queues together with durable exchanges: their durable bindings will not be removed when a node stops.
Queues that are transient in nature can instead be declared as durable classic queues with a TTL of a few hours.
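For illustration, here is a minimal sketch in Python with the `pika` client; the broker address, queue name, and the six-hour expiry are assumptions, not values from this article (`x-expires` is the queue TTL, expressed in milliseconds):

```python
import pika

# Declare a queue that is transient in nature as a durable classic queue with
# a queue TTL, so that its bindings land in the durable bindings table.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(
    queue="q.events.1",                            # hypothetical queue name
    durable=True,                                  # durable entity => durable binding
    arguments={"x-expires": 6 * 60 * 60 * 1000},   # deleted after 6 hours of disuse
)
connection.close()
```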
## Short-term solutions
A short-term solution is to remove the bindings between the exchange in question and the queues that exhibit the inconsistent routing behavior, then re-create them.
There are several ways to do this:
1. Use a script to unbind, wait for a second, then rebind the queues, for example, one by one (see the sketch after this list)
2. Like option 1, but combined with the import of a definitions file with the entire topology [6]
3. Like option 1, but with an alternative route between the same exchange and the four queues, created using an exchange-to-exchange binding [4] and/or an alternate exchange [5]
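A minimal sketch of option 1, driving the management plugin's HTTP API from Python with `requests`; the endpoint, credentials, exchange, and queue names are placeholders, and the one-second pause mirrors the "wait for a second" step above:

```python
import time
import requests

API = "http://localhost:15672/api"   # management plugin endpoint (assumed)
AUTH = ("guest", "guest")            # placeholder credentials
VHOST = "%2F"                        # the default vhost "/", URL-encoded
EXCHANGE = "x.events"                # hypothetical exchange name
QUEUES = ["q.events.1", "q.events.2", "q.events.3", "q.events.4"]

for queue in QUEUES:
    base = f"{API}/bindings/{VHOST}/e/{EXCHANGE}/q/{queue}"
    for binding in requests.get(base, auth=AUTH).json():
        # Unbind using the properties_key the API reports for this binding
        requests.delete(f"{base}/{binding['properties_key']}", auth=AUTH).raise_for_status()
        time.sleep(1)
        # Re-create the binding with the same routing key and arguments
        requests.post(
            base,
            json={
                "routing_key": binding["routing_key"],
                "arguments": binding.get("arguments", {}),
            },
            auth=AUTH,
        ).raise_for_status()
```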
Option 3 should help avoid any "routing disruption". It can be performed on every cluster node in sequence.
Note that creating a new route from the original exchange X to an intermediary exchange XI, and then on to the same set of four queues, will not create any duplicate messages: RabbitMQ delivers a message to a given queue at most once per publish, no matter how many bindings match. This topology can even target a different set of four queues as a test, and those additional copies of the messages can be discarded or moved to the production queues (the ones used by the applications) using dynamic shovels.
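Here is a sketch of that intermediary route with `pika`, again under assumed names; the exchange type and the per-queue topic patterns are placeholders that would be copied from the existing topology:

```python
import pika

SOURCE = "x.events"          # the exchange with inconsistent bindings (placeholder)
INTERMEDIARY = "x.events.i"  # the intermediary exchange XI (placeholder)
# Placeholder: each queue's original topic binding pattern(s)
QUEUE_PATTERNS = {
    "q.events.1": ["orders.*"],
    "q.events.2": ["orders.*"],
    "q.events.3": ["invoices.#"],
    "q.events.4": ["invoices.#"],
}

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare XI and route everything from X to it ("#" matches every routing key)
channel.exchange_declare(exchange=INTERMEDIARY, exchange_type="topic", durable=True)
channel.exchange_bind(destination=INTERMEDIARY, source=SOURCE, routing_key="#")

# Re-create each queue's original patterns against XI. A message that matches
# both the old route and this new one is still enqueued once per queue.
for queue, patterns in QUEUE_PATTERNS.items():
    for pattern in patterns:
        channel.queue_bind(queue=queue, exchange=INTERMEDIARY, routing_key=pattern)

connection.close()
```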
The same goes for alternate exchanges: using one will not create any duplicates as long as the target set of queues remains the same.
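One way to attach an alternate exchange without re-declaring the original one is a policy (`alternate-exchange` is the documented policy key for exchanges [5]). A sketch via the HTTP API, with placeholder names and credentials:

```python
import requests

API = "http://localhost:15672/api"
AUTH = ("guest", "guest")

# The alternate exchange itself must exist; a durable fanout is a common choice
requests.put(
    f"{API}/exchanges/%2F/x.events.alt",
    json={"type": "fanout", "durable": True},
    auth=AUTH,
).raise_for_status()

# Point the affected exchange at it via a policy
requests.put(
    f"{API}/policies/%2F/events-ae",
    json={
        "pattern": "^x\\.events$",    # match only the affected exchange
        "definition": {"alternate-exchange": "x.events.alt"},
        "apply-to": "exchanges",
        "priority": 10,
    },
    auth=AUTH,
).raise_for_status()
```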
## Long-term solution
Upgrade to `4.0.9` and enable Khepri (by turning on the `khepri_db` feature flag), then eventually upgrade to `4.1.0`.
Khepri uses a completely different data model, and this problem is addressed fundamentally: with Khepri there is a single tree to update instead of multiple tables updated in multiple transactions.
## References
1. https://github.com/rabbitmq/rabbitmq-server/discussions/13030
2. https://github.com/rabbitmq/rabbitmq-server/blob/main/release-notes/4.0.1.md#bug-fixes
3. https://github.com/rabbitmq/rabbitmq-server/discussions/5076
4. https://www.rabbitmq.com/docs/e2e
5. https://www.rabbitmq.com/docs/ae
6. https://www.rabbitmq.com/docs/definitions