When performing a rolling restart of a RabbitMQ cluster running on Kubernetes, one or more pods may fail to start and enter a crash loop with the error:
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0>
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0> BOOT FAILED
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0> ===========
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0> Exception during startup:
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0>
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0> exit:timeout_waiting_for_leader
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0>
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0> rabbit_khepri:setup/1, line 278
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0> rabbit:run_prelaunch_second_phase/0, line 396
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0> rabbit:start/2, line 922
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0> application_master:start_it_old/4, line 293
2025-07-08 09:54:46.580662+00:00 [error] <0.2762.0>
Later, the remaining pods can run into the same boot failure, which typically turns into an indefinite crash loop across the whole cluster.
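To confirm which pods are affected, it can help to check pod status and inspect the logs of a crashed pod (the namespace and pod names below are placeholders), for example:
kubectl -n <namespace> get pods
kubectl -n <namespace> logs <pod-name> --previous | grep -B 2 -A 10 "BOOT FAILED"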
This behavior is generally observed when Khepri is enabled as the metadata store. In that case, a majority of the cluster nodes must be online for a leader to be elected. As described in the RabbitMQ documentation (see <Restarting a Cluster Member>):
"
When a cluster member is restarted or stopped, the remaining nodes may lose their quorum. This may affect the ability to start a node.
For example, in a cluster of 5 nodes where all nodes are stopped, the first two starting nodes will wait for the third node to start before completing their boot and start to serve messages. That’s because the metadata store needs at least 3 nodes in this example to elect a leader and complete the initialization process. In the meantime the first two nodes wait and may time out if the third one does not appear.
"
However, in RabbitMQ versions below 4.0.6, the default timeout for Khepri leader election was only 30 seconds. Cluster members sometimes need more time than that to start, so the election fails and the node crashes with the timeout_waiting_for_leader error shown above.
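One way to confirm that Khepri is in use on an already-running node is to check the state of the khepri_db feature flag, for example:
kubectl -n <namespace> exec <pod-name> -- rabbitmqctl list_feature_flags name state | grep khepri_db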
Temporary Workaround
If the pods simply need more time to start, the khepri_leader_wait_retry_timeout setting can be increased. This is an advanced RabbitMQ setting; in versions below 4.0.6 it defaults to 30,000 milliseconds (30 seconds).
You can increase this value, for example to 300,000 milliseconds (5 minutes), to allow more time for leader election. For RabbitMQ clusters deployed via the RabbitMQ Kubernetes Operator, it can be configured as follows:
---
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
...
spec:
  rabbitmq:
    advancedConfig: |
      [
        {rabbit, [
          ...
          {khepri_leader_wait_retry_timeout, 300000}
        ]},
        ...
Once updated, apply the change with kubectl apply (or perform a rolling restart of the cluster), then confirm that it has taken effect:
kubectl -n <namespace> get rabbitmqcluster <cluster-name> -o yaml | grep timeout
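As a concrete sketch, assuming the manifest is saved locally as rabbitmq-cluster.yaml and the Operator's default naming (a StatefulSet called <cluster-name>-server), the steps might look like the following; the rendered config path can vary by Operator version:
kubectl -n <namespace> apply -f rabbitmq-cluster.yaml
kubectl -n <namespace> rollout restart statefulset <cluster-name>-server
kubectl -n <namespace> exec <cluster-name>-server-0 -- cat /etc/rabbitmq/advanced.config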
Resolution
To permanently resolve the issue, upgrade to RabbitMQ version 4.0.6 or higher, where the default timeout for Khepri leader election has been increased to 5 minutes. Alternatively, upgrade to RabbitMQ 4.1.0 or later, where the flawed retry mechanism has been improved, eliminating this issue altogether.
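For Operator-managed clusters, the upgrade typically amounts to updating spec.image in the RabbitmqCluster resource; for example (the image tag below is illustrative):
kubectl -n <namespace> patch rabbitmqcluster <cluster-name> --type merge -p '{"spec":{"image":"rabbitmq:4.0.6-management"}}'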