RabbitMQ log messages: node 'XXX' down: net_tick_timeout with quorum queues (Khepri/Raft)

VMware Tanzu Data Suite RabbitMQ VMware Tanzu RabbitMQ

In the affected RabbitMQ cluster, the following messages are observed in the logs:

[info] queue 'AAA' in vhost '/': Leader monitor down with noconnection, setting election timeout
[info] node 'BBB' down: net_tick_timeout
[warning] rabbit_sysmon_handler busy_dist_port

The question is whether tuning net_ticktime or Raft-related settings can prevent these events.

RabbitMQ 4.1.4 with Khepri metadata store enabled
Erlang/OTP 27
Node-to-node distribution configured with net_ticktime = 300
Cluster using quorum queues (Raft/Ra) for some workloads
Symptoms observed under high message-rate / heavy-traffic conditions on quorum queues

The message node 'BBB' down: net_tick_timeout means that Erlang distribution received no traffic from the peer node, including the periodic tick heartbeats, within the window governed by net_ticktime, and has therefore marked the peer as down.
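To make the timing concrete, the following is a rough sketch of the detection window as described in the Erlang kernel documentation, where an unresponsive peer is detected between net_ticktime minus one tick interval and net_ticktime plus one tick interval. It assumes the default net_tickintensity of 4; the function name is illustrative and not part of any RabbitMQ or Erlang tooling.

```python
def tick_detection_window(net_ticktime: float, net_tickintensity: int = 4) -> tuple[float, float]:
    """Return the (min, max) seconds before an unresponsive peer is declared down.

    Assumption: net_tickintensity is left at the Erlang default of 4, so a
    tick is sent every net_ticktime / 4 seconds.
    """
    if net_ticktime <= 0 or net_tickintensity <= 0:
        raise ValueError("net_ticktime and net_tickintensity must be positive")
    tick_interval = net_ticktime / net_tickintensity
    return net_ticktime - tick_interval, net_ticktime + tick_interval


if __name__ == "__main__":
    # Value from the affected environment: net_ticktime = 300
    print(tick_detection_window(300))  # (225.0, 375.0)
```

Under these assumptions, a net_tick_timeout with net_ticktime = 300 implies that the peer had been silent on the distribution link for roughly four to six minutes, which points to sustained congestion rather than a brief hiccup.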
Quorum queues rely on the Ra library (an implementation of Raft), which runs on top of Erlang distribution. Once the underlying distribution becomes unstable or congested, increasing net_ticktime further provides diminishing returns: Raft makes its own leader and health decisions based on message timing and majority, rather than directly on net_ticktime.
The warning rabbit_sysmon_handler busy_dist_port indicates that a process was suspended while sending to a remote node because the Erlang distribution port was busy. This typically means the inter-node links are under heavy load (for example, excessive traffic between nodes), which can delay both distribution heartbeats and Raft messages and contribute to timeouts.

Avoid relying solely on increasing net_ticktime or other Raft timing parameters to address these symptoms; beyond a certain point, this only delays failure detection and can make recovery slower and less predictable.
When creating streams, declare each queue with the argument x-queue-type = "stream" and apply the appropriate replication, leader election, and retention settings as described in the RabbitMQ Streams documentation.
Reduce load on the Erlang distribution and the Raft layer by:

Moving the heaviest-traffic quorum queues to streams, which use dedicated client connections and a different replication and storage model, thereby offloading traffic from Erlang distribution.
Limiting quorum queues to workloads that strictly require quorum semantics, and keeping their number and throughput within known, sustainable bounds for the cluster.

Monitor Erlang distribution and cluster health (for example, scheduler utilization, distribution link metrics, and Raft-specific metrics) to detect congestion and partial partitions before they result in net_tick_timeout conditions.
Review underlying network characteristics (latency, packet loss, jitter) and node resource usage (CPU, memory, file descriptors) to ensure that heartbeats and Raft traffic are not delayed by resource contention.
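As a lightweight complement to metrics-based monitoring, the three log messages discussed in this article can be tallied from a node's log to see whether busy_dist_port warnings are accumulating ahead of node-down events. This is a minimal sketch that assumes plain substring matching on log lines is sufficient; real log lines carry timestamps and process identifiers, and the function name is illustrative.

```python
import re

# Patterns for the three messages quoted in this article.
PATTERNS = {
    "leader_monitor_down": re.compile(r"Leader monitor down with noconnection"),
    "net_tick_timeout": re.compile(r"down: net_tick_timeout"),
    "busy_dist_port": re.compile(r"busy_dist_port"),
}


def tally_events(lines) -> dict:
    """Count how many lines match each pattern of interest."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[info] queue 'AAA' in vhost '/': Leader monitor down with noconnection, setting election timeout",
        "[info] node 'BBB' down: net_tick_timeout",
        "[warning] rabbit_sysmon_handler busy_dist_port",
    ]
    print(tally_events(sample))
```

A steadily rising busy_dist_port count while net_tick_timeout is still zero is the kind of early signal worth acting on before the cluster starts marking peers as down.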

