RabbitMQ log messages: node 'XXX' down: net_tick_timeout with quorum queues (Khepri/Raft)​

Article ID: 425563

Products

VMware Tanzu Data Suite RabbitMQ VMware Tanzu RabbitMQ

Issue/Introduction

The following messages appear in the logs of the affected RabbitMQ cluster:

  • [info] queue 'AAA' in vhost '/': Leader monitor down with noconnection, setting election timeout
  • [info] node 'BBB' down: net_tick_timeout
  • [warning] rabbit_sysmon_handler busy_dist_port

The question is whether tuning net_ticktime or Raft-related settings can prevent these events.
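
To gauge how often each of the three messages above occurs, a log file can be scanned for them. The following is a minimal sketch: the patterns are taken from the log lines quoted in this article, while the category names and the function name are hypothetical labels chosen for illustration. Real log lines usually carry timestamps and process identifiers, which the patterns ignore.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this article.
# The dictionary keys are labels invented for this sketch.
PATTERNS = {
    "leader_monitor_down": re.compile(r"Leader monitor down with noconnection"),
    "net_tick_timeout": re.compile(r"node \S+ down: net_tick_timeout"),
    "busy_dist_port": re.compile(r"rabbit_sysmon_handler busy_dist_port"),
}


def count_symptoms(lines):
    """Count how many log lines match each symptom pattern."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample input built from the messages shown in this article.
sample = [
    "[info] queue 'AAA' in vhost '/': Leader monitor down with noconnection, setting election timeout",
    "[info] node 'BBB' down: net_tick_timeout",
    "[warning] rabbit_sysmon_handler busy_dist_port",
    "[warning] rabbit_sysmon_handler busy_dist_port",
]
print(dict(count_symptoms(sample)))
```

A rising busy_dist_port count that precedes net_tick_timeout entries points to congestion on the inter-node links rather than a plain network outage.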

Environment

  • RabbitMQ 4.1.4 with the Khepri metadata store enabled
  • Erlang/OTP 27
  • Node-to-node distribution configured with net_ticktime = 300 (seconds; the Erlang default is 60)
  • Cluster using quorum queues (Raft/Ra) for some workloads
  • Symptoms observed on quorum queues under high message rates and heavy traffic

Cause

  • The message node 'BBB' down: net_tick_timeout indicates that distributed Erlang did not receive the required inter-node heartbeats (ticks) within the configured net_ticktime interval and therefore marked the peer node as down.
  • Quorum queues use the Ra library, an implementation of Raft that runs on top of Erlang distribution. Ra makes its own leader-election and health decisions based on message timing and majority agreement, not directly on net_ticktime. Once the underlying distribution is unstable or congested, raising net_ticktime further therefore yields diminishing returns.
  • The warning rabbit_sysmon_handler busy_dist_port is raised when a process is suspended because it tried to send over a busy distribution port. It suggests that the Erlang distribution links are under heavy load or scheduling pressure (for example, excessive traffic over inter-node links), which can delay both distribution heartbeats and Raft messages and contribute to timeouts.
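
To illustrate why a large net_ticktime mainly delays failure detection, the detection window can be worked out from the tick settings. The sketch below assumes the behaviour described in the Erlang kernel documentation: a tick is sent every net_ticktime / net_tickintensity seconds (intensity defaults to 4), and an unresponsive node is detected between net_ticktime minus one tick interval and net_ticktime plus one tick interval. The function name is hypothetical.

```python
def tick_detection_window(net_ticktime, tick_intensity=4):
    """Return (tick_interval, earliest_detection, latest_detection) in seconds.

    Assumes the formula from the Erlang kernel documentation:
    interval = net_ticktime / net_tickintensity, and detection happens
    within net_ticktime -/+ one interval.
    """
    if net_ticktime <= 0 or tick_intensity <= 0:
        raise ValueError("net_ticktime and tick_intensity must be positive")
    interval = net_ticktime / tick_intensity
    return interval, net_ticktime - interval, net_ticktime + interval


# Value from the Environment section of this article, then the Erlang default.
for ticktime in (300, 60):
    interval, earliest, latest = tick_detection_window(ticktime)
    print(
        f"net_ticktime={ticktime}: tick every {interval:.0f}s, "
        f"node declared down after {earliest:.0f}-{latest:.0f}s of silence"
    )
```

With net_ticktime = 300, a peer that has stopped responding is only declared down after roughly 225 to 375 seconds, compared with 45 to 75 seconds at the default of 60.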

Resolution

  • Do not rely solely on increasing net_ticktime or Raft timing parameters to address these symptoms; beyond a certain point, this only delays failure detection and can make recovery slower and less predictable.
  • Reduce load on the Erlang distribution and the Raft layer by:
    • Moving the heaviest-traffic quorum queues to streams, which use dedicated client connections and a different replication and storage model, thereby offloading traffic from Erlang distribution.
    • Limiting quorum queues to workloads that strictly require quorum semantics, and keeping their number and throughput within known, sustainable bounds for the cluster.
  • When creating streams, declare the queues with x-queue-type = "stream" and apply the appropriate replication, leader election, and retention settings as described in the RabbitMQ Streams documentation.
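
As an illustration of declaring a stream over AMQP 0-9-1, the sketch below builds the queue arguments and passes them to a channel. It assumes a client whose channel exposes queue_declare(queue=..., durable=..., arguments=...), as the pika library does; the queue name, the retention values, and the RecordingChannel stand-in are hypothetical and exist only so the example runs without a broker. x-max-length-bytes and x-max-age are stream retention arguments; consult the RabbitMQ Streams documentation for suitable values.

```python
def stream_queue_arguments(max_length_bytes=None, max_age=None):
    """Build optional arguments for a stream declaration."""
    args = {"x-queue-type": "stream"}
    if max_length_bytes is not None:
        # Retention by total size, in bytes.
        args["x-max-length-bytes"] = int(max_length_bytes)
    if max_age is not None:
        # Retention by age, for example "7D" for seven days.
        args["x-max-age"] = str(max_age)
    return args


def declare_stream(channel, name, **retention):
    """Declare a stream on a channel-like object (assumed pika-style API)."""
    # Streams are always durable, so the declaration must say so.
    return channel.queue_declare(
        queue=name,
        durable=True,
        arguments=stream_queue_arguments(**retention),
    )


class RecordingChannel:
    """Hypothetical stand-in for a real channel; it only records calls."""

    def __init__(self):
        self.calls = []

    def queue_declare(self, **kwargs):
        self.calls.append(kwargs)
        return kwargs


channel = RecordingChannel()
declare_stream(channel, "orders.events", max_length_bytes=20_000_000_000, max_age="7D")
print(channel.calls[0])
```

With a real client, the RecordingChannel instance would be replaced by a channel obtained from an open connection to the cluster.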

 

Additional Information

  • Monitor Erlang distribution and cluster health (for example, scheduler utilization, distribution link metrics, and Raft-specific metrics) to detect congestion and partial partitions before they result in net_tick_timeout conditions.
  • Review underlying network characteristics (latency, packet loss, jitter) and node resource usage (CPU, memory, file descriptors) to ensure that heartbeats and Raft traffic are not delayed by resource contention.
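
The network characteristics mentioned above can be summarized from round-trip-time probes between cluster nodes. The following is a rough sketch under stated assumptions: samples are collected by some external probe (not shown), a lost probe is recorded as None, and jitter is taken as the mean absolute difference between consecutive received samples. The function name and the sample values are illustrative only.

```python
from statistics import mean


def summarize_rtt(samples_ms):
    """Summarize RTT probes; a value of None marks a lost probe."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    received = [s for s in samples_ms if s is not None]
    loss_pct = 100.0 * (len(samples_ms) - len(received)) / len(samples_ms)
    latency = mean(received) if received else None
    # Jitter: mean absolute difference between consecutive received samples.
    # Lost probes are skipped, so a difference may span a gap.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter = mean(diffs) if diffs else None
    return {"loss_pct": loss_pct, "mean_ms": latency, "jitter_ms": jitter}


# Illustrative samples in milliseconds; the third probe was lost.
print(summarize_rtt([1.0, 3.0, None, 2.0]))
```

Sustained packet loss or jitter on inter-node links is worth investigating before changing any timeout values.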
