RabbitMQ: Erlang Distribution dashboards for network partition errors

Article ID: 403541

Products

VMware Tanzu RabbitMQ

Issue/Introduction

Network issues can be difficult to reason about, especially when they are persistent. Logs reveal underlying causes such as net_tick_timeout, connection.close, and, in an Mnesia-based cluster, inconsistent_database errors. In addition to logs, we recommend importing the Erlang Distribution dashboard into a cluster monitored by Prometheus and Grafana. The Erlang Distribution dashboard is one of the prebuilt Grafana dashboards for RabbitMQ and is briefly mentioned in the RabbitMQ Prometheus/Grafana documentation. This article includes screenshots of the cluster in both healthy and unhealthy states.

Resolution

The screenshots below show two clusters. In the healthy cluster, the number of established distribution links matches the total. In the disconnected cluster, both the number of established distribution links and the state of the distribution links (shown as orange squares) clearly show a disruption.

Healthy cluster:

Disconnected cluster:

Note that this applies to all versions of RabbitMQ. However, a Khepri-based cluster, available starting with RabbitMQ 4.0, is more resilient to network failures.
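
The comparison the dashboard makes between established and total distribution links can also be scripted. The following is a minimal sketch, not an official tool. It assumes the node's Prometheus endpoint exposes a metric named erlang_vm_dist_node_state with a peer label, and that a value of 3 means the link is up (1 = pending, 2 = up_pending); verify both against the output of your own nodes. The function names link_states and summarize, and the sample node names, are hypothetical.

```python
import re

# Assumption: erlang_vm_dist_node_state reports 3 when a link is up
# (1 = pending, 2 = up_pending). Check this against your own /metrics output.
UP = 3

_SAMPLE = re.compile(r'^erlang_vm_dist_node_state\{([^}]*)\}\s+(\S+)$')
_PEER = re.compile(r'peer="([^"]*)"')


def link_states(metrics_text):
    """Return {peer: state} parsed from Prometheus text-format output."""
    states = {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        peer = _PEER.search(match.group(1))
        if peer:
            states[peer.group(1)] = int(float(match.group(2)))
    return states


def summarize(states, expected_peers):
    """Compare established links with the peers this node should see."""
    established = sorted(p for p, s in states.items() if s == UP)
    not_up = sorted(set(expected_peers) - set(established))
    return {
        "established": len(established),
        "total": len(expected_peers),
        "healthy": not not_up,
        "not_up": not_up,
    }


# Hypothetical scrape from one node of a three-node cluster.
sample = """
# TYPE erlang_vm_dist_node_state gauge
erlang_vm_dist_node_state{peer="rabbit@node2"} 3
erlang_vm_dist_node_state{peer="rabbit@node3"} 1
"""

report = summarize(link_states(sample), ["rabbit@node2", "rabbit@node3"])
print(report)
```

The expected peers are passed in explicitly because a fully disconnected peer may not appear in the scrape at all, so counting only the samples present could hide a missing link.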
