Sometimes, network issues are difficult to reason about, especially when they are persistent. Logs reveal underlying causes such as net_tick_timeout, connection.close, and, in an Mnesia-based cluster, inconsistent_database errors. In addition to logs, we recommend importing the Erlang Distribution dashboard into a cluster monitored by Prometheus/Grafana. The Erlang Distribution dashboard is one of the prebuilt Grafana dashboards for RabbitMQ and is briefly mentioned in the RabbitMQ Prometheus/Grafana doc. This article includes a few screenshots that show both healthy and unhealthy states of the cluster.
The screenshots below show two cases: a healthy cluster, where the number of established distribution links matches the total, and a disconnected cluster, where the established link count and the state of the distribution links (shown as orange squares) clearly indicate a disruption.
Healthy cluster:
Disconnected cluster:
Note that this applies to all versions of RabbitMQ. However, a Khepri-based cluster, available starting with RabbitMQ 4.0, is more resilient to network failures.
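When the dashboard is not at hand, the same comparison (established links versus total links) can be approximated from raw metrics text scraped from a node. The following is a minimal sketch, not the dashboard's actual query. It assumes the link state is exposed as a gauge named `erlang_vm_dist_node_state` with a `peer` label, where the value 3 means the link is up; verify both the metric name and the state encoding against your own node's metrics output. The function name `summarize_dist_links` is hypothetical.

```python
import re

# Assumption: metric name and state encoding (3 = up). Check these against
# the metrics your node actually exposes before relying on the result.
DIST_STATE_METRIC = "erlang_vm_dist_node_state"
STATE_UP = 3

# Matches a labelled sample line: name{labels} value
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_PEER_RE = re.compile(r'peer="([^"]*)"')


def summarize_dist_links(metrics_text):
    """Return (established, total, not_up_peers) for one node's metrics text."""
    established = 0
    total = 0
    not_up_peers = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or match.group("name") != DIST_STATE_METRIC:
            continue
        peer_match = _PEER_RE.search(match.group("labels"))
        peer = peer_match.group(1) if peer_match else "unknown"
        total += 1
        if int(float(match.group("value"))) == STATE_UP:
            established += 1
        else:
            not_up_peers.append(peer)
    return established, total, not_up_peers
```

A healthy node reports `established == total` with an empty list of peers; any other result is a cue to look at the logs for the errors mentioned earlier. The total only covers peers that the node itself reports, so a peer that never connected may not be counted.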