Question about RabbitMQ missed heartbeat

Article ID: 426848

Updated On:

Products

VMware Tanzu Application Service

Issue/Introduction

An application with 10 container instances bound to a RabbitMQ service experienced a MissedHeartbeatException on a single instance following a brief network interruption.

Details:

  • Event: Scheduled maintenance on a Diego Cell resulted in a predicted 0.2-second network outage.

  • Symptom: Only one out of ten containers reported the following error: rabbitmq MissedHeartbeatException: Heartbeat missing with heartbeat = 60 seconds.

  • Observation: The remaining nine containers maintained stable connections despite the shared network event. There were no reported container crash events.

This raises the question: why was the error logged by only one container?

Cause

The 60 seconds in the error message is the configured heartbeat timeout, not the duration of the network interruption. Each container's connection sends its heartbeats on its own schedule, so only the container whose heartbeat was in flight during that specific 0.2-second window was affected. For that one connection, it was essentially a case of bad timing.

Resolution

Although the network outage lasted only 0.2 seconds, a gap that short can directly cause a 60-second heartbeat timeout if it coincides with a heartbeat.

Here is an example. The 0.2-second network interruption occurs right as the application in container A is attempting to send or receive a heartbeat packet, while the heartbeats of container B fall outside that window.

                                0.2s blip
                                   | |
TIME (sec)          0s             60s             120s
                    |---------------|---------------|
                    ^               ^               ^
(container A)    HB Sent     MISS!(TIMEOUT)      HB Sent

                                   | |
TIME (sec)          0s     30s     60s     90s     120s
                    |---------------|---------------|
                            ^               ^
(container B)            HB Sent         HB Sent
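
The timeline above can also be sketched in a few lines of Python. This is only an illustration of the simplified model used in this article (one heartbeat per 60-second period, and a heartbeat is lost if it falls inside the interruption window). The function names, the 59.9-second start of the window, and the 30-second offset for container B are assumptions made for the example, not values taken from the incident.

```python
def heartbeat_times(offset, interval, horizon):
    """Return the instants (in seconds) at which a container sends heartbeats."""
    times = []
    t = offset
    while t <= horizon:
        times.append(t)
        t += interval
    return times


def hit_by_blip(offset, interval, blip_start, blip_length, horizon):
    """Return True if any heartbeat instant falls inside the blip window."""
    blip_end = blip_start + blip_length
    return any(
        blip_start <= t <= blip_end
        for t in heartbeat_times(offset, interval, horizon)
    )


INTERVAL = 60.0                        # heartbeat period in the simplified model
BLIP_START, BLIP_LENGTH = 59.9, 0.2    # assumed 0.2 s window around the 60 s mark

# Container A sends at 0 s, 60 s, 120 s; container B (assumed) at 30 s and 90 s.
container_a_hit = hit_by_blip(0.0, INTERVAL, BLIP_START, BLIP_LENGTH, 120.0)
container_b_hit = hit_by_blip(30.0, INTERVAL, BLIP_START, BLIP_LENGTH, 120.0)

print("container A affected:", container_a_hit)
print("container B affected:", container_b_hit)
```

Only a container whose heartbeat instant lands inside the window is reported as affected, which matches the single-container symptom described above.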

 

The timeout is negotiated between the client and the RabbitMQ server at the time of connection. In a distributed environment like Tanzu Application Service (TAS), it is very common for clients to establish their connections at staggered, or "discrete," times, so each connection's heartbeat schedule is offset from the others.
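
As a rough illustration of the negotiation step, the sketch below follows the rule described in the RabbitMQ heartbeat documentation: when both sides request a non-zero timeout the lower value is used, and when one side requests 0 (heartbeats disabled) the non-zero value is used. The function name `negotiate_heartbeat` is hypothetical, and individual client libraries may implement the details differently.

```python
def negotiate_heartbeat(client_requested, server_suggested):
    """Return the effective heartbeat timeout in seconds (0 means disabled).

    Hypothetical helper that mirrors the documented negotiation rule;
    it is not part of any RabbitMQ client library.
    """
    if client_requested == 0 or server_suggested == 0:
        # One side asked to disable heartbeats: the non-zero value wins.
        return max(client_requested, server_suggested)
    # Both sides want heartbeats: the lower timeout is used.
    return min(client_requested, server_suggested)


# Example: both sides ask for 60 seconds, as in the error message above.
effective_timeout = negotiate_heartbeat(client_requested=60, server_suggested=60)
print("effective heartbeat timeout:", effective_timeout, "seconds")
```

Because the value is fixed per connection at connect time, each of the ten containers runs its own heartbeat schedule, starting from the moment its own connection was opened.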

Because the window is so small, the chance that any given container's heartbeat falls inside it is low, and the chance that every container hits it is lower still.
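
To put a rough number on this, the following sketch estimates the odds under a simplified model: each container sends one heartbeat per 60-second period, at a phase that is independent and uniformly distributed, and a heartbeat is lost only if it falls inside the 0.2-second window. These are assumptions made for illustration; real client libraries may send heartbeat frames more often than once per timeout period, so the figures indicate the order of magnitude only.

```python
from math import comb


def p_single_hit(blip_length, interval):
    """Probability that one container's heartbeat falls inside the blip."""
    return min(1.0, blip_length / interval)


def p_exactly_k(n, k, p):
    """Binomial probability that exactly k of n independent containers are hit."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


CONTAINERS = 10
p_hit = p_single_hit(blip_length=0.2, interval=60.0)    # about 0.33 %

p_none = p_exactly_k(CONTAINERS, 0, p_hit)
p_one = p_exactly_k(CONTAINERS, 1, p_hit)
p_at_least_one = 1 - p_none
p_two_or_more = 1 - p_none - p_one

print(f"single container hit:   {p_hit:.4%}")
print(f"at least one of ten:    {p_at_least_one:.4%}")
print(f"two or more of ten:     {p_two_or_more:.4%}")
```

Under these assumptions a single affected container is already an uncommon outcome (roughly 3%), and two or more affected containers is far rarer, which is consistent with the error appearing in only one of the ten instances.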