An application with 10 container instances bound to a RabbitMQ service experienced a MissedHeartbeatException on a single instance following a brief network interruption.
Details:
Event: Scheduled maintenance on a Diego Cell caused an expected network outage of roughly 0.2 seconds.
Symptom: Only one out of ten containers reported the following error: rabbitmq MissedHeartbeatException: Heartbeat missing with heartbeat = 60 seconds.
Observation: The remaining nine containers maintained stable connections despite the shared network event. There were no reported container crash events.
This raises the question: why did the error appear in only one container?
The "60 seconds" refers to the configured heartbeat timeout, not the length of the network issue. Because each container sends its heartbeats on its own schedule, only the container that happened to be transmitting during that specific 0.2-second window was affected. It was essentially bad timing for that one connection.
Although the 0.2s outage was merely the trigger, a gap that short can still directly cause a 60-second heartbeat timeout if it lands at exactly the wrong moment.
Here is an example: the 0.2s network interruption occurred just as the application was attempting to send or receive a heartbeat packet.
                            0.2s Blip
                                |
TIME (sec)     0s              60s             120s
               |----------------|----------------|
               ^                ^                ^
(container A)  HB Sent    MISS! (TIMEOUT)     HB Sent

TIME (sec)     0s      30s     60s      90s    120s
               |--------|-------|--------|-------|
                        ^                ^
(container B)        HB Sent          HB Sent
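The timelines above can be sketched as a small simulation. The container names, send times, and blip window below are illustrative values matching the diagram, not taken from real logs:

```python
# Simulate which heartbeat schedules intersect a short network blip.
# Container A heartbeats at 0s, 60s, 120s; container B at 30s and 90s;
# the 0.2-second blip covers 59.9s-60.1s.

def affected_by_blip(heartbeat_times, blip_start, blip_end):
    """Return True if any heartbeat falls inside the outage window."""
    return any(blip_start <= t <= blip_end for t in heartbeat_times)

schedules = {
    "container A": [0, 60, 120],   # sends exactly when the blip hits
    "container B": [30, 90],       # offset schedule misses the blip
}

blip = (59.9, 60.1)  # the 0.2-second network interruption

for name, times in schedules.items():
    hit = affected_by_blip(times, *blip)
    print(name, "MISSED heartbeat" if hit else "unaffected")
```

Only container A's schedule intersects the blip, which is exactly the single-container symptom seen in the logs.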
The timeout is negotiated between the client and the RabbitMQ server when the connection is established. In a distributed environment like Tanzu Application Service (TAS), clients connect at different moments, so their connection negotiations and heartbeat schedules are naturally staggered.
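As a sketch of the negotiation, the rule commonly documented for RabbitMQ is: when both peers propose a non-zero timeout, the lower value wins; a zero from one side means that peer wants heartbeats disabled, and the non-zero value generally takes effect. This helper is illustrative, not the broker's actual code:

```python
def negotiate_heartbeat(client_value, server_value):
    """Illustrative sketch of heartbeat timeout negotiation (seconds).

    Both non-zero -> the lower value is used.
    One side zero -> the non-zero value is used (zero disables only
    when both sides request it).
    """
    if client_value == 0 or server_value == 0:
        return max(client_value, server_value)
    return min(client_value, server_value)

print(negotiate_heartbeat(60, 580))  # -> 60
```

Because the value is fixed at connect time, every container in the app ends up with the same 60-second timeout even though their heartbeat phases differ.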
Because the vulnerable window is so small, the probability of any given container hitting it is low, which is why only one of the ten containers was affected.
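A rough back-of-the-envelope check supports this. Assuming one heartbeat frame per 60-second interval and uniformly random phases across containers (a simplification; real clients often send more frequently), the numbers work out as:

```python
# Rough probability that a heartbeat lands inside the 0.2s blip.
BLIP = 0.2          # seconds of outage
INTERVAL = 60.0     # seconds between heartbeat frames (simplified)
CONTAINERS = 10

p_one = BLIP / INTERVAL                 # chance for a single container
p_none = (1 - p_one) ** CONTAINERS      # no container affected
p_at_least_one = 1 - p_none             # at least one of the ten affected

print(f"per-container: {p_one:.2%}, at least one of 10: {p_at_least_one:.2%}")
```

The per-container chance is about 0.33%, and even across all ten containers the chance of at least one hit is only a few percent, so seeing exactly one affected instance is entirely consistent with bad timing rather than a deeper fault.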