Symptom 1: Intermittent 404s
This issue is often first noticed because some, but not all, gorouters will return 404s for routes to multiple healthy, running apps. This symptom is a lagging indicator of this issue.
Symptom 2: NATS Authorization Errors
During this issue, NATS logs may contain authorization errors. NATS sometimes recovers from these errors on its own, but not always. This symptom is not present in every case.
Symptom 3: High ms_since_last_registry_update Metric
This metric is the time in milliseconds since the last route registration message was received. It is emitted every 30 seconds. Route emitter sends route registration messages every 20 seconds. In some cases, this bug causes the metric to keep increasing until mitigation steps are taken. This symptom is not present in every case.
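One way to alert on this metric is to compare it against a multiple of the 20-second registration interval. The Python sketch below is illustrative only: the function name, the shape of the input mapping, and the tolerance of three missed intervals are assumptions, not values defined by gorouter or by this article.

```python
# Route emitter sends route registration messages every 20 seconds.
REGISTRATION_INTERVAL_MS = 20_000
# Assumed tolerance: number of missed intervals before a gorouter is flagged.
MISSED_INTERVALS_THRESHOLD = 3


def stale_gorouters(ms_since_update_by_router,
                    missed_intervals=MISSED_INTERVALS_THRESHOLD):
    """Return sorted names of gorouters whose metric value exceeds the limit.

    ms_since_update_by_router maps a gorouter name (hypothetical labels such
    as "router/0") to its latest ms_since_last_registry_update value.
    """
    limit_ms = REGISTRATION_INTERVAL_MS * missed_intervals
    return sorted(
        name
        for name, ms in ms_since_update_by_router.items()
        if ms > limit_ms
    )


if __name__ == "__main__":
    sample = {"router/0": 1_200, "router/1": 950_000}
    print(stale_gorouters(sample))
```

Because this symptom is not present in every case, an empty result from a check like this does not rule out the issue.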
Symptom 4: total_routes Discrepancy
This metric reports the current number of routes registered per gorouter. It is emitted every 30 seconds. All gorouters should have roughly the same number of routes. Small discrepancies are normal: one gorouter might receive a route registration message before another, or a gorouter might miss an unregistration message, in which case the old route stays in its routing table until it is pruned. However, large discrepancies (10+ routes) likely indicate a NATS split brain in which not all gorouters are receiving the same routes.
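The comparison described above can be sketched as a check on the spread between the highest and lowest total_routes values. This is a minimal Python sketch: the function names and the input mapping are hypothetical, and the threshold of 10 is taken from the "10+ routes" guidance in this article.

```python
# A gap of 10 or more routes is treated as a large discrepancy.
LARGE_DISCREPANCY = 10


def total_routes_spread(total_routes_by_router):
    """Return the gap between the largest and smallest total_routes values."""
    counts = list(total_routes_by_router.values())
    if not counts:
        return 0
    return max(counts) - min(counts)


def has_large_discrepancy(total_routes_by_router, threshold=LARGE_DISCREPANCY):
    """Return True when the spread across gorouters reaches the threshold."""
    return total_routes_spread(total_routes_by_router) >= threshold


if __name__ == "__main__":
    # Hypothetical gorouter names and route counts.
    sample = {"router/0": 412, "router/1": 410, "router/2": 371}
    print(total_routes_spread(sample), has_large_discrepancy(sample))
```

Comparing values taken from the same 30-second emission window helps avoid flagging a gap that only reflects routes registered between samples.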
Symptom 5: “backend-endpoint-failed” Certificate Errors
Operators might also notice a slew of “backend-endpoint-failed” logs with the message “incomplete request (tls: failed to verify certificate: x509: certificate is valid for process-guid-1, not process-guid-2)”, each followed by a “prune-failed-endpoint” log. This indicates that the routes on that gorouter are not being updated. As long as an app stays running in the same container with no changes, a gorouter can continue to route to it even after it stops getting route updates from NATS. However, once app containers are created and destroyed, the gorouter no longer learns about those changes. These logs appear when a gorouter attempts to talk to an app whose container has moved, and a different app is now running on the old port. In this case, the gorouter logs that the backend is not the app it expected and prunes the route. Once all app instances for that route are pruned, the gorouter starts returning 404s for that route. This symptom is not present in every case.
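To gauge how often a gorouter is hitting this failure, the log lines described above can be counted with plain substring matching. The Python sketch below assumes that the "backend-endpoint-failed" text and the x509 message appear on the same log line. It does not parse the gorouter log format, and the function name is hypothetical.

```python
def count_cert_mismatch_failures(log_lines):
    """Count lines that contain both markers of the certificate mismatch."""
    return sum(
        1
        for line in log_lines
        if "backend-endpoint-failed" in line
        and "x509: certificate is valid for" in line
    )


if __name__ == "__main__":
    # Made-up sample lines for illustration; real gorouter logs look different.
    sample = [
        "backend-endpoint-failed: incomplete request (tls: failed to verify "
        "certificate: x509: certificate is valid for process-guid-1, "
        "not process-guid-2)",
        "prune-failed-endpoint",
        "backend-endpoint-failed: connection refused",
    ]
    print(count_cert_mismatch_failures(sample))
```

A count that keeps rising on one gorouter while staying flat on the others is consistent with that gorouter no longer receiving route updates.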
We are not sure which version introduced this bug. It has been documented as occurring in TAS versions 2.11.45, 4.0.24, 4.0.28, and 6.0.9.
When NATS servers get overwhelmed, they close and then re-establish connections with the other NATS nodes. However, a bug in the reconnection algorithm can result in the nodes never reconnecting. We suspect that this is the root cause of this issue. Read here for more details.
There is no permanent resolution for this issue yet.
Mitigations
If your monitoring indicates that you are experiencing this issue, you should: