"nats-connection-error" in Gorouter when connecting to non-TLS NATS

Article ID: 298245


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

This article describes a rare bug in NATS connectivity, provides a workaround for the issue in the rare case it occurs, and outlines steps that help identify its cause.

In NATS releases v34 and v38, a rare bug has been observed where non-TLS NATS servers become unavailable for new connections but continue to service existing connections.


Symptoms

  • Any job that connects to non-TLS NATS shows repeated errors indicating there are no NATS servers available. Examples of jobs that connect to non-TLS NATS include route_emitter, route_registrar, gorouter, and service_discovery_controller.
  • In NATS v34, the NATS process does not appear to be failing on the NATS VM.
  • Starting in NATS v38, monit detects when NATS is in this state and marks the job as failing.
Here is an example log from gorouter that indicates that this bug is present:
router/237820bf-858b-40aa-9ef0-c803cd36972b:/var/vcap/sys/log/gorouter# tail -f gorouter.stdout.log
{"log_level":6,"timestamp":"2021-06-21T16:05:39.010266308Z","message":"nats-connection-error","source":"vcap.gorouter.nats","data":{"error":"nats: no servers available for connection"}}
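To gauge how often this error is repeating, you can count its occurrences in the same gorouter log (a simple check, assuming the log path shown above matches your router VM):
grep -c "nats-connection-error" /var/vcap/sys/log/gorouter/gorouter.stdout.log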

Here is an example log from the NATS job itself. This log repeats many times in the nats.stderr.log:
[6] 2021/06/18 10:55:45.282965 [ERR] Error trying to connect to route: dial tcp X.X.X.X:4223: i/o timeout

Port 4223 is the non-TLS cluster peer port for the NATS cluster. This error indicates that the NATS server cannot connect to its peer at IP X.X.X.X on port 4223.
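To confirm whether that peer port is actually unreachable at the network level, you can run a netcat check from the NATS VM logging the error against the peer's IP (this mirrors the client-port check shown later and assumes netcat is available on the VM):
netcat -vzw 3 X.X.X.X 4223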

Note: These error logs do not appear just once. While the bug is present, the NATS clients continuously try to reconnect and log the error on every failed attempt.

Another way to confirm that NATS is unavailable when it should be available is to netcat to a NATS VM IP on port 4222, the non-TLS client port for the NATS cluster. In the following command, X.X.X.X is a NATS VM IP:
netcat -vzw 3 X.X.X.X 4222

If this command times out even though the machine is listening on that port, the issue may be present.
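As a quick additional check, you can inspect the exit status of the same command; netcat exits 0 when the TCP connection succeeds and non-zero when it times out or is refused:
netcat -vzw 3 X.X.X.X 4222; echo $?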

Environment

Product Version: 2.10

Resolution

In either NATS version, manual intervention is most likely required to restore NATS to a functional state, as the bug is not self-recovering. Running monit restart for the NATS job restores functionality; however, it does not help with understanding why NATS got into this failing state. To help answer the "why," end the NATS process in a way that dumps its goroutine stack traces to stderr.


How to end the NATS process to force a dump of goroutine stack traces to stderr

Perform the following on each NATS VM:

1. SSH onto the NATS VM and switch to the root user:
$ bosh -d cf-0e1cab2c5bcfbcb648fb ssh nats/0
Using environment 'X.X.X.X' as client 'ops_manager'

Using deployment 'cf-0e1cab2c5bcfbcb648fb'

Task 4498. Done

<removed for brevity>

nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~$ sudo su -

2. Find the PID for the NATS process:
nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~# ps aux | grep nats.conf
root      4932  0.0  0.1  12944  1020 pts/0    S+   19:09   0:00 grep --color=auto nats.conf
vcap      8800  0.0  0.0   4364   688 ?        S<s  Jun09   0:23 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/gnatsd/bin/gnatsd -c /var/vcap/jobs/nats/config/nats.conf
vcap      8820  0.0  1.6 898836 16856 ?        S<l  Jun09  13:37 /var/vcap/packages/gnatsd/bin/gnatsd -c /var/vcap/jobs/nats/config/nats.conf 

In this example, the NATS PID is 8820.
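If pgrep is available on the stemcell (an assumption; the ps output above remains the authoritative reference), listing the matching command lines can make it easier to pick out the gnatsd process itself rather than the tini wrapper:
pgrep -af gnatsd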

3. Kill the NATS PID with signal 3 (SIGQUIT) so the goroutine stack traces are dumped to stderr:
nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~# kill -3 8820

Note: The NATS process should automatically restart and functionality should be restored. Full foundation functionality may not be restored until these steps are repeated on every NATS VM.
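To confirm that the goroutine stack traces were captured, check the NATS job's stderr log, the same nats.stderr.log referenced earlier (the path below assumes the usual BOSH log layout and may differ on your foundation):
tail -n 50 /var/vcap/sys/log/nats/nats.stderr.log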

4. Ensure NATS is running with a new PID:
nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~# monit summary
The Monit daemon 5.2.5 uptime: 13d 1h 23m

Process 'nats'                      running
Process 'nats-tls'                  running
Process 'loggregator_agent'         running
Process 'loggr-syslog-agent'        running
Process 'metrics-discovery-registrar' running
Process 'metrics-agent'             running
Process 'loggr-forwarder-agent'     running
Process 'prom_scraper'              running
Process 'bosh-dns'                  running
Process 'bosh-dns-resolvconf'       running
Process 'bosh-dns-healthcheck'      running
Process 'system-metrics-agent'      running
System 'system_localhost'           running
nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~# ps aux | grep nats.conf
vcap      4994  0.5  0.0   4364   684 ?        S<s  19:17   0:00 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/gnatsd/bin/gnatsd -c /var/vcap/jobs/nats/config/nats.conf
vcap      5009  2.7  1.4 898580 14608 ?        S<l  19:17   0:00 /var/vcap/packages/gnatsd/bin/gnatsd -c /var/vcap/jobs/nats/config/nats.conf
root      5017  0.0  0.1  12944  1056 pts/0    S+   19:17   0:00 grep --color=auto nats.conf

You can see that the NATS process is running and has a new PID, which is 5009.

After repeating this for all the NATS VMs, the non-TLS peers and clients should be able to reconnect. Obtain the log bundle for the NATS instance group and upload it to Tanzu Support.
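One way to obtain the log bundle is with bosh logs against the NATS instance group, substituting your own deployment name for the example used earlier:
bosh -d cf-0e1cab2c5bcfbcb648fb logs nats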


Workaround

R&D is currently working on a solution to this issue, and this KB will be updated as new information becomes available regarding patched versions. Until then, the workaround is to upgrade to a NATS release where this issue is less likely to show itself: NATS v38+. The reasons this release is less prone to the issue are discussed below. NATS v38+ also offers features that allow alerting to pick up the NATS process as failing (monit will report it as failing) and that provide more time to resolve the issue manually before any effects take place (pruning threshold configuration). To work around this issue, upgrade to a Tanzu Application Service (TAS) version that includes NATS v38+ until a future TAS release contains the patch. If this bug shows itself with NATS v38, please capture the kill -3 output and share the file with Tanzu Support.
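To confirm which NATS release version a foundation is currently running, you can list the deployments and the releases they use (a quick check; the exact output format depends on your BOSH CLI version). The Release(s) column should contain an entry such as nats/38:
bosh deployments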


Additional Information

  • TAS versions 2.9.2 - 2.9.13 utilize NATS v34
  • TAS versions 2.9.14+ utilize NATS v38+
  • TAS versions 2.10.0 - 2.10.5 utilize NATS v34
  • TAS versions 2.10.6+ utilize NATS v38+


Though we are still researching the reason for this issue, there are commonalities among the environments where it was observed. One common observation is underlying network issues.

  • NATS runs two servers per NATS VM: a TLS server and a non-TLS server. Both are clustered together. If there are two NATS VMs, then there are four NATS servers in the cluster (two TLS and two non-TLS).

  • Some clients are configured to connect to TLS NATS and some to non-TLS NATS. NATS has a feature called graph auto-discovery, which makes clients that connect to NATS aware of all servers in the cluster. In NATS v34, however, this means that TLS-configured NATS clients can connect to non-TLS NATS (these connections succeed) and non-TLS NATS clients can attempt to connect to TLS NATS (these connections fail for certificate reasons). This behavior is fixed in NATS v38, and it is one reason underlying network issues are believed to be a contributor to the issue.

  • This bug might be more likely when a network issue occurs and many clients reconnect to NATS simultaneously while also pushing many messages to the servers.


There are a few key benefits in NATS v38+ compared to NATS v34:

  • NATS clients (both TLS and non-TLS) will only be aware of NATS cluster peers that match their configuration. In NATS v34, all NATS servers (both TLS and non-TLS) are advertised to every client that connects. When the connection is interrupted, the client will try to reconnect to any peer in the cluster, which can be problematic for various reasons.

    For example, many TLS clients may connect to regular NATS, which can create an imbalance in how connections are distributed across the servers and increase the likelihood of this issue surfacing. In NATS v38 and v40, this is no longer possible, as TLS clients will not be aware of non-TLS NATS and non-TLS clients will not be aware of TLS NATS.

  • Monit will report NATS as failing if it gets into this state, as there is a monit healthcheck that tests TCP connections to NATS. If this issue surfaces, the TCP connections fail within minutes, which is a clear indicator that the issue is present.

  • The service discovery controller's (SDC) route pruning timer is now configurable and defaults to 10 minutes. In NATS v34, it was non-configurable and defaulted to 3 minutes. This means there is more time to react and catch the issue before internal routes are pruned. The setting is in the TAS tile under the Networking tab: "Internal routes staleness threshold (seconds)". This pruning scenario is very unlikely, as the SDC connection to NATS would have to remain uninterrupted while all Diego cell route emitters are interrupted for the duration of the bug.