router/237820bf-858b-40aa-9ef0-c803cd36972b:/var/vcap/sys/log/gorouter# tail -f gorouter.stdout.log
{"log_level":6,"timestamp":"2021-06-21T16:05:39.010266308Z","message":"nats-connection-error","source":"vcap.gorouter.nats","data":{"error":"nats: no servers available for connection"}}
[6] 2021/06/18 10:55:45.282965 [ERR] Error trying to connect to route: dial tcp X.X.X.X:4223: i/o timeout
netcat -vzw 3 X.X.X.X 4222
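If the VM can reach the NATS server, netcat reports success; if not, it gives up after the 3-second window set by -w 3. A rough illustration of both outcomes (the exact wording varies between netcat variants, so treat these lines as examples rather than literal expected output):

# Reachable NATS server (openbsd-netcat style output)
$ netcat -vzw 3 X.X.X.X 4222
Connection to X.X.X.X 4222 port [tcp/*] succeeded!

# Unreachable NATS server
$ netcat -vzw 3 X.X.X.X 4222
netcat: connect to X.X.X.X port 4222 (tcp) timed out: Operation now in progress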
$ bosh -d cf-0e1cab2c5bcfbcb648fb ssh nats/0
Using environment 'X.X.X.X' as client 'ops_manager'
Using deployment 'cf-0e1cab2c5bcfbcb648fb'
Task 4498. Done
<removed for brevity>
nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~$ sudo su -
nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~# ps aux | grep nats.conf
root      4932  0.0  0.1  12944  1020 pts/0  S+   19:09   0:00 grep --color=auto nats.conf
vcap      8800  0.0  0.0   4364   688 ?      S<s  Jun09   0:23 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/gnatsd/bin/gnatsd -c /var/vcap/jobs/nats/config/nats.conf
vcap      8820  0.0  1.6 898836 16856 ?      S<l  Jun09  13:37 /var/vcap/packages/gnatsd/bin/gnatsd -c /var/vcap/jobs/nats/config/nats.conf
nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~# kill -3 8820
nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~# monit summary
The Monit daemon 5.2.5 uptime: 13d 1h 23m

Process 'nats'                        running
Process 'nats-tls'                    running
Process 'loggregator_agent'           running
Process 'loggr-syslog-agent'          running
Process 'metrics-discovery-registrar' running
Process 'metrics-agent'               running
Process 'loggr-forwarder-agent'       running
Process 'prom_scraper'                running
Process 'bosh-dns'                    running
Process 'bosh-dns-resolvconf'         running
Process 'bosh-dns-healthcheck'        running
Process 'system-metrics-agent'        running
System 'system_localhost'             running

nats/f4534f81-cff7-434e-8431-bd47fdefb9ad:~# ps aux | grep nats.conf
vcap      4994  0.5  0.0   4364   684 ?      S<s  19:17   0:00 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/gnatsd/bin/gnatsd -c /var/vcap/jobs/nats/config/nats.conf
vcap      5009  2.7  1.4 898580 14608 ?      S<l  19:17   0:00 /var/vcap/packages/gnatsd/bin/gnatsd -c /var/vcap/jobs/nats/config/nats.conf
root      5017  0.0  0.1  12944  1056 pts/0  S+   19:17   0:00 grep --color=auto nats.conf
R&D is currently working on a solution to this issue, and this KB will be updated as patched versions become available. Until then, the workaround is to upgrade to a Tanzu Application Service (TAS) version that includes NATS release v38 or later, where this issue is much less likely to surface; the reasons are discussed below. NATS v38+ also lets alerting pick up a failing NATS process (monit will report it as failing) and provides more time to resolve the issue manually before any effects take place (configurable pruning thresholds). If this bug still appears with NATS v38+, please capture the kill -3 output and share the file with Tanzu Support.
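For context on the kill -3 step: SIGQUIT makes a Go binary such as gnatsd dump all goroutine stack traces to stderr and exit, after which monit restarts the process (as the monit summary and second ps listing above show). A minimal sketch of capturing that dump, assuming the standard bpm log layout of /var/vcap/sys/log/<job>/ (verify the path on your deployment before relying on it):

# On the nats VM, as root. The pgrep anchor matches only the gnatsd
# process itself, not the tini wrapper whose command line also
# contains the gnatsd path.
kill -3 "$(pgrep -f '^/var/vcap/packages/gnatsd/bin/gnatsd')"

# The stack dump lands on the process's stderr; the file name below is
# an assumption based on bpm's usual <process>.stderr.log convention.
cp /var/vcap/sys/log/nats/nats.stderr.log "/tmp/nats-sigquit-$(date +%s).log"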
Though the root cause is still being researched, there are commonalities across the environments where this issue was observed. One of them is underlying network problems.
NATS runs 2 servers per NATS VM: a TLS server and a non-TLS server. Both are clustered together, so with 2 NATS VMs there are 4 NATS servers in the cluster (2 TLS and 2 non-TLS).
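One way to see both servers on a NATS VM is to list the gnatsd processes and their listening sockets. A sketch (the listening port numbers are deployment-specific, so check the rendered job configs rather than assuming them; the route error earlier shows 4223 in use as a cluster route port):

# Both the nats and nats-tls jobs run their own gnatsd process;
# the [g] trick keeps grep from matching itself
ps aux | grep '[g]natsd'

# List listening TCP sockets; run as root so -p can resolve process names
ss -tlnp | grep gnatsd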
Some clients are configured to connect to the TLS NATS servers and some to the non-TLS servers. NATS has a feature called graph auto-discovery, which makes clients that connect to NATS aware of all servers in the cluster. In NATS v34, however, this means that TLS-configured NATS clients can connect to non-TLS NATS servers (these connections succeed) and non-TLS NATS clients can attempt to connect to TLS NATS servers (these connections fail due to certificate errors). This is fixed in NATS v38, and it is one reason underlying network problems are believed to contribute to the issue.
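The advertisement happens in the NATS wire protocol itself: when a client connects, the server's initial INFO message carries a connect_urls field listing the client-facing addresses of the cluster peers it knows about, and reconnect-capable clients add those to their server pool. A hedged illustration (the addresses and other field values here are made up, and the exact set of INFO fields varies by server version):

$ nc X.X.X.X 4222
INFO {"server_id":"EXAMPLE","version":"1.x.x","port":4222,"connect_urls":["10.0.1.10:4222","10.0.1.11:4222"]}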
This bug appears more likely to surface when a network issue occurs and many clients reconnect to the NATS servers simultaneously while also pushing many messages to them.
There are a few key benefits in NATS v38+ compared to NATS v34:
NATS clients (both TLS and non-TLS) are only made aware of NATS cluster peers that match their configuration. In NATS v34, all NATS servers (both TLS and non-TLS) are advertised to every client that connects, so when a connection is interrupted, the client will try to reconnect to any peer in the cluster. This can be problematic for several reasons.
For example, many TLS clients may reconnect to the non-TLS NATS servers, creating an imbalance in load distribution across the servers and increasing the likelihood of this issue surfacing. In NATS v38 and v40 this is no longer possible: TLS clients are not made aware of non-TLS NATS servers, and non-TLS clients are not made aware of TLS NATS servers.
Monit will report NATS as failing if it gets into this state, because a monit healthcheck tests TCP connections to NATS. When this issue surfaces, those TCP connections begin failing within minutes, which is a clear indicator that the issue is present; a rough sketch of such a check follows.
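For reference, a monit TCP healthcheck of this shape looks roughly like the stanza below. This is a sketch, not the exact configuration shipped in the NATS release (the pidfile path assumes bpm's usual layout); check /var/vcap/jobs/nats/monit on the VM for the real one:

check process nats
  with pidfile /var/vcap/sys/run/bpm/nats/nats.pid
  # Fail the check if a TCP connect to the NATS client port
  # does not succeed within the timeout for several cycles
  if failed host 127.0.0.1 port 4222 type tcp
    with timeout 10 seconds for 3 cycles
  then restart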
The Service Discovery Controller's (SDC) route-pruning timer is now configurable and defaults to 10 minutes; in NATS v34 it was non-configurable and defaulted to 3 minutes. This gives more time to react and catch the problem before internal routes are pruned. The setting is in the TAS tile under the Networking tab: "Internal routes staleness threshold (seconds)". Note that this pruning scenario is very unlikely, as it would require the SDC's connection to NATS to remain uninterrupted while all Diego cell route-emitters are interrupted for as long as the bug is present.
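To confirm which threshold a given deployment is actually running with, one option is to inspect the rendered SDC job configuration on the VM. A sketch, assuming the conventional cf-networking job name service-discovery-controller and the standard /var/vcap/jobs layout; the instance name and exact config filename are assumptions, hence the placeholder and the recursive grep:

$ bosh -d cf-0e1cab2c5bcfbcb648fb ssh <instance-running-sdc> \
    'sudo grep -ri stale /var/vcap/jobs/service-discovery-controller/config/'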