nats-tls-healthcheck job restarts after logging error "failed to connect to NATS server"

Article ID: 374936

Updated On: 08-19-2024

Products

VMware Tanzu Application Service

Issue/Introduction

The nats-tls-healthcheck job on a NATS instance restarts intermittently, even though its check target, nats-tls-wrapper, reports no failure.

bpm.log reports that nats-tls-healthcheck is not running:

[UTC May  11 14:16:51] error    : 'nats-tls-healthcheck' process is not running
[UTC May  11 14:16:51] info     : 'nats-tls-healthcheck' trying to restart
[UTC May  11 14:16:51] info     : 'nats-tls-healthcheck' start: /var/vcap/jobs/bpm/bin/bpm
[UTC May  11 14:16:52] info     : 'nats-tls-healthcheck' process is running with pid 51541
[UTC May  24 09:11:33] error    : 'nats-tls-healthcheck' process is not running
[UTC May  24 09:11:33] info     : 'nats-tls-healthcheck' trying to restart
[UTC May  24 09:11:33] info     : 'nats-tls-healthcheck' start: /var/vcap/jobs/bpm/bin/bpm
[UTC May  24 09:11:34] info     : 'nats-tls-healthcheck' process is running with pid 54912

About 5 to 10 seconds before each restart, healthcheck.stderr.log reports the error "failed to connect to NATS server: <detailed reason>":

2024/05/11 14:16:41 failed to connect to NATS server: nats: no servers available for connection
2024/05/24 09:11:24 failed to connect to NATS server: nats: no servers available for connection
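To confirm that each restart follows a connection failure, the two logs can be matched by timestamp. The following is a minimal sketch, not part of the product: the function names are hypothetical, the 15-second window is an assumed tolerance, and the year must be supplied by the caller because the restart lines shown above do not carry one.

```python
import re
from datetime import datetime, timedelta

# Matches the "process is not running" lines shown in the bpm.log excerpt.
RESTART_RE = re.compile(
    r"\[UTC\s+(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\]\s+error\s+:\s+"
    r"'nats-tls-healthcheck' process is not running"
)
# Matches the failure lines shown in the healthcheck.stderr.log excerpt.
FAILURE_RE = re.compile(
    r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) failed to connect to NATS server"
)


def parse_failures(lines):
    """Return the timestamps of connection failures."""
    found = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            found.append(datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S"))
    return found


def parse_restarts(lines, year):
    """Return the timestamps of 'not running' events; the year is an assumption."""
    found = []
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            month, day, clock = match.groups()
            stamp = f"{year} {month} {day} {clock}"
            found.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return found


def correlate(failures, restarts, window=timedelta(seconds=15)):
    """Pair each restart with the latest failure inside the preceding window."""
    pairs = []
    for restart in restarts:
        preceding = [f for f in failures if timedelta(0) <= restart - f <= window]
        pairs.append((restart, max(preceding) if preceding else None))
    return pairs
```

A restart paired with `None` has no failure in the preceding window and would point to a different cause than the one described in this article.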

Environment

TAS

2.11.26+
2.13.5+
4.x
5.x
6.x

Cause

nats-tls-healthcheck opens a NATS connection to the local nats-tls-wrapper on port 4224 every 10 seconds. If it fails to establish a connection, nats-tls-healthcheck logs "failed to connect to NATS server: <detailed error>" and exits. The Monit daemon then detects that the process has terminated and restarts the nats-tls-healthcheck job. This behavior is by design and is implemented in the nats-tls-healthcheck code.
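The check-and-exit pattern described above can be illustrated with a short sketch. This is not the actual nats-tls-healthcheck code: it assumes a plain TCP reachability test stands in for the real NATS client connection, and the function names, the 2-second timeout, and the `max_checks` and `sleep` parameters are hypothetical additions for illustration.

```python
import socket
import sys
import time


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_healthcheck(host="127.0.0.1", port=4224, interval=10.0,
                    max_checks=None, sleep=time.sleep):
    """Check the target every `interval` seconds; return 1 on the first failure.

    max_checks bounds the loop for demonstration; None means run until a failure.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not can_connect(host, port):
            # Mirrors the log message prefix quoted in this article.
            print(f"failed to connect to NATS server: {host}:{port} unreachable",
                  file=sys.stderr)
            return 1
        checks += 1
        sleep(interval)
    return 0
```

A wrapper would pass the return value to `sys.exit`, so a single failed check ends the process and leaves the restart to the process supervisor, which is the behavior the article describes.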

Resolution

If nats-tls-healthcheck fails only rarely for the above reason, a possible cause is a temporary CPU spike affecting the nats-tls-wrapper process.
If nats-tls-healthcheck fails frequently for the same reason, review resource usage on the NATS instance, especially CPU, and scale up accordingly if CPU usage remains high.
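To judge whether failures are rare or frequent, the failure lines in healthcheck.stderr.log can be counted per day. The sketch below is one way to do that; the function names are hypothetical and the default threshold of 5 failures per day is an arbitrary assumption, since this article does not define what counts as frequent.

```python
import re
from collections import Counter

# Captures the date portion of each failure line in healthcheck.stderr.log.
FAILURE_DATE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2}) \d{2}:\d{2}:\d{2} failed to connect to NATS server"
)


def failures_per_day(lines):
    """Count connection failures grouped by calendar date."""
    counts = Counter()
    for line in lines:
        match = FAILURE_DATE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def frequent_days(counts, threshold=5):
    """Return dates whose failure count meets the assumed threshold."""
    return sorted(day for day, n in counts.items() if n >= threshold)
```

Days returned by `frequent_days` are candidates for a closer look at CPU usage on the NATS instance.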
