The nats-tls-healthcheck job on a NATS instance restarts intermittently, even though no failure is reported for its check target, nats-tls-wrapper.
bpm.log reports that nats-tls-healthcheck is not running:
[UTC May 11 14:16:51] error : 'nats-tls-healthcheck' process is not running
[UTC May 11 14:16:51] info : 'nats-tls-healthcheck' trying to restart
[UTC May 11 14:16:51] info : 'nats-tls-healthcheck' start: /var/vcap/jobs/bpm/bin/bpm
[UTC May 11 14:16:52] info : 'nats-tls-healthcheck' process is running with pid 51541
[UTC May 24 09:11:33] error : 'nats-tls-healthcheck' process is not running
[UTC May 24 09:11:33] info : 'nats-tls-healthcheck' trying to restart
[UTC May 24 09:11:33] info : 'nats-tls-healthcheck' start: /var/vcap/jobs/bpm/bin/bpm
[UTC May 24 09:11:34] info : 'nats-tls-healthcheck' process is running with pid 54912
About 5 to 10 seconds before each restart, healthcheck.stderr.log reports the error "failed to connect to NATS server: <detailed reason>":
2024/05/11 14:16:41 failed to connect to NATS server: nats: no servers available for connection
2024/05/24 09:11:24 failed to connect to NATS server: nats: no servers available for connection
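To confirm that each restart follows a connection failure, the two logs can be matched by timestamp. The sketch below is illustrative only and is not part of the product. The function names and the 15-second window are hypothetical, the year is supplied by hand because the bracketed timestamps do not carry one, and both logs are assumed to be in UTC.

```python
from datetime import datetime, timedelta


def parse_restart_times(lines, year):
    """Timestamps of "process is not running" entries for nats-tls-healthcheck.

    The bracketed timestamps have no year, so the caller supplies it.
    """
    times = []
    for line in lines:
        if "'nats-tls-healthcheck' process is not running" not in line:
            continue
        # Text between the brackets, e.g. "UTC May 11 14:16:51"
        stamp = line[line.index("[") + 1 : line.index("]")]
        stamp = stamp.replace("UTC ", "", 1)
        times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def parse_failure_times(lines):
    """Timestamps of "failed to connect to NATS server" entries."""
    times = []
    for line in lines:
        if "failed to connect to NATS server" not in line:
            continue
        # The first 19 characters look like "2024/05/11 14:16:41"
        times.append(datetime.strptime(line[:19], "%Y/%m/%d %H:%M:%S"))
    return times


def correlate(restarts, failures, window_seconds=15):
    """Pair each restart with the latest failure in the preceding window.

    Returns (restart_time, failure_time_or_None) tuples.
    """
    window = timedelta(seconds=window_seconds)
    pairs = []
    for restart in restarts:
        preceding = [f for f in failures if timedelta(0) <= restart - f <= window]
        pairs.append((restart, max(preceding) if preceding else None))
    return pairs
```

A restart paired with `None` has no connection failure logged shortly before it, so it may have a different cause than the one described here.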
TAS
nats-tls-healthcheck opens a NATS connection to the local nats-tls-wrapper on port 4224 every 10 seconds. If it fails to establish the connection, it logs "failed to connect to NATS server: <detailed error>" and exits. The Monit daemon then detects the process termination and starts the nats-tls-healthcheck job again. This behavior is by design and is implemented in the nats-tls-healthcheck code.
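The check-and-exit behavior described above can be sketched as follows. This is a simplified illustration, not the actual nats-tls-healthcheck implementation: it tests only plain TCP reachability of port 4224, whereas the real job establishes a NATS connection. The function names and the 5-second timeout are assumptions made for the example.

```python
import socket
import sys
import time


def check_once(host="127.0.0.1", port=4224, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Assumption: a plain TCP connect stands in for the real NATS connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"failed to connect to NATS server: {exc}", file=sys.stderr)
        return False


def run(check=check_once, interval=10.0, sleep=time.sleep):
    """Repeat the check every `interval` seconds.

    Returns exit code 1 on the first failed check, leaving the restart
    to the process supervisor.
    """
    while check():
        sleep(interval)
    return 1
```

Calling `sys.exit(run())` from a script would reproduce the pattern: a single failed check ends the process, so one missed connection is enough to trigger a restart by the supervisor.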
If nats-tls-healthcheck fails only rarely for this reason, a possible cause is a temporary CPU spike affecting the nats-tls-wrapper process.
If nats-tls-healthcheck fails frequently for the same reason, review resource usage on the NATS instance, especially CPU. Scale up the instance if CPU usage remains high.