How to troubleshoot HealthWatch Health Status failures

Article ID: 413952

Products

VMware Tanzu Application Service

Issue/Introduction

When checking the HealthWatch Grafana Health Status charts, you may see the line drop from healthy to unhealthy and recover very quickly, as in the screenshot below.

The Health Status line goes down whenever any process monitored by monit is not running. In this case the process restarts very quickly, so by the time you run "monit summary" on the VM, every process will likely be reported as running and the output does not tell you which one failed. For example:

monit summary
The Monit daemon 5.2.5 uptime: 9d 23h 57m

Process 'cloud_controller_clock'    running
Process 'cc_deployment_updater'     running
Process 'loggregator_agent'         running
Process 'loggr-syslog-agent'        running
Process 'loggr-forwarder-agent'     running
Process 'loggr-system-metric-scraper' running
Process 'leadership-election'       running
Process 'loggr-syslog-binding-cache' running
Process 'prom_scraper'              running
Process 'metric_registrar_orchestrator' running
Process 'statsd_injector'           running
Process 'bosh-dns'                  running
Process 'bosh-dns-resolvconf'       running
Process 'bosh-dns-healthcheck'      running
Process 'system-metrics-agent'      running
Process 'otel-collector'            running
System 'system_8ffe74c5-ef64-4ee6-9fae-79e1a374bf04' running

 

Environment

All versions.

Resolution

Instead of going through every log folder in /var/vcap/sys/log to find the failing process, check /var/vcap/monit/monit.log, which records every process that monit found not running and restarted. For example:

[UTC Oct 23 14:07:30] error    : 'otel-collector' process is not running
[UTC Oct 23 14:07:30] info     : 'otel-collector' trying to restart
[UTC Oct 23 14:07:30] info     : 'otel-collector' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Oct 23 14:07:41] info     : 'otel-collector' process is running with pid 1319764
[UTC Oct 23 14:25:22] error    : 'otel-collector' process is not running
[UTC Oct 23 14:25:22] info     : 'otel-collector' trying to restart
[UTC Oct 23 14:25:22] info     : 'otel-collector' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Oct 23 14:25:34] info     : 'otel-collector' process is running with pid 1332586

In the example above, the otel-collector process was crashing periodically and restarting very quickly. Once you know which process is being restarted, go to its log folder (e.g. /var/vcap/sys/log/otel-collector/) and check its logs for the cause of the crash.
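
If monit.log is long, counting the "process is not running" entries per process can make the culprit easier to spot. The following is a minimal sketch, not an official tool: the function name summarize_monit_restarts is hypothetical, and it assumes the log lines have the same format as the sample above (process name in single quotes) and that you have permission to read the file, which on most VMs probably means running as root.

```shell
#!/usr/bin/env bash
# Sketch: count how often monit logged each process as "not running".
# summarize_monit_restarts is a hypothetical helper name, not a monit or BOSH command.
summarize_monit_restarts() {
  # Default path is the one named in this article; pass another file to override.
  local log_file="${1:-/var/vcap/monit/monit.log}"

  # Expected line format (assumption, taken from the sample above):
  #   [UTC Oct 23 14:07:30] error    : 'otel-collector' process is not running
  # Splitting on the single quote makes field 2 the process name.
  awk -F"'" '
    /process is not running/ { count[$2]++ }
    END { for (name in count) print count[name], name }
  ' "$log_file" | sort -rn
}
```

Calling summarize_monit_restarts with no argument reads /var/vcap/monit/monit.log. For the sample log shown above, it would print "2 otel-collector", since that excerpt contains two "process is not running" entries for that process. The process at the top of the list is the first candidate whose folder under /var/vcap/sys/log is worth checking.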