When checking the Health Status charts in HealthWatch Grafana, you may see the line drop from healthy to unhealthy and recover very quickly, as in the screenshot below.
The Health Status line goes down whenever any process monitored by monit is not running. In this case the process restarts very quickly, so by the time you run "monit summary" on the VM, everything will likely be up and running again and the output will not tell you what failed. For example:
monit summary
The Monit daemon 5.2.5 uptime: 9d 23h 57m
Process 'cloud_controller_clock' running
Process 'cc_deployment_updater' running
Process 'loggregator_agent' running
Process 'loggr-syslog-agent' running
Process 'loggr-forwarder-agent' running
Process 'loggr-system-metric-scraper' running
Process 'leadership-election' running
Process 'loggr-syslog-binding-cache' running
Process 'prom_scraper' running
Process 'metric_registrar_orchestrator' running
Process 'statsd_injector' running
Process 'bosh-dns' running
Process 'bosh-dns-resolvconf' running
Process 'bosh-dns-healthcheck' running
Process 'system-metrics-agent' running
Process 'otel-collector' running
System 'system_8ffe74c5-ef64-4ee6-9fae-79e1a374bf04' running
All versions.
Instead of going through every log folder in /var/vcap/sys/log to find the failing process, check /var/vcap/monit/monit.log, which shows which process is being restarted. For example:
[UTC Oct 23 14:07:30] error : 'otel-collector' process is not running
[UTC Oct 23 14:07:30] info : 'otel-collector' trying to restart
[UTC Oct 23 14:07:30] info : 'otel-collector' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Oct 23 14:07:41] info : 'otel-collector' process is running with pid 1319764
[UTC Oct 23 14:25:22] error : 'otel-collector' process is not running
[UTC Oct 23 14:25:22] info : 'otel-collector' trying to restart
[UTC Oct 23 14:25:22] info : 'otel-collector' start: /var/vcap/jobs/bpm/bin/bpm
[UTC Oct 23 14:25:34] info : 'otel-collector' process is running with pid 1332586
In the case above, the otel-collector process was crashing periodically and starting again very quickly. Once you know which process was restarted, go to its log folder (for example, /var/vcap/sys/log/otel-collector/) and check what is going on.
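If monit.log is long, a short script can tally how often each process was reported as not running. The following is a minimal sketch, not an official tool: the function names count_restarts and summarize are hypothetical, and the regular expression assumes the line format shown in the monit.log excerpt above, which may differ in other monit versions.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
# [UTC Oct 23 14:07:30] error : 'otel-collector' process is not running
NOT_RUNNING = re.compile(
    r"^\[[^\]]+\]\s+error\s*:\s*'(?P<name>[^']+)' process is not running"
)


def count_restarts(lines):
    """Count 'process is not running' events per process name."""
    counts = Counter()
    for line in lines:
        match = NOT_RUNNING.match(line)
        if match:
            counts[match.group("name")] += 1
    return counts


def summarize(path="/var/vcap/monit/monit.log"):
    """Print processes ordered by how often monit found them not running."""
    with open(path, errors="replace") as log:
        for name, count in count_restarts(log).most_common():
            print(f"{count:5d}  {name}")
```

Calling summarize() on the VM (reading the log may require root) prints one line per process with its count. The process at the top of the list is the first candidate to investigate under /var/vcap/sys/log.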