bosh-health-check is showing strange state in Grafana web UI


Article ID: 298335



Products

VMware Tanzu Application Service for VMs

Issue/Introduction

With Healthwatch v2, the bosh-health-check VM often shows a strange state on the VM Health panel of the System at a Glance dashboard in the Grafana web UI. For example:
  • The bosh-health-check VM metric is red for most of the query period, and occasionally empty or green.
bosh-health-check-issue.png
  • The bosh-health-check VM metric is empty at the beginning of the query period and green for the rest.
bosh-health-check-1hour.png
  • The bosh-health-check VM metric is not shown on the panel at all for the query period.
bosh-health-check-no-vm.png

Hovering the mouse cursor over the information icon at the top-left corner of the VM Health panel shows a detailed explanation of the VM health metric.
vm-health-1.jpeg
The VM Health chart shows the values of the system_healthy metric, which is created by BOSH and reports the state of the processes on the VM. If every process is running, it emits a value of 1. If even a single process is stopped, it reports a value of 0. The quirk here is that while BOSH is deleting a VM, it tends to emit an "unhealthy" metric for that VM. This is purely a matter of timing: the metric emission happens to occur while some of the processes are being stopped.
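
As a rough illustration of the rule described above, the following Python sketch models how a per-VM health value could be derived from process states. The function name, the process names, and the state strings are hypothetical and chosen only for illustration; this is not BOSH's actual implementation.

```python
def system_healthy(process_states):
    """Return 1 if every process is running, otherwise 0.

    process_states is a hypothetical mapping of process name -> state string.
    """
    return 1 if all(state == "running" for state in process_states.values()) else 0


# A VM with all processes up reports healthy.
steady = {"proc-a": "running", "proc-b": "running"}

# While a VM is being deleted, a sample taken mid-shutdown can catch
# one process already stopped, so the same rule yields 0.
mid_delete = {"proc-a": "running", "proc-b": "stopped"}

print(system_healthy(steady))      # 1
print(system_healthy(mid_delete))  # 0
```

Under this model, a short-lived VM such as bosh-health-check can report 0 simply because a sample was taken during its teardown, not because anything is wrong.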
 

As described in the Healthwatch documentation, the BOSH health metric exporter VM, bosh-health-exporter, creates a BOSH deployment called bosh-health every ten minutes. This BOSH deployment deploys another VM, bosh-health-check, that runs a suite of SLI tests to validate the functionality of the BOSH Director. After the SLI tests are complete, the BOSH health metric exporter VM collects the metrics from the bosh-health-check VM, then deletes the bosh-health deployment and the bosh-health-check VM.

Therefore, each time the bosh-health-check VM is created and deleted there is an opportunity to hit this quirk, which causes the odd bosh-health-check VM metric states shown on the panel. However, the bosh-health-check VM metric value displayed in VM Health has no correlation with the result of the BOSH Health Check SLI test, and it does not reflect the health of the BOSH Director either. To check the health of the BOSH Director, examine the BOSH Health Check panel instead.


Environment

Product Version: 2.11

Resolution

The product team will make an improvement in a future release so that the System at a Glance dashboard in the Grafana UI does not show metrics for the bosh-health-check VM.

As a workaround, clone the System at a Glance dashboard and update the query of the VM Health panel to the following, which removes bosh-health-check from the panel:
min by (exported_job) (system_healthy{origin="bosh-system-metrics-forwarder",exported_job!~"compilation-.*|bosh-health-check"})
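
To make the effect of the modified query easier to see, the Python sketch below emulates its two steps on made-up data: the negative regex matcher on exported_job, then the min aggregation per exported_job. The sample job names and values are hypothetical, the origin label filter is omitted for brevity, and the function is only an approximation of how Prometheus evaluates the expression.

```python
import re

# Pattern taken from the workaround query. Prometheus regex matchers are
# fully anchored, which re.fullmatch emulates here.
EXCLUDED = re.compile(r"compilation-.*|bosh-health-check")


def vm_health(samples):
    """Emulate: min by (exported_job) (system_healthy{exported_job!~...}).

    samples is a list of (exported_job, value) pairs (hypothetical data).
    """
    result = {}
    for job, value in samples:
        if EXCLUDED.fullmatch(job):
            continue  # dropped by the !~ matcher
        result[job] = min(value, result.get(job, value))
    return result


samples = [
    ("diego_cell", 1),
    ("diego_cell", 0),
    ("router", 1),
    ("bosh-health-check", 0),
    ("compilation-abc123", 0),
]
print(vm_health(samples))  # {'diego_cell': 0, 'router': 1}
```

Because the matcher is anchored, only a job named exactly bosh-health-check, or one starting with compilation-, is excluded; every other job still appears on the panel, and a single unhealthy instance is enough to turn that job's value to 0.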