Hovering the mouse cursor over the information icon at the top-left corner of the VM Health panel shows a detailed explanation of the VM health metric.
The VM Health chart shows the values of the system_healthy metric, which is created by BOSH and reports the state of the processes on the VM. If every process is running, it emits a value of 1. If even a single process is stopped, it emits a value of 0. The quirk here is that as BOSH deletes a VM, it tends to emit an "unhealthy" metric for that VM while it is being deleted. This is purely timing-based: the metric emission happens to occur while some of the processes are being stopped.
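The described behavior can be summarized in a minimal sketch (this is an illustration of the semantics, not BOSH source code): the metric is 1 only when every monitored process is running, and drops to 0 as soon as any single process stops.

```python
# Sketch of the system_healthy semantics described above (not BOSH code).
# The process names used here are hypothetical examples.
def system_healthy(process_states):
    """process_states: dict mapping process name -> state string.

    Returns 1 if every process is 'running', otherwise 0.
    """
    return 1 if all(state == "running" for state in process_states.values()) else 0

print(system_healthy({"rep": "running", "route_emitter": "running"}))  # 1
print(system_healthy({"rep": "running", "route_emitter": "stopped"}))  # 0
```

This is why a VM mid-deletion can briefly report 0: its processes are stopped one by one before the VM itself disappears.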
As described in the Healthwatch documentation, the BOSH health metric exporter VM, bosh-health-exporter, creates a BOSH deployment called bosh-health every ten minutes. This deployment deploys another VM, bosh-health-check, that runs a suite of SLI tests to validate the functionality of the BOSH Director. After the SLI tests are complete, the BOSH health metric exporter VM collects the metrics from the bosh-health-check VM, then deletes the bosh-health deployment and the bosh-health-check VM.
Therefore, every time the bosh-health-check VM is created and deleted there is a window in which this quirk can be hit, producing the odd symptom of bosh-health-check VM metrics appearing on the panel. However, the bosh-health-check VM metric value displayed in VM Health has no correlation with the result of the BOSH Health Check SLI test, nor does it reflect the health state of the BOSH Director. To assess BOSH Director health, examine the BOSH Health Check panel instead.
To keep these transient VMs off the panel, the VM Health query can exclude them with a negative regex matcher on the exported_job label:

```
min by (exported_job) (system_healthy{origin="bosh-system-metrics-forwarder",exported_job!~"compilation-.*|bosh-health-check"})
```
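As a quick sanity check of what that exported_job matcher drops, the sketch below mirrors it in Python. Note that Prometheus regex label matchers are fully anchored, so re.fullmatch is the correct analogue; the job names used here are illustrative.

```python
import re

# Mirrors the PromQL matcher exported_job!~"compilation-.*|bosh-health-check".
# Prometheus anchors regex matchers to the whole label value, hence fullmatch.
EXCLUDE = re.compile(r"compilation-.*|bosh-health-check")

def excluded_from_panel(job):
    """Return True if this exported_job value is dropped by the matcher."""
    return EXCLUDE.fullmatch(job) is not None

print(excluded_from_panel("bosh-health-check"))   # True  -> hidden from panel
print(excluded_from_panel("compilation-abc123"))  # True  -> hidden from panel
print(excluded_from_panel("diego_cell"))          # False -> shown on panel
```

Compilation VMs are excluded for the same reason as bosh-health-check: they are short-lived by design, so their system_healthy values tell you nothing about platform health.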