Scrape and Alert on Metrics from the Bosh Director with Healthwatch 2


Article ID: 293797


Updated On:

Products

Operations Manager

Issue/Introduction

How do I scrape and alert on metrics from the bosh director using Healthwatch 2?

For example, I want to alert on any unresponsive agents that the bosh director is reporting.

Environment

Product Version: 2.10

Resolution

Note: Currently the director metrics endpoint only exists in Ops Manager versions 2.10 and above.


To alert on unresponsive agents, configure two things in the Healthwatch 2 tile: a custom scrape job under the Prometheus configuration, and a custom alerting rule under the Alertmanager configuration.

Step 1. SSH to the director VM and run the following command:

bosh/0:~$ curl -ks https://<director ip>:9091/metrics --cacert /var/vcap/jobs/director/config/metrics_server/ca.pem --cert /var/vcap/jobs/director/config/metrics_server/certificate.pem --key /var/vcap/jobs/director/config/metrics_server/private_key.key | grep unresponsive

# TYPE bosh_unresponsive_agents gauge
# HELP bosh_unresponsive_agents Number of unresponsive agents per deployment
bosh_unresponsive_agents{name="cf-*"} 1.0
bosh_unresponsive_agents{name="p-healthwatch2-pas-exporter-*"} 0.0
bosh_unresponsive_agents{name="bosh-health"} 0.0
bosh_unresponsive_agents{name="p-healthwatch2-*"} 0.0


Note the following from the director:

  1. The contents of the CA, certificate, and private key files referenced in the command above (ca.pem, certificate.pem, and private_key.key). These are used to set up the custom scraper.
  2. The query that will be used for the custom alert: bosh_unresponsive_agents{name="<cf-deployment>"}
  • Note: The output above is filtered with grep. Remove the grep to see which other metrics can be used for custom alerting.
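To make the shape of that output concrete, here is a minimal sketch that parses the metric lines and lists the deployments reporting unresponsive agents. The sample text is copied from the curl output above; the function names are hypothetical and chosen only for illustration, and the parser handles only this one metric, not the full Prometheus text format.

```python
import re

# Sample text copied from the curl output shown in Step 1.
SAMPLE = '''\
# TYPE bosh_unresponsive_agents gauge
# HELP bosh_unresponsive_agents Number of unresponsive agents per deployment
bosh_unresponsive_agents{name="cf-*"} 1.0
bosh_unresponsive_agents{name="p-healthwatch2-pas-exporter-*"} 0.0
bosh_unresponsive_agents{name="bosh-health"} 0.0
bosh_unresponsive_agents{name="p-healthwatch2-*"} 0.0
'''

# Matches lines such as: bosh_unresponsive_agents{name="cf-*"} 1.0
LINE_RE = re.compile(r'^bosh_unresponsive_agents\{name="([^"]*)"\}\s+(\S+)$')


def parse_unresponsive_agents(text):
    """Return {deployment name: unresponsive agent count} (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# TYPE" / "# HELP" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] = float(match.group(2))
    return counts


def deployments_over_threshold(counts, threshold=0.0):
    """Return the sorted deployment names whose count exceeds the threshold."""
    return sorted(name for name, value in counts.items() if value > threshold)


counts = parse_unresponsive_agents(SAMPLE)
print(deployments_over_threshold(counts))  # prints ['cf-*']
```

The `value > threshold` comparison mirrors the `> 0` condition used in the alerting rule in Step 3.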

 

Step 2. Under the Healthwatch 2 tile -> Prometheus Configuration -> Additional Scrape Config Jobs, click Add. For the TSDB Scrape job, enter:

job_name: director_scrape
metrics_path: /metrics
scheme: https
static_configs:
- targets:
  - "<director ip>:9091"


For the TLS Config Certificate Authority and the TLS Config Certificate and Private Key, paste in the CA, certificate, and private key you saved in Step 1.

Save your changes.

For reference:

https://techdocs.broadcom.com/us/en/vmware-tanzu/platform-services/healthwatch-for-vmware-tanzu/2-3/healthwatch/configuring-configuring-healthwatch.html


Step 3. Under the Healthwatch 2 tile -> Alertmanager configuration, append the following rule to the end of your existing alerting rules (you can change the values of the name and alert keys to whatever you want):

##### BEGIN CUSTOM DIRECTOR ALERTING RULES #####
  - name: CustomDirector
    rules:
      - alert: UnresponsiveAgents
        expr: 'bosh_unresponsive_agents{} > 0'
        for: 3m
        annotations:
          summary: "Unresponsive agents"
          description: |
            Unresponsive agents
##### END CUSTOM DIRECTOR ALERTING RULES #####


Click save. Apply changes.
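
The for: 3m clause in the rule above means the alert does not fire on the first evaluation where the expression is true. It stays pending until the expression has been true continuously for three minutes. The following is a simplified model of that behavior, not Prometheus code; the function name and the sample timestamps are hypothetical, and it assumes one sample per rule evaluation.

```python
def alert_state(samples, threshold=0.0, for_seconds=180):
    """Return 'inactive', 'pending', or 'firing' as of the last sample.

    samples: list of (unix_seconds, value) pairs in ascending time order,
    one pair per rule evaluation (an assumption of this sketch).
    """
    active_since = None  # time at which the expression first became true
    state = "inactive"
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            # The alert fires only once the condition has held for the full duration.
            if timestamp - active_since >= for_seconds:
                state = "firing"
            else:
                state = "pending"
        else:
            # Any evaluation where the expression is false resets the timer.
            active_since = None
            state = "inactive"
    return state


# Condition true for 2 minutes so far: still pending.
print(alert_state([(0, 1.0), (60, 1.0), (120, 1.0)]))             # prints pending
# Condition true for a full 3 minutes: firing.
print(alert_state([(0, 1.0), (60, 1.0), (120, 1.0), (180, 1.0)]))  # prints firing
```

A shorter for value alerts sooner but is more likely to fire on brief agent hiccups, such as during a deployment update.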



Step 4. Optionally, you can access the Prometheus and Alertmanager UIs to test that the alert is working. For instructions on accessing them, see:

https://techdocs.broadcom.com/us/en/vmware-tanzu/platform-services/healthwatch-for-vmware-tanzu/2-3/healthwatch/troubleshooting.html
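
As an alternative to the UI, the scraped metric can also be checked through the standard Prometheus HTTP API endpoint /api/v1/query. The sketch below makes several assumptions: the base URL prometheus.example:9090 is a placeholder, the function names are hypothetical, and your Healthwatch Prometheus instance may require client certificates, in which case an ssl.SSLContext loaded with them would need to be passed in. Only the URL building and response parsing are exercised here; no network call is made.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(body):
    """Return {value of the 'name' label: sample value} from a query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    results = {}
    for item in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value as a string"].
        results[item["metric"].get("name", "")] = float(item["value"][1])
    return results


def query_prometheus(base_url, expr, ssl_context=None):
    """Run the query over HTTP(S). Not called in this sketch.

    ssl_context is an optional ssl.SSLContext, for example one loaded
    with a client certificate if your Prometheus instance requires it.
    """
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, context=ssl_context) as response:
        return parse_instant_vector(response.read().decode("utf-8"))


# Placeholder address; substitute the address of your Prometheus instance.
print(build_query_url("https://prometheus.example:9090", "bosh_unresponsive_agents > 0"))
```

If the scrape job from Step 2 is working, a query for bosh_unresponsive_agents should return one sample per deployment name.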