Note: The director metrics endpoint is available only in Ops Manager 2.10 and later.
To alert on unresponsive agents, configure two things in the Healthwatch 2 tile: a custom scrape job under the Prometheus configuration and a custom alert under the Alertmanager configuration:
Step 1. SSH to the director VM and run the following command:
bosh/0:~$ curl -ks https://<director ip>:9091/metrics \
    --cacert /var/vcap/jobs/director/config/metrics_server/ca.pem \
    --cert /var/vcap/jobs/director/config/metrics_server/certificate.pem \
    --key /var/vcap/jobs/director/config/metrics_server/private_key.key | grep unresponsive
# TYPE bosh_unresponsive_agents gauge
# HELP bosh_unresponsive_agents Number of unresponsive agents per deployment
bosh_unresponsive_agents{name="cf-*"} 1.0
bosh_unresponsive_agents{name="p-healthwatch2-pas-exporter-*"} 0.0
bosh_unresponsive_agents{name="bosh-health"} 0.0
bosh_unresponsive_agents{name="p-healthwatch2-*"} 0.0
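To pick out only the deployments that currently report unresponsive agents, the metrics output can be filtered further. The following is a minimal sketch, not part of the product: the function name unresponsive_deployments is hypothetical, and it assumes the output uses the text format shown above, with a single name label per sample.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with BOSH or Healthwatch).
# Reads Prometheus text-format metrics on stdin and prints
# "<deployment> <value>" for each bosh_unresponsive_agents sample above zero.
unresponsive_deployments() {
  awk '
    /^#/ { next }   # skip the HELP and TYPE comment lines
    /^bosh_unresponsive_agents[{]/ && $NF + 0 > 0 {
      # locate the name="..." label, then drop the leading name=" (6 chars)
      # and the closing quote before printing
      if (match($0, /name="[^"]*"/)) {
        print substr($0, RSTART + 6, RLENGTH - 7), $NF
      }
    }
  '
}
```

Piping the curl command from Step 1 into unresponsive_deployments, in place of grep unresponsive, would list only the affected deployments. An empty result means no deployment reported a value above zero at that moment.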
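Save the contents of the CA certificate (ca.pem), certificate (certificate.pem), and private key (private_key.key) files referenced in the command above. You will paste them into the tile in Step 2.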
Step 2. Under the Healthwatch 2 tile -> Prometheus Configuration -> Additional Scrape Config Jobs, click Add. For the TSDB Scrape job:
job_name: director_scrape
metrics_path: /metrics
scheme: https
static_configs:
- targets:
  - "<director ip>:9091"
For the TLS Config Certificate Authority and the TLS Config Certificate and Private Key, paste in the keys and certs you saved above.
Save your changes.
Step 3. Under the Healthwatch 2 tile -> Alertmanager configuration, append the following rule to the end of your existing alerts (you can change the values of the name and alert keys):
##### BEGIN CUSTOM DIRECTOR ALERTING RULES #####
- name: CustomDirector
  rules:
  - alert: UnresponsiveAgents
    expr: 'bosh_unresponsive_agents{} > 0'
    for: 3m
    annotations:
      summary: "Unresponsive agents"
      description: |
        Unresponsive agents
##### END CUSTOM DIRECTOR ALERTING RULES #####
Click Save, then apply changes.
Step 4. Optionally, access the Prometheus and Alertmanager UIs to verify that the alert is working.