Healthwatch2 keeps throwing TanzuSLOCanaryAppErrorBudget alert though Canary apps are running well

Article ID: 293772

Products

Operations Manager

Issue/Introduction

Healthwatch2 keeps throwing the TanzuSLOCanaryAppErrorBudget alert even though the canary apps are configured and running well.
[Image: canary-apps.png]

The following is the description of this alert:

Summary:
Your Error Budget for your Canary URLs is below zero

Description:
This alert fires when your error budget for your Canary URLs is below zero. If your Canary URLs are representative of other running applications, this could indicate that your end users are affected.

Recommended troubleshooting steps:
Check to see if your canary app(s) are running. Then check your foundation's networking, capacity, and VM health.

Common labels across firing alerts:
alertname: TanzuSLOCanaryAppErrorBudget
job: blackbox_exporter
prometheus_deployment: p-healthwatch2-d1383db2a98278b24ade


Users typically configure Healthwatch2 alerting rules based on the sample provided in the documentation. The default alerting rule for TanzuSLOCanaryAppErrorBudget is shown below: it takes the average of the probe_success metric over a 28-day range, compares it with 0.999, and scales the difference to an error budget expressed in minutes. If the budget is at or below zero (that is, the 28-day average is at or below 0.999), the alert fires.
      - alert: TanzuSLOCanaryAppErrorBudget
        expr: "( (avg_over_time(probe_success[28d]) - 0.999) * (28 * 24 * 60) ) <= 0"
        for: 10m
        annotations:
          summary: "Your Error Budget for your Canary URLs is below zero"
          description: |
            This alert fires when your error budget for your Canary URLs is below zero.
            If your Canary URLs are representative of other running applications, this could indicate that your end users are affected.

            Recommended troubleshooting steps:
            Check to see if your canary app(s) are running. Then check your foundation's networking, capacity, and VM health.
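The multiplication by (28 * 24 * 60) converts the gap between the measured success rate and the 0.999 target into an error budget expressed in minutes. A worked illustration with hypothetical averages:

      # 28 days contain 28 * 24 * 60 = 40320 minutes.
      # Hypothetical average probe_success of 0.998 over the window:
      #   (0.998 - 0.999) * 40320 = -40.32  -> budget below zero, the alert fires
      # Hypothetical average probe_success of 1.0 (every probe succeeded):
      #   (1.0 - 0.999) * 40320 = 40.32     -> roughly 40 minutes of budget left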
Most likely, some canary URLs became inaccessible at some point during the past days, which caused some probe_success metric values to be 0. Even after access to those failing canary URLs is restored, the 28-day average does not recover immediately, and the alert continues to fire as long as the alert expression evaluates to true.
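To confirm this, the underlying metric can be queried directly, for example from the Grafana Explore view shipped with Healthwatch2 (the job label below is taken from the firing alert's common labels; adjust it if your setup differs):

      # 28-day average; the alert keeps firing while this is at or below 0.999
      avg_over_time(probe_success{job="blackbox_exporter"}[28d])

      # Raw per-probe values; a value of 0 marks a failed probe of a canary URL
      probe_success{job="blackbox_exporter"}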

Environment

Product Version: 2.10

Resolution

Given the explanation above, it is expected that the TanzuSLOCanaryAppErrorBudget alert keeps firing for some days after access to the canary URLs is recovered. Since the expression in the sample averages over 28 days, which might not meet every user's expectation, the range can be changed to a shorter period (e.g., 7d) so that the alert clears sooner. Refer to the documentation for how to configure alerting rules on the Healthwatch2 tile Settings page. For example, the expression can be modified to calculate the average over 7 days as shown below.
expr: "( (avg_over_time(probe_success[7d]) - 0.999) * (7 * 24 * 60) ) <= 0"