Alerting and Triaging Status of Usage Service


Article ID: 433534


Products

VMware Tanzu Platform Core, VMware Tanzu Platform - Cloud Foundry, VMware Tanzu Application Service

Issue/Introduction

Operators may want to monitor the reliability of the Usage Service so that issues are detected promptly.

This article details how to monitor the Usage Service in VMware Tanzu Application Service (TAS) for VMs.

Environment

VMware Tanzu Application Service (TAS) for VMs / Elastic Application Runtime

Resolution

The Usage Service provides HTTP endpoints and Prometheus metrics for monitoring its health and availability:

Usage Service Endpoints:

| Endpoint | What It Does |
| --- | --- |
| GET /heartbeat/failed_jobs | Returns HTTP 500 if failed jobs >= threshold (default 1) |
| GET /heartbeat/workers | Checks whether the worker is alive via check-in |
| GET /heartbeat/db | Checks database connectivity |
| GET /heartbeat/doomsday | Shows days left until purge-and-reseed is required |
| GET /usage_availability | Shows the date up to which usage data is reliable |
| GET /usage_availability/services | Shows the timestamp up to which service data has been fetched |
| GET /delayed/failed | Returns full JSON details of all failed jobs |

Interpreting the GET /usage_availability date:

  • Healthy: the date is yesterday (or at most the day before yesterday if checked around midnight, before the 02:00 rollover).

  • Broken: the date is more than two days behind today.

 

Existing Prometheus Metrics:

| Metric | What It Tracks |
| --- | --- |
| usage_service_delayed_job_failures | Count of failed delayed jobs |
| usage_service_doomsday_counter | Days left before data is irrecoverable without purge-and-reseed |
| usage_service_service_usage_event_fetcher_job_exit_code | 0 = success, 1 = failure on the last run |
| usage_service_service_usage_event_fetcher_job_processed_events | Number of events processed in the last run |
| usage_service_app_usage_event_cc_lag_seconds | Seconds between the local latest event and CC's latest event |

 

Recommended Guardrails to Set Up

1. CF Scheduler Task: Daily /usage_availability Date Staleness Check

If you have Scheduler for TAS (or even a cron-style CF task), you can create a scheduled task that calls the /usage_availability endpoint daily and alerts if the date is more than two days behind.

#!/bin/bash
# usage-availability-check.sh
# Schedule this as a daily CF task (requires GNU date and jq)
TOKEN=$(cf oauth-token)
AVAILABILITY_DATE=$(curl -s "https://app-usage.SYSTEM_DOMAIN/usage_availability" \
  -k -H "authorization: $TOKEN" | jq -r '.date')
if [ -z "$AVAILABILITY_DATE" ] || [ "$AVAILABILITY_DATE" = "null" ]; then
  echo "ALERT: Could not read the usage availability date"
  exit 1
fi
TODAY=$(date -u +%Y-%m-%d)
DAYS_BEHIND=$(( ($(date -ud "$TODAY" +%s) - $(date -ud "$AVAILABILITY_DATE" +%s)) / 86400 ))
if [ "$DAYS_BEHIND" -gt 2 ]; then
  echo "ALERT: Usage availability is $DAYS_BEHIND days behind (stuck at $AVAILABILITY_DATE)"
  # Send an alert via webhook/email/Slack as appropriate
  exit 1
fi
echo "OK: Usage availability is current ($AVAILABILITY_DATE, $DAYS_BEHIND days behind)"
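If Scheduler for TAS is installed, the script above can be wired up with the Scheduler CLI plugin roughly as follows (the app and job names are placeholders, and the script is assumed to be bundled with the app):

```shell
# Create and schedule a daily job that runs the availability check.
# "usage-checks" and "availability-check" are hypothetical names.
cf create-job usage-checks availability-check "./usage-availability-check.sh"
cf schedule-job availability-check "0 6 * * *"   # daily at 06:00 UTC
```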

What problem this catches:

The /usage_availability date is the minimum of the app rollover, task rollover, and service fetch times. If the service fetcher breaks, this date stops advancing.
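The "minimum of the three timestamps" behavior can be illustrated in plain shell (all dates here are made up):

```shell
# usage_availability reports the oldest of the three component dates, so a
# single stuck fetcher holds the whole date back. Example dates are hypothetical.
app_rollover="2024-06-10"
task_rollover="2024-06-10"
service_fetch="2024-06-07"   # a stuck service fetcher

# ISO dates sort chronologically as strings, so a lexicographic sort works here.
availability=$(printf '%s\n' "$app_rollover" "$task_rollover" "$service_fetch" | sort | head -n 1)
echo "usage_availability would report: $availability"
```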

 

2. CF Scheduler Task: Daily /heartbeat/failed_jobs Check

#!/bin/bash
# failed-jobs-check.sh
TOKEN=$(cf oauth-token)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://app-usage.SYSTEM_DOMAIN/heartbeat/failed_jobs" \
  -k -H "authorization: $TOKEN")
if [ "$HTTP_CODE" != "200" ]; then
  BODY=$(curl -s "https://app-usage.SYSTEM_DOMAIN/heartbeat/failed_jobs" \
    -k -H "authorization: $TOKEN")
  echo "ALERT: Usage service has failed jobs: $BODY"
  exit 1
fi
echo "OK: No failed jobs"

What problem this catches:

This endpoint returns HTTP 500 if any failed jobs exist (default threshold of 1). 

 

3. CF Scheduler Task: Daily /heartbeat/doomsday Check

#!/bin/bash
# doomsday-check.sh (requires jq)
TOKEN=$(cf oauth-token)
DOOMSDAY=$(curl -s "https://app-usage.SYSTEM_DOMAIN/heartbeat/doomsday" \
  -k -H "authorization: $TOKEN")
DAYS_LEFT=$(echo "$DOOMSDAY" | jq '.days_left')
SERVICE_DAYS=$(echo "$DOOMSDAY" | jq '.source_details.failed_ingestion_service_days_left')
if [ -z "$DAYS_LEFT" ] || [ "$DAYS_LEFT" = "null" ]; then
  echo "ALERT: Could not read the doomsday counter"
  exit 1
fi
if [ "$DAYS_LEFT" -lt 14 ]; then
  echo "CRITICAL: Doomsday counter at $DAYS_LEFT days! Service days left: $SERVICE_DAYS"
  echo "Full details: $DOOMSDAY"
  exit 1
elif [ "$DAYS_LEFT" -lt 30 ]; then
  echo "WARNING: Doomsday counter at $DAYS_LEFT days. Service days left: $SERVICE_DAYS"
  echo "Full details: $DOOMSDAY"
  exit 1
fi
echo "OK: Doomsday counter healthy ($DAYS_LEFT days left)"

What problem this catches:

The doomsday counter tracks how long it has been since the last successful event ingestion. CC purges old events after a configurable number of days (typically 31). If ingestion fails for that long, the events are gone forever and a purge-and-reseed is required.
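The counter's arithmetic can be sketched in shell (the 31-day retention window and all dates below are assumptions for illustration; requires GNU date):

```shell
# Days left before CC purges events that were never successfully ingested.
# The retention window and dates below are hypothetical.
retention_days=31
last_successful_ingestion="2024-06-01"
today="2024-06-10"

elapsed=$(( ( $(date -ud "$today" +%s) - $(date -ud "$last_successful_ingestion" +%s) ) / 86400 ))
days_left=$(( retention_days - elapsed ))
echo "doomsday counter: $days_left days left"
```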

 

4. Prometheus/Grafana Alerting (If Available)

If a metrics pipeline exists (Prometheus, Healthwatch, Datadog, etc.), the Usage Service already exposes metrics at the /metrics endpoint. Set up alerts on the following:

| Alert | Condition | Severity |
| --- | --- | --- |
| Service fetcher failing | usage_service_service_usage_event_fetcher_job_exit_code == 1 for > 15 min | Critical |
| Failed jobs accumulating | usage_service_delayed_job_failures > 0 | Warning |
| Failed jobs snowballing | usage_service_delayed_job_failures > 100 | Critical |
| Doomsday approaching | usage_service_doomsday_counter < 14 | Critical |
| No events processed | usage_service_service_usage_event_fetcher_job_processed_events == 0 for > 1 hour | Warning |
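If Prometheus itself is the collector, the table above translates into alerting rules roughly like the following sketch (rule names, the group name, and the durations are suggestions, not part of the product):

```yaml
groups:
  - name: usage-service
    rules:
      - alert: UsageServiceFetcherFailing
        expr: usage_service_service_usage_event_fetcher_job_exit_code == 1
        for: 15m
        labels: {severity: critical}
      - alert: UsageServiceFailedJobs
        expr: usage_service_delayed_job_failures > 0
        labels: {severity: warning}
      - alert: UsageServiceFailedJobsSnowballing
        expr: usage_service_delayed_job_failures > 100
        labels: {severity: critical}
      - alert: UsageServiceDoomsdayApproaching
        expr: usage_service_doomsday_counter < 14
        labels: {severity: critical}
      - alert: UsageServiceNoEventsProcessed
        expr: usage_service_service_usage_event_fetcher_job_processed_events == 0
        for: 1h
        labels: {severity: warning}
```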

 

What to Do Next When an Alert Fires

  1. Restart the usage-service-scheduler app.
    • The scheduler may have hit an issue and be stuck on a delayed job.
  2. Restart the usage-service-worker app.
    • The worker is responsible for running asynchronous delayed jobs and may have become stuck.
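Those two restart steps might look like this with the cf CLI (the org, space, and app names vary by foundation and are assumptions here):

```shell
# Target the org/space hosting the usage service apps; "system" and
# "app-usage-service" are typical names but not guaranteed on every foundation.
cf target -o system -s app-usage-service
cf restart usage-service-scheduler   # step 1: restart the scheduler
cf restart usage-service-worker      # step 2: restart the worker
```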

Monitor the "Failed jobs accumulating" alert, or check the /heartbeat/failed_jobs endpoint, to confirm that failures are not continuing.

If failed jobs persist and the problem cannot be resolved with the above steps, please open a support ticket.