Alerting and Triaging Status of Usage Service


Article ID: 433534


Products

VMware Tanzu Platform Core, VMware Tanzu Platform - Cloud Foundry, VMware Tanzu Application Service

Issue/Introduction

Operators may want to monitor the reliability of the Usage Service so that issues are detected promptly.

This article details how to monitor the Usage Service in VMware Tanzu Application Service (TAS) for VMs.

Environment

VMware Tanzu Application Service (TAS) for VMs / Elastic Application Runtime

Resolution

The Usage Service provides HTTP endpoints and Prometheus metrics for monitoring its health and availability:

Usage Service Endpoints:

| Endpoint | What It Does |
| --- | --- |
| GET /heartbeat/failed_jobs | Returns HTTP 500 if failed jobs >= threshold (default 1) |
| GET /heartbeat/workers | Checks whether the worker is alive via check-in |
| GET /heartbeat/db | Checks database connectivity |
| GET /heartbeat/doomsday | Shows days left until purge-and-reseed is required |
| GET /usage_availability | Shows the date up to which usage data is reliable |
| GET /usage_availability/services | Shows the timestamp up to which service data has been fetched |
| GET /delayed/failed | Returns full JSON details of all failed jobs |

Interpreting the GET /usage_availability date:

  • Healthy: the date is yesterday (or at most the day before yesterday if checked around midnight, before the 02:00 rollover).

  • Broken: the date is more than two days behind today.

 

Existing Prometheus Metrics:

| Metric | What It Tracks |
| --- | --- |
| usage_service_delayed_job_failures | Count of failed delayed jobs |
| usage_service_doomsday_counter | Days left before data is irrecoverable without purge-and-reseed |
| usage_service_service_usage_event_fetcher_job_exit_code | 0 = success, 1 = failure on the last run |
| usage_service_service_usage_event_fetcher_job_processed_events | Number of events processed in the last run |
| usage_service_app_usage_event_cc_lag_seconds | Seconds between the local latest event and CC's latest event |

 

Recommended Guardrails to Set Up

1. CF Scheduler Task: Daily /usage_availability Date Staleness Check

If you have Scheduler for TAS (or even a cron-style CF task), you can create a scheduled task that calls the /usage_availability endpoint daily and alerts if the date is more than two days behind.

#!/bin/bash
# usage-availability-check.sh
# Schedule this as a daily CF task (requires GNU date and jq)
TOKEN=$(cf oauth-token)
AVAILABILITY_DATE=$(curl -s "https://app-usage.SYSTEM_DOMAIN/usage_availability" \
  -k -H "authorization: $TOKEN" | jq -r '.date')
if [ -z "$AVAILABILITY_DATE" ] || [ "$AVAILABILITY_DATE" = "null" ]; then
  echo "ALERT: Could not read the usage availability date"
  exit 1
fi
TODAY=$(date -u +%Y-%m-%d)
DAYS_BEHIND=$(( ($(date -ud "$TODAY" +%s) - $(date -ud "$AVAILABILITY_DATE" +%s)) / 86400 ))
if [ "$DAYS_BEHIND" -gt 2 ]; then
  echo "ALERT: Usage availability is $DAYS_BEHIND days behind (stuck at $AVAILABILITY_DATE)"
  # Send an alert via webhook/email/Slack as appropriate
  exit 1
fi
echo "OK: Usage availability is current ($AVAILABILITY_DATE, $DAYS_BEHIND days behind)"
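If Scheduler for TAS is installed, the script above can be wired up with the Scheduler CLI plugin roughly as follows (the app and job names are placeholders, and the script is assumed to be bundled with the app):

```shell
# Create and schedule a daily job that runs the availability check.
# "usage-checks" and "availability-check" are hypothetical names.
cf create-job usage-checks availability-check "./usage-availability-check.sh"
cf schedule-job availability-check "0 6 * * *"   # daily at 06:00 UTC
```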

What problem this catches:

The /usage_availability date is the minimum of the app rollover, task rollover, and service fetch times. If the service fetcher breaks, this date stops advancing.
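The "minimum of the three timestamps" behavior can be illustrated in plain shell (all dates here are made up):

```shell
# usage_availability reports the oldest of the three component dates, so a
# single stuck fetcher holds the whole date back. Example dates are hypothetical.
app_rollover="2024-06-10"
task_rollover="2024-06-10"
service_fetch="2024-06-07"   # a stuck service fetcher

# ISO dates sort chronologically as strings, so a lexicographic sort works here.
availability=$(printf '%s\n' "$app_rollover" "$task_rollover" "$service_fetch" | sort | head -n 1)
echo "usage_availability would report: $availability"
```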

 

2. CF Scheduler Task: Daily /heartbeat/failed_jobs Check

#!/bin/bash
# failed-jobs-check.sh
TOKEN=$(cf oauth-token)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://app-usage.SYSTEM_DOMAIN/heartbeat/failed_jobs" \
  -k -H "authorization: $TOKEN")
if [ "$HTTP_CODE" != "200" ]; then
  BODY=$(curl -s "https://app-usage.SYSTEM_DOMAIN/heartbeat/failed_jobs" \
    -k -H "authorization: $TOKEN")
  echo "ALERT: Usage service has failed jobs: $BODY"
  exit 1
fi
echo "OK: No failed jobs"

What problem this catches:

This endpoint returns HTTP 500 if any failed jobs exist (default threshold of 1). 

 

3. CF Scheduler Task: Daily /heartbeat/doomsday Check

#!/bin/bash
# doomsday-check.sh (requires jq)
TOKEN=$(cf oauth-token)
DOOMSDAY=$(curl -s "https://app-usage.SYSTEM_DOMAIN/heartbeat/doomsday" \
  -k -H "authorization: $TOKEN")
DAYS_LEFT=$(echo "$DOOMSDAY" | jq '.days_left')
SERVICE_DAYS=$(echo "$DOOMSDAY" | jq '.source_details.failed_ingestion_service_days_left')
if [ -z "$DAYS_LEFT" ] || [ "$DAYS_LEFT" = "null" ]; then
  echo "ALERT: Could not read the doomsday counter"
  exit 1
fi
if [ "$DAYS_LEFT" -lt 14 ]; then
  echo "CRITICAL: Doomsday counter at $DAYS_LEFT days! Service days left: $SERVICE_DAYS"
  echo "Full details: $DOOMSDAY"
  exit 1
elif [ "$DAYS_LEFT" -lt 30 ]; then
  echo "WARNING: Doomsday counter at $DAYS_LEFT days. Service days left: $SERVICE_DAYS"
  echo "Full details: $DOOMSDAY"
  exit 1
fi
echo "OK: Doomsday counter healthy ($DAYS_LEFT days left)"

What problem this catches:

The doomsday counter tracks how long it has been since the last successful event ingestion. CC purges old events after a configurable number of days (typically 31). If ingestion fails for that long, the events are gone forever and a purge-and-reseed is required.
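The counter's arithmetic can be sketched in shell (the 31-day retention window and all dates below are assumptions for illustration; requires GNU date):

```shell
# Days left before CC purges events that were never successfully ingested.
# The retention window and dates below are hypothetical.
retention_days=31
last_successful_ingestion="2024-06-01"
today="2024-06-10"

elapsed=$(( ( $(date -ud "$today" +%s) - $(date -ud "$last_successful_ingestion" +%s) ) / 86400 ))
days_left=$(( retention_days - elapsed ))
echo "doomsday counter: $days_left days left"
```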

 

4. Prometheus/Grafana Alerting (If Available)

If a metrics pipeline exists (Prometheus, Healthwatch, Datadog, etc.), the Usage Service already exposes metrics at the /metrics endpoint. Set up alerts on the following:

| Alert | Condition | Severity |
| --- | --- | --- |
| Service fetcher failing | usage_service_service_usage_event_fetcher_job_exit_code == 1 for > 15 min | Critical |
| Failed jobs accumulating | usage_service_delayed_job_failures > 0 | Warning |
| Failed jobs snowballing | usage_service_delayed_job_failures > 100 | Critical |
| Doomsday approaching | usage_service_doomsday_counter < 14 | Critical |
| No events processed | usage_service_service_usage_event_fetcher_job_processed_events == 0 for > 1 hour | Warning |
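If Prometheus itself is the collector, the table above translates into alerting rules roughly like the following sketch (rule names, the group name, and the durations are suggestions, not part of the product):

```yaml
groups:
  - name: usage-service
    rules:
      - alert: UsageServiceFetcherFailing
        expr: usage_service_service_usage_event_fetcher_job_exit_code == 1
        for: 15m
        labels: {severity: critical}
      - alert: UsageServiceFailedJobs
        expr: usage_service_delayed_job_failures > 0
        labels: {severity: warning}
      - alert: UsageServiceFailedJobsSnowballing
        expr: usage_service_delayed_job_failures > 100
        labels: {severity: critical}
      - alert: UsageServiceDoomsdayApproaching
        expr: usage_service_doomsday_counter < 14
        labels: {severity: critical}
      - alert: UsageServiceNoEventsProcessed
        expr: usage_service_service_usage_event_fetcher_job_processed_events == 0
        for: 1h
        labels: {severity: warning}
```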

 

What to Do Next When an Alert Fires

  1. Restart the usage-service-scheduler app.
    • The scheduler may have hit an issue and be stuck on a delayed job.
  2. Restart the usage-service-worker app.
    • The worker is responsible for running asynchronous delayed jobs and may have become stuck.
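Those two restart steps might look like this with the cf CLI (the org, space, and app names vary by foundation and are assumptions here):

```shell
# Target the org/space hosting the usage service apps; "system" and
# "app-usage-service" are typical names but not guaranteed on every foundation.
cf target -o system -s app-usage-service
cf restart usage-service-scheduler   # step 1: restart the scheduler
cf restart usage-service-worker      # step 2: restart the worker
```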

Monitor the "Failed jobs accumulating" alert, or check the /heartbeat/failed_jobs endpoint, to confirm that failures are not continuing.

If failed jobs persist and the problem cannot be resolved with the above steps, please open a support ticket.