Users may want to monitor the reliability of usage services to promptly detect issues.
This article details how to monitor usage services in VMware Tanzu Application Service (TAS) for VMs.
VMware Tanzu Application Service (TAS) for VMs / Elastic Application Runtime
| Endpoint | What It Does |
| --- | --- |
| GET /heartbeat/failed_jobs | Returns HTTP 500 if failed jobs >= threshold (default 1) |
| GET /heartbeat/workers | Checks whether the worker is alive via check-in |
| GET /heartbeat/db | Checks database connectivity |
| GET /heartbeat/doomsday | Shows days left until purge-and-reseed is required |
| GET /usage_availability | Shows the date up to which usage data is reliable |
| GET /usage_availability/services | Shows the timestamp up to which service data is fetched |
| GET /delayed/failed | Returns full JSON details of all failed jobs |
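The heartbeat endpoints above signal health through the HTTP status code alone, so a monitoring script only needs to map the code to a state. A minimal sketch (the state strings are illustrative, not part of the API response):

```shell
# Map a heartbeat HTTP status code to a human-readable state.
# The state names here are illustrative; the API only returns the status code.
interpret_heartbeat() {
  case "$1" in
    200) echo "healthy" ;;
    500) echo "unhealthy" ;;
    *)   echo "unexpected" ;;
  esac
}

interpret_heartbeat 200   # healthy
```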
| Metric | What It Tracks |
| --- | --- |
| usage_service_delayed_job_failures | Count of failed delayed jobs |
| usage_service_doomsday_counter | Days left before data is irrecoverable without purge-and-reseed |
| usage_service_service_usage_event_fetcher_job_exit_code | 0 = success, 1 = failure on last run |
| usage_service_service_usage_event_fetcher_job_processed_events | Number of events processed in last run |
| usage_service_app_usage_event_cc_lag_seconds | Seconds between local latest event and CC's latest event |
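These metrics are exposed in the Prometheus text format, so individual values can be extracted with standard tools even without a metrics pipeline. A minimal sketch against a made-up sample of the scraped output (the metric values shown are illustrative):

```shell
# Hypothetical sample of Prometheus-format text as scraped from /metrics.
SAMPLE='usage_service_delayed_job_failures 3
usage_service_doomsday_counter 21
usage_service_service_usage_event_fetcher_job_exit_code 0'

# Pull one metric's value by name (second whitespace-separated field).
metric_value() {
  echo "$SAMPLE" | awk -v name="$1" '$1 == name {print $2}'
}

metric_value usage_service_delayed_job_failures   # 3
```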
If you have Scheduler for TAS (or even a cron-style CF task), you can create a scheduled task that hits the /usage_availability endpoint daily and alerts if the date is more than two days behind.
```shell
#!/bin/bash
# usage-availability-check.sh
# Schedule this as a CF task daily
TOKEN=$(cf oauth-token)
AVAILABILITY_DATE=$(curl -s "https://app-usage.SYSTEM_DOMAIN/usage_availability" \
  -k -H "authorization: $TOKEN" | jq -r '.date')
TODAY=$(date -u +%Y-%m-%d)
DAYS_BEHIND=$(( ($(date -d "$TODAY" +%s) - $(date -d "$AVAILABILITY_DATE" +%s)) / 86400 ))
if [ "$DAYS_BEHIND" -gt 2 ]; then
  echo "ALERT: Usage availability is $DAYS_BEHIND days behind (stuck at $AVAILABILITY_DATE)"
  # Send alert via webhook/email/Slack as appropriate
  exit 1
fi
echo "OK: Usage availability is current ($AVAILABILITY_DATE, $DAYS_BEHIND days behind)"
```
The /usage_availability date is the min of app rollover, task rollover, and service fetch times. If the service fetcher breaks, this date stops advancing.
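Because ISO 8601 dates sort correctly as plain strings, this "minimum of three timestamps" behavior is easy to illustrate in a few lines of shell (the dates below are made-up examples):

```shell
# Made-up example timestamps for the three inputs to /usage_availability.
APP_ROLLOVER="2024-06-12"
TASK_ROLLOVER="2024-06-12"
SERVICE_FETCH="2024-06-03"   # a stuck service fetcher pins this in the past

# ISO dates sort lexically in chronological order, so the earliest sorts first.
AVAILABILITY=$(printf '%s\n' "$APP_ROLLOVER" "$TASK_ROLLOVER" "$SERVICE_FETCH" | sort | head -n 1)
echo "$AVAILABILITY"   # 2024-06-03
```

This is why a broken service fetcher freezes the reported date even while app and task data keep advancing.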
```shell
#!/bin/bash
# failed-jobs-check.sh
TOKEN=$(cf oauth-token)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://app-usage.SYSTEM_DOMAIN/heartbeat/failed_jobs" \
  -k -H "authorization: $TOKEN")
if [ "$HTTP_CODE" != "200" ]; then
  BODY=$(curl -s "https://app-usage.SYSTEM_DOMAIN/heartbeat/failed_jobs" \
    -k -H "authorization: $TOKEN")
  echo "ALERT: Usage service has failed jobs: $BODY"
  exit 1
fi
echo "OK: No failed jobs"
```
This endpoint returns HTTP 500 if any failed jobs exist (default threshold of 1).
```shell
#!/bin/bash
# doomsday-check.sh
TOKEN=$(cf oauth-token)
DOOMSDAY=$(curl -s "https://app-usage.SYSTEM_DOMAIN/heartbeat/doomsday" \
  -k -H "authorization: $TOKEN")
DAYS_LEFT=$(echo "$DOOMSDAY" | jq '.days_left')
SERVICE_DAYS=$(echo "$DOOMSDAY" | jq '.source_details.failed_ingestion_service_days_left')
if [ "$DAYS_LEFT" -lt 14 ]; then
  echo "CRITICAL: Doomsday counter at $DAYS_LEFT days! Service days left: $SERVICE_DAYS"
  echo "Full details: $DOOMSDAY"
  exit 1
elif [ "$DAYS_LEFT" -lt 30 ]; then
  echo "WARNING: Doomsday counter at $DAYS_LEFT days. Service days left: $SERVICE_DAYS"
  echo "Full details: $DOOMSDAY"
  exit 1
fi
echo "OK: Doomsday counter healthy ($DAYS_LEFT days left)"
```
The doomsday counter tracks how long since the last successful event ingestion. CC purges old events after a configurable number of days (typically 31). If ingestion fails for that long, the events are gone forever and purge-and-reseed is required.
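The arithmetic behind the counter is simple: the days remaining are the purge window minus the days since the last successful ingestion. A sketch, assuming a 31-day window (the actual value is configurable in CC):

```shell
PURGE_WINDOW_DAYS=31           # assumed CC purge window; configurable per foundation
DAYS_SINCE_LAST_INGESTION=20   # example value

DAYS_LEFT=$(( PURGE_WINDOW_DAYS - DAYS_SINCE_LAST_INGESTION ))
echo "$DAYS_LEFT"   # 11

# Once DAYS_LEFT reaches 0, the unfetched events have been purged from CC
# and a purge-and-reseed of the Usage Service is the only recovery.
```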
If a metrics pipeline exists (Prometheus, Healthwatch, Datadog, etc.), the Usage Service already exposes metrics at the /metrics endpoint. Set up alerts on the following:
| Alert | Condition | Severity |
| --- | --- | --- |
| Service fetcher failing | usage_service_service_usage_event_fetcher_job_exit_code == 1 for > 15 min | Critical |
| Failed jobs accumulating | usage_service_delayed_job_failures > 0 | Warning |
| Failed jobs snowballing | usage_service_delayed_job_failures > 100 | Critical |
| Doomsday approaching | usage_service_doomsday_counter < 14 | Critical |
| No events processed | usage_service_service_usage_event_fetcher_job_processed_events == 0 for > 1 hour | Warning |
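If no alerting system is available, the instantaneous thresholds in this table can be approximated in a polling script. A hedged sketch that evaluates already-scraped metric values (the sustained-duration conditions, such as "for > 15 min", need state across runs and are omitted here):

```shell
# Evaluate the instantaneous alert thresholds against scraped metric values.
# Args: fetcher_exit_code delayed_job_failures doomsday_days_left
evaluate_alerts() {
  local exit_code=$1 failures=$2 doomsday=$3
  if [ "$exit_code" -eq 1 ]; then
    echo "CRITICAL: service fetcher failing"
  fi
  if [ "$failures" -gt 100 ]; then
    echo "CRITICAL: failed jobs snowballing ($failures)"
  elif [ "$failures" -gt 0 ]; then
    echo "WARNING: failed jobs accumulating ($failures)"
  fi
  if [ "$doomsday" -lt 14 ]; then
    echo "CRITICAL: doomsday counter at $doomsday days"
  fi
}

evaluate_alerts 0 3 45   # WARNING: failed jobs accumulating (3)
```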
Monitor the "Failed jobs accumulating" alert, or check the /heartbeat/failed_jobs endpoint, to confirm that failures are not continuing.
If failed jobs persist and the problem cannot be resolved with the above steps, please open a support ticket.