How to monitor usage-service-scheduler metrics for improved reliability of Tanzu Usage Service

Article ID: 431731

Updated On:

Products

VMware Tanzu Platform - Cloud Foundry

Issue/Introduction

Usage Service can occasionally suffer from data reliability problems or corruption, which may require a Purge and Reseed of the usage database. This KB explains how to monitor usage-service-scheduler metrics so that issues are caught early and corruption of usage data is prevented.

Resolution

The usage-service-scheduler application running on the system exposes a worker check-in (worker_check_in) health check. The endpoint is GET /heartbeat/workers.

  • Schedule a task or configure a load balancer to monitor this endpoint:
    • If it returns 200 OK, no alert is needed.
    • If it returns 500 Internal Server Error, check the worker.
  • Monitor the "Doomsday" counter, described below.

The service calculates a "Doomsday" metric, which predicts when a "Purge and Reseed" will become mandatory due to data expiration.

  • Endpoint: GET /heartbeat/doomsday (JSON), or via /metrics.
  • Metric: usage_service_doomsday_counter
  • Alert if this value drops below 5 days. This gives you time to fix the worker or offset before data is permanently lost from Cloud Controller's retention window.
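
If you read the counter from /metrics rather than from the JSON endpoint, the threshold check could be sketched roughly as follows. This is only an illustration: it assumes /metrics returns Prometheus-style text lines of the form "name value" (optionally with a {label} block that contains no spaces), which this article does not confirm. The helper names extract_doomsday_metric and doomsday_metric_below are hypothetical, and the 5-day default comes from the recommendation above.

```shell
# Hypothetical helper: print the value of usage_service_doomsday_counter found
# in a /metrics payload passed as $1.
# ASSUMPTION: Prometheus-style text lines, "name[{labels}] value".
extract_doomsday_metric() {
  printf '%s\n' "$1" | awk '
    /^#/ { next }   # skip "# HELP" and "# TYPE" comment lines
    $1 ~ /^usage_service_doomsday_counter([{].*[}])?$/ { print $2; exit }
  '
}

# Exit 0 when the metric is below the threshold (an alert should fire),
# exit 1 when it is at or above it, exit 2 when the metric is missing.
# $1 = metrics payload, $2 = threshold in days (defaults to 5).
doomsday_metric_below() {
  value=$(extract_doomsday_metric "$1")
  [ -n "$value" ] || return 2
  awk -v v="$value" -v t="${2:-5}" 'BEGIN { exit !(v + 0 < t + 0) }'
}
```

A monitoring job could fetch the payload with curl, pass it to doomsday_metric_below, and raise an alert on exit status 0. Treating a missing metric (exit status 2) as its own condition avoids confusing "no data" with "healthy".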

Examples:

Worker health:

% curl -k https://app-usage.<SYSTEM DOMAIN>/heartbeat/workers
ok
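
For a scheduled task, the worker check can key off the HTTP status code instead of the response body. The following is a minimal sketch, not an official script: SYSTEM_DOMAIN is an assumed environment variable holding your system domain, the function names are hypothetical, and the exit codes (0, 2, 3) are an arbitrary convention chosen for illustration.

```shell
# Hypothetical helper: translate the HTTP status of GET /heartbeat/workers
# into a message and an exit status that a scheduler can act on.
classify_worker_status() {
  case "$1" in
    200) echo "OK: workers checked in"; return 0 ;;
    500) echo "ALERT: worker check-in failed, check the worker"; return 2 ;;
    *)   echo "UNKNOWN: unexpected HTTP status $1"; return 3 ;;
  esac
}

# Fetch only the status code and classify it.
# ASSUMPTION: SYSTEM_DOMAIN is set in the environment.
# -k skips TLS verification, as in the example above.
check_workers() {
  code=$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 10 \
    "https://app-usage.${SYSTEM_DOMAIN}/heartbeat/workers") || code="000"
  classify_worker_status "$code"
}
```

Calling check_workers from a scheduled job returns a non-zero exit status whenever the endpoint does not answer 200 OK, so an unreachable endpoint is reported separately from a 500 response.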

Doomsday counter:

% curl -k https://app-usage.<SYSTEM DOMAIN>/heartbeat/doomsday
{"days_left":31,"source_details":{"failed_ingestion_app_days_left":31,"failed_ingestion_service_days_left":31}}
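
The 5-day alert threshold can be applied to this JSON response with a small script. The sketch below is illustrative only: the function names are hypothetical, SYSTEM_DOMAIN is an assumed environment variable, and the sed expression assumes the top-level "days_left" field is an integer, as in the sample output above. A JSON-aware tool such as jq would be more robust if it is available.

```shell
# Hypothetical helper: extract the top-level "days_left" integer from the
# JSON body passed as $1. The leading quote in the pattern keeps it from
# matching keys such as failed_ingestion_app_days_left.
extract_days_left() {
  printf '%s\n' "$1" |
    sed -n 's/.*"days_left" *: *\(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}

# Compare days_left with a threshold ($2, defaults to 5 days).
doomsday_status() {
  days=$(extract_days_left "$1")
  threshold=${2:-5}
  if [ -z "$days" ]; then
    echo "UNKNOWN: could not read days_left"; return 3
  elif [ "$days" -lt "$threshold" ]; then
    echo "ALERT: $days days left (threshold $threshold)"; return 2
  else
    echo "OK: $days days left"; return 0
  fi
}

# Fetch the endpoint and evaluate it.
# ASSUMPTION: SYSTEM_DOMAIN is set in the environment.
check_doomsday() {
  body=$(curl -k -s --max-time 10 \
    "https://app-usage.${SYSTEM_DOMAIN}/heartbeat/doomsday") || body=""
  doomsday_status "$body"
}
```

With the sample response above, doomsday_status reports that 31 days are left and exits 0. A value of 4 or lower would produce the ALERT message and exit status 2.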