Custom metrics are lost from Metric Registrar Endpoint worker

Article ID: 298021

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

The customer has a custom application on each Diego Cell that scrapes metrics and emits a certain number of custom (dummy) metrics, which are collected by the Metric Registrar Endpoint Worker on the doppler VMs. Prometheus checks the received amount of a particular dummy metric and alerts when it falls below a specified value.

 

Custom metrics routing path:

metric_registrar_endpoint_worker -> loggr-forwarder-agent -> loggregator_agent -> Doppler -> Reverse Log Proxy -> Reverse Log Proxy Gateway -> Gorouter -> custom nozzle

In this knowledge base article, we will explore how to track the dropped counter metric whose source id is forwarder_agent (drops caused by buffer overflows) and how to identify applications with large metric endpoint responses.


Environment

Product Version: 2.7

Resolution

Cause:
The drops are occurring much more frequently on specific Doppler instances, which may indicate that the Metric Registrar Endpoint Worker on those instances owns a particularly metrics-heavy application.

Balancing is performed for Metric Registrar scraping, but it is application based (per registration). This means that if a specific application exposes an outsized number of metrics, the Endpoint Worker handling it will emit an outsized number of metrics.

As a result, the large volume of metrics scraped by that specific Endpoint Worker job was sent to the corresponding Forwarder Agent job, where the metrics were dropped because of buffer overflow.
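For reference, scraping is balanced per registration: each metrics endpoint registered through the Metric Registrar CLI plugin becomes one scrape target handled by a single Endpoint Worker. A hypothetical registration looks like the following (my-app and /metrics are placeholders):

% cf register-metrics-endpoint my-app /metrics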

 

How to track the dropped counter metric where the source id is forwarder_agent

If the customer is using Prometheus, we should be able to use a query like this:

rate(firehose_counter_event_loggregator_forwarder_agent_dropped_total[5m])
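To see which Doppler VMs the drops are concentrated on, the same counter can be broken down per instance. The bosh_job_ip label below is an assumption; use whichever instance label your firehose exporter attaches to this metric:

sum by (bosh_job_ip) (rate(firehose_counter_event_loggregator_forwarder_agent_dropped_total[5m]))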

 

As another method, if the loggr-forwarder-agent logs have been collected from the doppler VMs (for example with bosh logs), we can count the dropped messages logged per instance:

% find . -name 'loggr-forwarder-agent.stderr*' -print | xargs grep -i dropped --count |  awk -F ':' '{print $2 "\t" $1}' | sort -nr | head -n5  

34222 ./doppler.******************

8506 ./doppler.******************

964 ./doppler.******************

505 ./doppler.******************

419 ./doppler.******************

 

How to identify applications with large metric endpoint responses

We should be able to determine the metrics endpoint response sizes from the gorouter access logs (filtering for metric endpoint paths and looking at the Bytes Sent column).

Gorouter generates an access log in the following format when it receives a request:

<Request Host> - [<Start Date>] "<Request Method> <Request URL> <Request Protocol>" <Status Code> <Bytes Received> <Bytes Sent> "<Referrer>" "<User-Agent>" <Remote Address> <Backend Address> x_forwarded_for:"<X-Forwarded-For>" x_forwarded_proto:"<X-Forwarded-Proto>" vcap_request_id:<X-Vcap-Request-ID> response_time:<Response Time> gorouter_time:<Gorouter Time> app_id:<Application ID> app_index:<Application Index> x_cf_routererror:<X-Cf-RouterError> <Extra Headers>
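As a sketch, assuming the collected access logs are in a file named gorouter.access.log, the timestamp field contains no spaces, and the registered metrics path is /metrics (adjust the path to match the actual registration), the largest responses can be listed as follows. Field 9 is the Bytes Sent column and field 5 is the Request URL in the format above:

% grep '"GET /metrics' gorouter.access.log | awk '{print $9 "\t" $5 "\t" $1}' | sort -nr | head -n5

The app_id field on the matching lines identifies which application owns each endpoint.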


Resolution: Short-term mitigation

The number of metrics that the deployment is capable of handling is specific to the infrastructure. It is probably best determined experimentally by:

  • Reducing the volume of metrics (see the example below)
  • Observing whether drops are still occurring
  • If drops are still occurring, continuing to reduce the volume of metrics
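For example, if one application has been identified as emitting an outsized number of metrics, its contribution can be removed by deregistering its metrics endpoint with the Metric Registrar CLI plugin (my-app and /metrics are placeholders; flags may differ between plugin versions):

% cf deregister-metrics-endpoint my-app /metrics

After each reduction, re-run the Prometheus query or the log search above to confirm whether the Forwarder Agent drops have stopped.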