Investigating suspected duplicate metric ingestion.

Article ID: 283043

Updated On:

Products

DX OpenExplore

Issue/Introduction

When reviewing the rate of ingestion for a specific metric or in the namespace explorer, higher than expected Points per Second (PPS) figures are seen, not aligning with the typical rate of ingestion. These PPS spikes are caused by duplicate data points.

Cause

A configuration in the metric pipeline is sending metric data with duplicate timestamps. A duplicate timestamp causes the existing data point to be overwritten on the back end, but the collector still counts every point it receives, so the higher PPS rate reflects the multiple points being sent rather than the points actually stored.

Particular metrics (not in histogram form) are being sent more frequently than the one-second granularity allows, so points within the same one-second interval overwrite each other.
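To illustrate the overwrite behaviour, the following minimal sketch (the metric names, source, and values are invented for illustration) shows how a back end that keys points by metric, source, and timestamp stores fewer points than the collector counts:

```python
# Why duplicate timestamps inflate PPS: the collector counts every point
# it receives, but the back end keys each point by (metric, source,
# timestamp), so a later duplicate overwrites the earlier one instead of
# being stored as a new point.
sent_points = [
    ("pcf.gorouter.latency.ms", "router-0", 1700000000, 12.5),
    ("pcf.gorouter.latency.ms", "router-0", 1700000000, 13.1),  # duplicate timestamp
    ("pcf.gorouter.latency.ms", "router-0", 1700000001, 11.9),
]

stored = {}
for metric, source, ts, value in sent_points:
    stored[(metric, source, ts)] = value  # overwrites on a duplicate key

print("points counted towards PPS:", len(sent_points))  # 3
print("points actually stored:", len(stored))           # 2
```

Here three points are billed against the PPS rate, but only two survive on the back end, which is why the ingestion rate looks higher than the stored data would suggest.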

Resolution

Check for duplicate points being ingested into Observability.

 

1) Use the namespace explorer to examine the list of metrics that appear to be generating a higher PPS rate than expected. For example, the trend in the attached screenshot shows the PCF namespace as one of the Top 10 namespaces.

 

2) Performing a count on the level 3 namespace highlights an anomaly in this ingestion pattern, as shown in the charts below:

 

3) To check whether the problem is caused by duplicate metrics overwriting each other, first capture a sample set of data using the API spy endpoint, focused on the problematic namespace, and store the captured metrics in a text file. See the example below:

curl -X GET --header "Authorization: Bearer <API TOKEN>" "https://<wavefront-clustername>.wavefront.com/api/spy/points?sampling=1.0&metric=metricname" >> metricname.txt

Example:
curl -X GET --header "Authorization: Bearer <API TOKEN>" "https://<wavefront-clustername>.wavefront.com/api/spy/points?sampling=1.0&metric=pcf" >> pcfmetrics.txt

Because the PPS rate for the namespace can be quite large, cancel the collection after approximately ten seconds to prevent the file from becoming too large. Alternatively, curl's --max-time option (for example, --max-time 10) ends the capture automatically.

 

4) Edit the file to ensure the final line is not incomplete. Remove any partially written final metric data entry so the script has only complete metric data points to work with.
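This cleanup can also be scripted. The helper below is a hypothetical sketch, not part of the product: it assumes a capture that was interrupted mid-line ends without a trailing newline, and drops that partial entry.

```python
from pathlib import Path

def trim_incomplete_last_line(path: str) -> None:
    """Drop a possibly truncated final line from a spy capture file.

    Assumption: complete entries are newline-terminated, so a file whose
    text does not end with a newline was cut off mid-line.
    """
    text = Path(path).read_text()
    if text and not text.endswith("\n"):
        # Keep only the complete, newline-terminated lines.
        Path(path).write_text(text[: text.rfind("\n") + 1])
```

If the capture already ends on a complete line, the file is left untouched.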

5) Run the script spy_duplicates.py (see Internal Note), giving the sample data filename as the input and another filename to send the results to.

For example:

python3 spy_duplicates.py -i pcfmetrics.txt duplicate-data-points.txt
------ START ------

Number of timestamps:14
Total metric points:41516
Total duplicate metric points:27307
Percentage of dups out of total:39.6771428156285
timestamp found: 14
('pcf.gorouter.route_lookup_time.ns', 9165)
('pcf.gorouter.latency.ms', 8850)
('pcf.gorouter.latency.route-emitter.ms', 8815)
('pcf.container.rep.absolute_usage.nanoseconds', 800)
('pcf.container.rep.disk_bytes_quota', 792)
('pcf.container.rep.container_age.nanoseconds', 782)

As can be seen from the results above, approximately 39% of the data points from this particular namespace show up as duplicate entries. This example was due to a known problem with the PCF nozzle, which has since been resolved.
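The script itself is internal, but the duplicate check it performs can be sketched as follows. This is a simplified illustration that assumes a whitespace-separated capture format of `<metric> <value> <timestamp> source=<host>` (the real spy output includes quoting and point tags, so the parsing would need adjusting), and the sample lines are invented:

```python
from collections import Counter

# Invented sample capture: two points share a (metric, timestamp, source) key.
sample = """\
pcf.gorouter.latency.ms 12.5 1700000000 source=router-0
pcf.gorouter.latency.ms 13.1 1700000000 source=router-0
pcf.gorouter.latency.ms 11.9 1700000001 source=router-0
""".splitlines()

seen = Counter()
dup_per_metric = Counter()
for line in sample:
    parts = line.split()
    if len(parts) < 4:
        continue  # skip incomplete lines
    metric, _value, ts, source = parts[0], parts[1], parts[2], parts[3]
    key = (metric, ts, source)
    if seen[key]:
        dup_per_metric[metric] += 1  # same key seen before: a duplicate
    seen[key] += 1

total = sum(seen.values())
dups = sum(dup_per_metric.values())
print(f"Total metric points: {total}")
print(f"Total duplicate metric points: {dups}")
print(f"Percentage of dups out of total: {100 * dups / total:.1f}")
for metric, count in dup_per_metric.most_common():
    print((metric, count))
```

Any point repeating a (metric, timestamp, source) combination already seen is counted as a duplicate, and the per-metric totals identify which metrics contribute most to the overwriting.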

Note: The preceding log excerpts/messages are only examples. Date, time, and environmental variables may vary depending on your environment.

Additional Information