Crafting effective Alerts and visualizations of data in Dashboards/Charts starts with "understanding your data shape". This article explains in detail what that means.
To understand your data shape, you need to investigate the following:
Typically, all charts default to a 2-hour time window and a ‘Line Plot’. While this is a great view of your recently received metrics, it does not clearly show the underlying behavior or "shape" of your metrics.
To get a clear view of your data shape, we suggest that you change the chart type to ‘Point Plot’ and the time window to ‘10 minutes’ in a ‘live’ view.
In a point plot, each and every point that you see is a metric in its raw form as it was received.
In a 10-minute window, under most scenarios, your points will be separate enough that you can identify the different times they were reported at on the x-axis.
Live view shows you each data point as it is ingested.
What is the reporting interval of your metrics, and does it change?
Now that we are in a 10-minute live view window with a ‘point plot’ selected, we can determine the reporting interval. The reporting interval is simply the gap between the reported metrics. Hover over each point to see its time and work out the gap between consecutive points. Keep in mind that the gap may look different if advanced functions are applied in the query, so we suggest looking at the raw metric.
For example, if your query is align(5m, sum(ts("my.metric"))), then you want to examine the raw metric, or just ts("my.metric").
In most use cases and by default, Observability expects the points to be separated by 1 minute. This means that Aria Apps expects a metric to come in each minute, or your ‘reporting interval’ to be 60 seconds. However, there can be times when metrics are separated by more than a minute, and a 10-minute live view of a point plot will reflect that.
Here are the common reporting interval scenarios:
Understanding your reporting interval helps you determine whether you will need additional data functions to craft effective alerts and visualizations.
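The gap measurement described in this section can also be sketched in a few lines of Python. This is only an illustration: the function names and the sample timestamps are hypothetical, and it assumes you have copied the epoch-second timestamps of the raw points from the chart.

```python
from statistics import median


def reporting_gaps(timestamps):
    """Return the gaps, in seconds, between consecutive point timestamps."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def typical_interval(timestamps):
    """Return the median gap, or None when fewer than two points exist."""
    gaps = reporting_gaps(timestamps)
    return median(gaps) if gaps else None


# Hypothetical epoch-second timestamps read off a 10-minute point plot.
points = [1700000000, 1700000060, 1700000120, 1700000240, 1700000300]

gaps = reporting_gaps(points)
interval = typical_interval(points)
print(gaps)      # [60, 60, 120, 60]
print(interval)  # 60.0
```

Here the typical interval is 60 seconds, and the single 120-second gap shows that one point was skipped, which is the kind of irregularity worth noting before building an alert.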
Is there a lag in your metrics reporting?
In a 10-minute live view, you will see your metrics come in in real time. Watch the right side of the chart: as the time window slides forward, you will notice new metrics come in. If your metrics have a reporting interval of 1 minute, the normal assumption is that the next metric will come in within the next minute. While this is true in most cases, it is not always so, and a delay can lead to misfiring alerts if it is not properly accounted for with functions.
As an example, let’s examine the following chart:
We can see that the gap between the points is 1 minute; however, in live view, the most recent metrics are not present. This means that there is a lag in the reporting of metrics.
For effective alerting and dashboard results, users will need to take this lag in reporting into account. Observability has a set of ‘missing data functions’ to help in this scenario.
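As a rough sketch of what "lag" means here, the snippet below compares the timestamp of the newest visible point with the current time. The function names and the example times are hypothetical, and the 1-minute default simply mirrors the reporting interval discussed earlier.

```python
from datetime import datetime, timedelta


def reporting_lag(latest_point_time, now):
    """Time elapsed between the newest visible point and the current time."""
    return now - latest_point_time


def is_lagging(latest_point_time, now, interval=timedelta(minutes=1)):
    """True when the newest point is older than one reporting interval."""
    return reporting_lag(latest_point_time, now) > interval


# Hypothetical values: the chart is viewed at 09:18, newest point is at 09:14.
now = datetime(2024, 1, 1, 9, 18)
latest = datetime(2024, 1, 1, 9, 14)

lag = reporting_lag(latest, now)
print(lag)                      # 0:04:00
print(is_lagging(latest, now))  # True
```

A lag that is consistently larger than the reporting interval is a sign that the alert query needs to account for missing recent data.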
Are the metrics backfilling?
Another aspect to note is when the metrics come in: do they report in the current minute, or at a previous point in time? A metric that comes in after a delay or "lag" is backfilled into its correct time slot. The timestamp of the metric is honored even if the metric did not arrive at that time. In the image above, if at 09:18 we ingest a metric that was created at 09:14, the point will populate the chart at 09:14.
As another example, suppose there is a reporting delay of about 10 minutes and, when the metrics make it into Observability, they cover only the first 5 minutes of that lag. In this scenario, your live metrics at the current minute would be missing.
When investigating Alerts that fired or did not fire as expected, users often view data that was populated after the fact, or "backfilled", and assume that the alert engine failed to alert properly.
Alerts do not evaluate backfilled data; alerts only evaluate data that is available at the current moment. Using the methods in this article to understand your data shape will allow you to build alerts with higher confidence that they will fire as expected.
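The difference between what an alert sees at check time and what a chart shows later can be pictured with a small model. This is a simplified sketch, not how the alert engine is implemented: the `Point` type, the `visible` helper, and the 4-minute delay are all assumptions made for illustration, and times are plain minute numbers.

```python
from typing import NamedTuple


class Point(NamedTuple):
    timestamp: int  # minute the value describes
    arrived: int    # minute the value was actually ingested
    value: float


def visible(points, window_start, window_end, as_of):
    """Points inside the window that had already arrived by `as_of`."""
    return [
        p for p in points
        if window_start <= p.timestamp <= window_end and p.arrived <= as_of
    ]


# Hypothetical metric: one point per minute, each arriving 4 minutes late.
points = [Point(t, t + 4, 1.0) for t in range(10, 15)]

# What a check at minute 14 can see for the window 10..14.
seen_by_alert = visible(points, 10, 14, as_of=14)
# What a user sees for the same window when looking back at minute 20.
seen_later = visible(points, 10, 14, as_of=20)

print(len(seen_by_alert))  # 1
print(len(seen_later))     # 5
```

The same window holds one point at check time and five points afterwards, which is why a chart reviewed later can suggest that an alert should have behaved differently.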
Are your metrics gauges, counters or delta counters?
After considering the reporting interval and any potential delays in the metrics, the other aspect of determining the data shape is to understand what the metric values represent. There are 3 main types of metrics, briefly described below:
Gauge:
A gauge is a metric where the value represents an actual measurement at a point in time, such as the memory used or the CPU utilization at a specific time. The key point with gauges is that the value can go up and down, and it usually stays within a range.
Counter:
A counter is a metric where the value is a cumulative total of previous values, so it either increases or stays the same. A counter reset occurs when the value drops to 0 and starts accumulating again from that point forward. As an example, think of system uptime: the value keeps increasing until there is a system restart, or ‘counter reset’.
Other examples would be the number of requests served or the number of errors received. With counter metrics, it is often beneficial to use functions such as rate(), ratediff() or mdiff() as they help calculate the change over time.
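To see why a change-over-time calculation is useful for counters, here is a minimal sketch that turns cumulative samples into per-second rates. It is not the implementation of rate(), ratediff(), or mdiff(); the function name is hypothetical, and treating any drop in value as a reset to 0 is an assumption of this example.

```python
def per_second_rates(samples):
    """Convert (timestamp_seconds, counter_value) samples into rates.

    Samples must be ordered by time. A drop in value is assumed to be a
    counter reset to 0, so the new value itself is taken as the increase.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order timestamps
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, increase / (t1 - t0)))
    return rates


# Hypothetical request counter sampled every 60 seconds, with one reset.
samples = [(0, 100), (60, 160), (120, 220), (180, 30), (240, 90)]

rates = per_second_rates(samples)
print(rates)  # [(60, 1.0), (120, 1.0), (180, 0.5), (240, 1.0)]
```

The raw values jump from 220 down to 30, which would look like a sharp drop on a chart, while the derived rate stays close to 1 request per second throughout.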
Delta Counter:
Delta counters differ from traditional counters in that their value represents the delta, or the change in value, over a specific time. In Aria Apps, Delta Counters bin to a minute timestamp, and writes to the same bin are treated as deltas. They are helpful for counting bursts of events, because collisions can occur if traditional gauge or counter metrics are used.
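The minute binning described above can be pictured with the following sketch. It is a simplified model only: the function names are hypothetical, and the "last write wins" behavior used to illustrate a collision is an assumption of this example rather than a statement about how the service stores gauges or counters.

```python
from collections import defaultdict


def bin_deltas(writes):
    """Sum (timestamp_seconds, delta) writes that land in the same minute."""
    bins = defaultdict(int)
    for ts, delta in writes:
        bins[ts - ts % 60] += delta  # round down to the start of the minute
    return dict(bins)


def last_write_wins(writes):
    """Assumed collision model: a later write to a bin replaces the earlier one."""
    bins = {}
    for ts, value in writes:
        bins[ts - ts % 60] = value
    return bins


# Hypothetical burst: three writes in the first minute, one in the second.
writes = [(5, 3), (20, 2), (59, 1), (61, 4)]

print(bin_deltas(writes))       # {0: 6, 60: 4}
print(last_write_wins(writes))  # {0: 1, 60: 4}
```

With summed deltas, all 6 events in the first minute are counted; under the assumed collision model, only the last value written in that minute survives.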
To learn more about Counters and Delta Counters, please see: Cumulative Counters and Delta Counters.
Summary:
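To understand your metrics data shape, you need to understand the following aspects: the reporting interval of your metrics and whether it changes, whether there is a lag in reporting, whether the metrics are backfilling, and whether the metrics are gauges, counters, or delta counters.
To get these answers, observe new metrics as they arrive in a live view.
Once you understand your data shape, see the Wavefront Query Language Reference for a full list of the functions available for crafting effective alerts and visualizations.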
If this information or current documentation does not answer your questions, then please open a Support Ticket for further assistance.