We have configured an alarm for a Windows server that triggers when CPU usage breaches a threshold of 98%, with time over threshold set to 15 out of 16 minutes.
We could see that the alarm was triggered with an incorrect metric value.
For example: Total CPU Usage on Total for server is at 35.36 %. It has violated threshold of 98.00 percent
The current value is showing as 35.36% whereas the threshold level is 98%.
These alarms can be confusing, but this is working as expected. When the alarm is sent, the current (most recently collected) CPU value is written into the alarm message. The alarm itself, however, indicates that the threshold was breached for 15 of the last 16 minutes. That does not necessarily mean the threshold is breached in the current minute.
As a simplified example, imagine that the threshold is set to 98 percent and the time window is 5 of the last 6 minutes.
Then suppose we have the following sample values:
minute 1: 99
minute 2: 99
minute 3: 99
minute 4: 99
minute 5: 99
minute 6: 25
In this case the alarm would be sent at minute 6, but it would say "Total CPU usage is at 25%. It has violated the threshold of 98 percent for 5 of the last 6 minutes."
This causes some confusion, because both statements are true: the current usage is at 25%, and the threshold was violated for 5 of the last 6 minutes. It just wasn't violated in the 6th minute, so the reported value appears lower than expected.
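The behavior in this example can be reproduced with a short sketch. This is only an illustration of the idea, not the monitoring product's actual implementation: the function name first_alarm and its parameters are hypothetical, and it assumes that a sample counts as a breach when it is strictly above the threshold and that no alarm is raised until a full window of samples has been collected.

```python
from collections import deque


def first_alarm(samples, threshold, required, window):
    """Return (minute, current_value, breaches) for the first alarm, or None.

    Hypothetical helper: assumes one sample per minute, a breach when the
    value is strictly above the threshold, and no alarm before the window
    is full.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` samples
    for minute, value in enumerate(samples, start=1):
        recent.append(value)
        breaches = sum(1 for v in recent if v > threshold)
        if len(recent) == window and breaches >= required:
            # The alarm text carries the most recent sample, which may
            # itself be below the threshold.
            return minute, value, breaches
    return None


THRESHOLD, REQUIRED, WINDOW = 98, 5, 6
samples = [99, 99, 99, 99, 99, 25]  # minutes 1 to 6 from the example

alarm = first_alarm(samples, THRESHOLD, REQUIRED, WINDOW)
minute, current, breaches = alarm
message = (
    f"Total CPU usage is at {current}%. It has violated the threshold of "
    f"{THRESHOLD} percent for {breaches} of the last {WINDOW} minutes."
)
print(f"Minute {minute}: {message}")
```

Under these assumptions the alarm fires at minute 6 and reports a current value of 25, even though five of the six samples in the window were above the threshold.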
It can be helpful to read the alarm as follows:
"Total CPU usage on server X has violated the threshold of 98% for 15 of 16 minutes. The current value at alarm time is 35.36%."