Understanding 100% Values in 95th Percentile Metrics in Aria Operations dashboard

Article ID: 427146

Products

VCF Operations

Issue/Introduction

  • In VMware Aria Operations 8.18.x, a "95th Percentile" metric displaying a value of 100% can be ambiguous. It is not immediately clear if this number indicates a critical system overload or a perfectly healthy environment.
  • Correct interpretation requires checking whether the metric measures Resource Usage or a Performance Score. Accurate capacity planning also requires understanding the difference between a "Maximum" peak and a "95th Percentile" trend.

Resolution

1. Meaning of "100%": Performance vs. Usage

A value of 100% at the 95th Percentile does not always mean the resource is full. The meaning changes depending on the type of metric being used.

  • Performance and Health Scores: If a Cluster or Workload Performance metric shows 100%, this is the ideal state. It confirms the infrastructure is running the workload perfectly without problems.

  • Usage Metrics: If a resource usage metric (such as CPU Usage) shows 100% at the 95th Percentile, the resource was at its limit for at least 5% of the measured period. This suggests the system is constrained and may need more resources.
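The distinction above can be written as a small rule. The following is a minimal sketch only: the `MetricKind` names, the `interpret_p95` helper, and the 95% cut-off for "high" are hypothetical and chosen for illustration; they are not Aria Operations identifiers or product thresholds.

```python
from enum import Enum


class MetricKind(Enum):
    USAGE = "usage"   # higher value means closer to the resource limit
    SCORE = "score"   # higher value means healthier (performance/health score)


def interpret_p95(kind, value_pct, high_threshold=95.0):
    """Classify a 95th-percentile value according to the kind of metric.

    high_threshold is an illustrative assumption, not a product default.
    """
    if not 0 <= value_pct <= 100:
        raise ValueError("value_pct must be between 0 and 100")
    is_high = value_pct >= high_threshold
    if kind is MetricKind.SCORE:
        return "healthy" if is_high else "investigate"
    return "constrained" if is_high else "headroom available"


print(interpret_p95(MetricKind.SCORE, 100))   # healthy
print(interpret_p95(MetricKind.USAGE, 100))   # constrained
```

The same 100% reading yields opposite conclusions, which is why the metric type has to be identified first.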

2. The Difference Between Maximum and 95th Percentile

Capacity planning requires choosing the right measurement to avoid wasting money on unnecessary hardware.

Maximum (The Peak)

The Maximum value is the single highest number recorded.

  • Example: A server hits 90% CPU for 10 seconds during a scheduled backup job, but stays at 40% for the rest of the day.

  • Result: The "Maximum" shows 90%. Buying hardware based on this one short spike leads to wasted capacity.

95th Percentile (The Sustained View)

The 95th Percentile is the value at or below which 95% of the samples fall. It discards the highest 5% of values (short spikes) to show normal, sustained usage.

  • Technical Interpretation: In the backup example above, the 95th percentile would likely report ~40%. It filters out the backup spike because it was a short event, revealing that the server has enough resources for normal operations.
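The backup example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the 10-second sampling interval is hypothetical, and the nearest-rank percentile method is used here for simplicity; the exact method Aria Operations applies is not stated in this article.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the smallest sample such that at least pct percent
    of all samples are less than or equal to it (nearest-rank method)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: one CPU sample every 10 seconds for 24 hours.
SAMPLES_PER_DAY = 24 * 60 * 60 // 10    # 8640 samples
cpu_usage = [40.0] * SAMPLES_PER_DAY
cpu_usage[4000] = 90.0                  # the single 10-second backup spike

peak = max(cpu_usage)
p95 = percentile_nearest_rank(cpu_usage, 95)
print(f"Maximum: {peak}%  95th percentile: {p95}%")
# Maximum: 90.0%  95th percentile: 40.0%
```

One sample out of 8640 is far below the 5% that the percentile discards, so the spike has no effect on the 95th percentile while fully determining the Maximum.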

3. Using 95th Percentile for Right-Sizing

The 95th percentile is the standard for right-sizing because it shows the resources needed for normal operations, not just for short-term anomalies.

If a Virtual Machine (VM) shows 40% CPU Usage at the 95th Percentile over one week, the VM used 40% or less of its CPU for 95% of that week. It may have spiked higher during the remaining 5% (up to about 8.4 hours in total), but it used less than half its capacity the rest of the time. This suggests the VM is oversized and is a candidate for downsizing.
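To judge how much time a percentile leaves unaccounted for, it helps to convert the excluded share into hours. The helper below is a hypothetical illustration of that arithmetic, not a product function.

```python
def hours_above_percentile(window_hours, pct):
    """Upper bound on the total time a metric can spend above its
    pct-th percentile within a window of window_hours."""
    return window_hours * (100 - pct) / 100


week_hours = 7 * 24
print(hours_above_percentile(week_hours, 95))   # 8.4
```

For a one-week window, the 95th percentile therefore says nothing about up to 8.4 hours of activity. If those hours matter for the workload, compare the result against the Maximum before resizing.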

4. When to Use "Maximum" Values

While percentiles are good for general planning, "Maximum" values are needed to find specific failures. Ignoring the peak value can hide the cause of a problem.

  • Latency: High latency (delays) causes applications to freeze. The maximum value helps identify the exact moment the delay happened.

  • Disk Space: A disk cannot be full for even a short time without causing errors. Storage must always be sized based on the Maximum (Peak) usage.

  • Packet Drops: Any data loss on a network is a failure. The maximum value helps find these errors.
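The following sketch shows how a percentile can hide exactly the kind of event listed above. The data is invented for illustration (dropped packets per 5-minute interval over one day), and the nearest-rank percentile method is an assumption rather than the documented product behavior.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least
    pct percent of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 288 five-minute intervals, three of them with drops.
dropped = [0] * 288
dropped[100:103] = [12, 7, 3]

p95 = percentile_nearest_rank(dropped, 95)
peak = max(dropped)
worst_interval = dropped.index(peak)

print(f"95th percentile: {p95}  Maximum: {peak}  at interval {worst_interval}")
# 95th percentile: 0  Maximum: 12  at interval 100
```

The 95th percentile reports zero drops, while the Maximum both reveals the failure and points to the interval in which it occurred.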

5. Interpreting High KPI Scores (e.g., 99.6%)

Sometimes a metric value appears as 99.6%. As a Performance score, this is a very good result. It means the system worked without problems for 99.6% of the time. The missing 0.4% (just under 6 minutes over a 24-hour period) represents a very small issue that is unlikely to have significantly impacted the system.

Quick Reference: Metric Selection Guide

| Task | Recommended Metric | Reason |
|---|---|---|
| Right-Sizing (CPU / RAM) | 95th Percentile | Focuses on normal, sustained use rather than short spikes. Prevents wasting money on extra hardware. |
| Troubleshooting Slowness | Maximum | Short spikes in latency cause lag. The Maximum value reveals the worst moment. |
| Disk Space (Storage) | Maximum | A full disk causes crashes. Storage must handle the absolute highest amount of data. |
| Network Errors | Maximum | Any dropped packet is a failure. Percentiles might hide these small but critical errors. |
| CPU Ready (Wait Time) | Maximum | High "Ready" time means the VM is waiting for physical CPU and appears stalled. Even short stalls are a problem, so the peak value is important. |