FAQ: Metrics needed for accurate monitoring and impact analysis in a dedicated dashboard in Aria Operations

Article ID: 430030

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

When creating a dedicated dashboard in Aria Operations, which metrics are recommended for accurate monitoring and impact analysis?

Environment

VMware Aria Operations 8.18.x

Resolution

The design of a dedicated dashboard for impact analysis and monitoring in VMware Aria Operations (formerly vRealize Operations) depends on the metrics used in the environment.

To gauge impact accurately, you and your PSO must distinguish between Utilization (how busy a resource is) and Contention (the impact when it is busy). On that basis, the recommended metrics below are categorized by their function in impact analysis:

    • These are your primary Key Performance Indicators (KPIs):
| Object | Metric Path | Why it matters | Threshold to Watch |
|---|---|---|---|
| Virtual Machine | `CPU Ready (%)` | The % of time the VM wanted CPU but the host couldn't provide it. | High ready time causes "lag" inside the guest OS. |
| Virtual Machine | `Memory Contention (%)` | Percentage of time the VM is waiting for memory access. | More accurate than "Usage" because it accounts for swap/ballooning friction. |
| Virtual Machine | `Disk Latency (ms)` | Total latency (Read + Write). | This is the single biggest "impact" metric for databases and heavy apps. |
| Virtual Machine | `Network Dropped Packets` | Total dropped packets (Tx + Rx). | Any value > 0 indicates a network impact/bottleneck. |
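
As an illustration of how the KPIs above could be turned into a simple impact check, here is a minimal Python sketch. The `VmKpis` record, the field names, and the `breached_kpis` helper are hypothetical and not part of Aria Operations. Only the dropped-packets rule (any value > 0) comes from this article; the other three limits are placeholder assumptions that should be tuned for your environment.

```python
from dataclasses import dataclass


@dataclass
class VmKpis:
    """One sample of the four primary KPIs for a VM (hypothetical record)."""
    name: str
    cpu_ready_pct: float
    mem_contention_pct: float
    disk_latency_ms: float
    dropped_packets: int


# Only the dropped-packets limit (> 0) is taken from the article.
# The other three values are illustrative assumptions, not vendor guidance.
THRESHOLDS = {
    "cpu_ready_pct": 5.0,
    "mem_contention_pct": 2.0,
    "disk_latency_ms": 20.0,
    "dropped_packets": 0,
}


def breached_kpis(vm, thresholds=THRESHOLDS):
    """Return the names of the KPIs whose value is strictly above its limit."""
    return [key for key, limit in thresholds.items() if getattr(vm, key) > limit]
```

For example, a VM with 7.5% CPU Ready and 3 dropped packets, but low memory contention and disk latency, would be reported with `cpu_ready_pct` and `dropped_packets` breached.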
    • These metrics identify why the impact is happening. Correlate them with the KPIs in the above table.
      • ESXi Host / Cluster Level:
        • CPU | Workload (%): Use "Workload" instead of "Usage." Workload can exceed 100% (indicating demand is higher than capacity), whereas Usage caps at 100%.
        • Memory | Workload (%): Similar to CPU; indicates if the host is overcommitted on active memory.
        • Network | Usage Rate (KBps): Aggregate throughput to identify noisy neighbors saturating the uplink.
      • Datastore Level:
        • Disk Space | Snapshot Space (GB): Crucial for impact analysis, because growing snapshots consume datastore free space and can degrade VM disk performance.
        • Datastore | Outstanding IO Requests: High outstanding IO often precedes high latency.
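
The difference between "Workload" and "Usage" described above can be pictured with a small Python sketch. It is a simplified model, assuming workload is demand divided by usable capacity and usage is the same ratio capped at 100%. The function names are hypothetical, and the exact calculation in Aria Operations may differ.

```python
def workload_pct(demand, capacity):
    """Demand as a percentage of capacity; may exceed 100 when demand outstrips capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * demand / capacity


def usage_pct(demand, capacity):
    """Same ratio, capped at 100 because a host cannot deliver more than it has."""
    return min(100.0, workload_pct(demand, capacity))
```

With a demand of 60 GHz against 48 GHz of capacity, `workload_pct` gives 125.0 while `usage_pct` gives 100.0, which is why Workload reveals unmet demand that Usage hides.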
    • For proactive impact analysis, use the following:
      • Cluster / Host:
        • Capacity Analytics | Time Remaining: Aria’s calculated projection of when resources will run out.
        • Capacity Analytics | Capacity Remaining (%): The buffer you have left before performance degrades.
        • Reclaimable Capacity | Idle VMs: Identifies resources that are allocated but not in use, which could be causing unnecessary contention for other workloads.
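
To give a feel for what a "time remaining" projection means, here is a minimal Python sketch. It is not the Aria Operations capacity analytics algorithm. It swaps in a plain least-squares linear trend over daily usage samples as an illustration, and the function name and inputs are hypothetical.

```python
def days_remaining(daily_usage, capacity):
    """Estimate days until usage reaches capacity from a linear trend.

    daily_usage: one usage sample per day, oldest first, same unit as capacity.
    Returns None when the trend is flat or decreasing (no projected exhaustion).
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    # Least-squares slope: growth in usage per day.
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    # Days until the latest sample, growing at `slope` per day, hits capacity.
    return max(0.0, (capacity - daily_usage[-1]) / slope)
```

For daily samples of 50, 52, 54 and 56 units against a capacity of 100, the trend is 2 units per day, so the sketch projects 22 days remaining.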
    • When creating the dashboard, organize your widgets by row to provide context:
      • Top Row:
        • Scoreboard Widget: Display the primary Key Performance Indicators (KPIs) (CPU Ready, Memory Contention, Disk Latency) for your critical clusters. Color code them (Red/Green) so you can see impact at a glance.
      • Middle Row: Show "Top 10 VMs by CPU Ready %" and "Top 10 VMs by Total IOPS".
      • Bottom Row: Show Datastores sized by Total Capacity and colored by Latency.
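
The ranking behind the middle-row lists is a top-N selection, which the dashboard widgets perform for you. The Python sketch below only illustrates the idea, assuming VM records are plain dictionaries with hypothetical keys such as `name` and `cpu_ready_pct`.

```python
import heapq


def top_n(vms, metric, n=10):
    """Return the n VM records with the highest value for `metric`, highest first."""
    return heapq.nlargest(n, vms, key=lambda vm: vm[metric])
```

Calling `top_n(vms, "cpu_ready_pct")` on a list of such records returns up to ten VMs ordered from the highest CPU Ready value downward, which is what the "Top 10 VMs by CPU Ready %" list shows.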

The above are only recommendations. For full customization and more detailed dashboard layouts, please engage your PSO.