This article addresses a common issue where Aria Operations generates excessive "Guest OS experiencing CPU queue" alerts because the default threshold is too sensitive for some environments.
An alert is triggered on a virtual machine, but there is no performance issue with the workload on the object. The alert occurs when the "Guest|Peak vCPU Queue within collection cycle" metric exceeds a threshold of 10, and the "CPU|Usage (MHz)" metric goes beyond 250.
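The trigger condition described above can be sketched as a simple check. This is a minimal illustration, not the product's implementation: the function and constant names are hypothetical, and only the two threshold values (10 and 250) come from the alert description.

```python
# Thresholds taken from the alert description; names are illustrative only.
QUEUE_THRESHOLD = 10            # Guest|Peak vCPU Queue within collection cycle
CPU_USAGE_MHZ_THRESHOLD = 250   # CPU|Usage (MHz)


def symptom_active(peak_vcpu_queue: float, cpu_usage_mhz: float) -> bool:
    """Return True only when both metrics exceed their thresholds."""
    return (
        peak_vcpu_queue > QUEUE_THRESHOLD
        and cpu_usage_mhz > CPU_USAGE_MHZ_THRESHOLD
    )


if __name__ == "__main__":
    print(symptom_active(12, 300))  # both thresholds exceeded
    print(symptom_active(12, 100))  # queue is high, but the VM is nearly idle
```

Because both conditions must hold, a nearly idle VM with a brief queue spike does not satisfy the check.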
Aria Operations 8.18.X
As the default threshold of 10 might not be suitable for every environment, follow these steps to resolve the issue:
1 - Go to Configure > Alerts > Alert Definitions, click the ellipsis (...), and select Import
2 - Browse to the location where you stored the attached file
3 - Click on "Overwrite existing Alert Definition"
4 - Click on Import
Note: As the CPU Queue counter is known to have false positives, increase the Wait cycle to 10 minutes. This means the alert will not trigger if the condition occurs only once.
The Cancel cycle is set to 1 so that the alert does not linger long after the condition clears.
The attached alert definition file implements the above changes.
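To illustrate how a Wait cycle and a Cancel cycle dampen false positives, the following sketch simulates an alert over a series of collection cycles. It is a simplified model, not Aria Operations code: the function name is hypothetical, and it assumes a 5-minute collection cycle so that a 10-minute wait corresponds to 2 consecutive cycles.

```python
def alert_states(symptom_per_cycle, wait_cycles, cancel_cycles):
    """Return the alert state (True = active) after each collection cycle.

    The alert activates after the symptom holds for `wait_cycles`
    consecutive cycles and clears after it is absent for
    `cancel_cycles` consecutive cycles.
    """
    active = False
    streak_true = 0
    streak_false = 0
    states = []
    for symptom in symptom_per_cycle:
        if symptom:
            streak_true += 1
            streak_false = 0
        else:
            streak_false += 1
            streak_true = 0
        if not active and streak_true >= wait_cycles:
            active = True
        elif active and streak_false >= cancel_cycles:
            active = False
        states.append(active)
    return states


if __name__ == "__main__":
    # Assumption: 5-minute cycles, so a 10-minute wait is 2 cycles.
    samples = [True, False, True, True, False]
    print(alert_states(samples, wait_cycles=2, cancel_cycles=1))
```

In this model the isolated spike in the first cycle never raises the alert, the sustained condition in cycles three and four does, and the alert clears in the first cycle after the condition ends.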
The alert works together with the VM CPU Utilization alert and VM CPU Contention alert. Use these 3 alerts together to form an analysis.
The alert tracks whether the processes within Windows or Linux are queuing for CPU. The metric measures the number of threads in the processor queue. Unlike Linux, Windows excludes the threads that are running (being executed).
Consider a VM configured with 8 vCPUs. The Guest OS sees 8 threads, so it will schedule up to 8 parallel processes. If there is more demand, it has to queue the excess. This means the queue needs to be accounted for in Guest OS sizing. Because it reports the queue, this is the primary counter for measuring Guest OS performance. It tells whether the CPU is struggling to serve the demand.
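The 8 vCPU example can be expressed as simple arithmetic. The sketch below is a deliberately simplified model with a hypothetical function name: it assumes every vCPU can run exactly one thread at a time and counts only waiting threads, which matches the Windows-style count that excludes running threads.

```python
def queued_threads(runnable_threads: int, vcpus: int) -> int:
    """Threads left waiting once every vCPU is busy (simplified model)."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return max(0, runnable_threads - vcpus)


if __name__ == "__main__":
    print(queued_threads(runnable_threads=6, vcpus=8))   # demand fits, no queue
    print(queued_threads(runnable_threads=11, vcpus=8))  # three threads wait
```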
Windows or Linux utilization may be 100%, but as long as the queue is low, the workload is running as fast as it can. Adding more vCPUs will in fact slow down performance, as there is a higher chance of context switching.
What is a healthy value?
The Windows Performance Monitor UI description is not consistent with the MSDN documentation (based on the Windows Server 2016 documentation). The description shown in the Windows UI is “Processor Queue Length is the number of threads in the processor queue. Unlike the disk counters, this counter shows ready threads only, not threads that are running. There is a single queue for processor time even on computers with multiple processors. Therefore, if a computer has multiple processors, you need to divide this value by the number of processors servicing the workload. A sustained processor queue of less than 10 threads per processor is normally acceptable, dependent of the workload.”
The MSDN documentation states that a sustained processor queue of greater than 2 threads generally indicates processor congestion. The SQL Server documentation states 3 as the threshold.
That said, a Guest OS may show a high CPU queue without any apparent performance issue.
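To see how the three guideline values differ in practice, the sketch below divides a raw queue length by the processor count, as the Windows UI description recommends, and reports which guidelines the result exceeds. The names are hypothetical. Two simplifications are assumptions rather than statements from the sources: all three values are compared against the per-processor figure, and a strict greater-than comparison is used for each.

```python
# Guideline values quoted in this article; the dictionary is illustrative.
GUIDELINES = {
    "Windows Performance Monitor UI": 10,
    "MSDN": 2,
    "SQL Server": 3,
}


def queue_per_processor(queue_length: float, processors: int) -> float:
    """Normalize the single system-wide queue by the processor count."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return queue_length / processors


def exceeded_guidelines(queue_length: float, processors: int) -> list:
    """Return the names of the guidelines the per-processor queue exceeds."""
    per_cpu = queue_per_processor(queue_length, processors)
    return [name for name, limit in GUIDELINES.items() if per_cpu > limit]


if __name__ == "__main__":
    # A queue of 24 on 8 processors is 3 threads per processor.
    print(exceeded_guidelines(24, 8))
    # A queue of 96 on 8 processors is 12 threads per processor.
    print(exceeded_guidelines(96, 8))
```

The same sustained queue can therefore look congested under one guideline and acceptable under another, which is why the threshold may need tuning per environment.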
Reference: