"Guest OS experiencing CPU queue" false positive alert triggering

Products

VMware Aria Suite

Issue/Introduction

This article addresses a common issue where Aria Operations generates excessive "Guest OS experiencing CPU queue" alerts because the default thresholds are too sensitive.

The alert is triggered on a virtual machine even though there is no performance issue with the workload on the object. The alert occurs when the "Guest|Peak vCPU Queue within collection cycle" metric exceeds a threshold of 10 and the "CPU|Usage (MHz)" metric exceeds 250 MHz.
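The default trigger condition described above can be sketched in Python to show why a lightly loaded VM can still raise the alert. This is an illustration only, not the Aria Operations implementation; the function and constant names are hypothetical, and the thresholds are the default values quoted in this article.

```python
# Default thresholds as quoted in this article (names are hypothetical).
QUEUE_THRESHOLD = 10           # "Guest|Peak vCPU Queue within collection cycle"
CPU_USAGE_MHZ_THRESHOLD = 250  # "CPU|Usage (MHz)"


def default_condition_met(peak_vcpu_queue: float, cpu_usage_mhz: float) -> bool:
    """Return True when both metrics exceed their default thresholds."""
    return (
        peak_vcpu_queue > QUEUE_THRESHOLD
        and cpu_usage_mhz > CPU_USAGE_MHZ_THRESHOLD
    )


# A brief queue spike on a VM consuming only 300 MHz already satisfies
# the condition, which is how a false positive can occur.
spike_on_idle_vm = default_condition_met(peak_vcpu_queue=12, cpu_usage_mhz=300)
```

Because 250 MHz is a small fraction of the capacity of a typical vCPU, the CPU usage check filters out very little, so the queue threshold effectively decides whether the alert fires.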

Environment

Aria Operations 8.18.X

Resolution

As the default threshold of 10 might not be suitable for every environment, follow these steps to resolve the issue:

Increase the threshold for the "Guest|Peak vCPU Queue within collection cycle" metric according to the specific needs of your environment.
Adjust the Wait Cycle and Cancel Cycle values as required.
Adjust the symptom to trigger only when there is sizable CPU utilization. Sizable can be defined as CPU Net Run > 75%. The attached alert definition file replaces the CPU Usage > 250 MHz condition with CPU Net Run > 75%.
To import the attached alert definition, follow these steps:

1 - Configure > Alerts > Alert Definitions > Click on ... > Import

2 - Browse to the location where you stored the attached file

3 - Click on "Overwrite existing Alert Definition"

4 - Click on Import

Note: As the CPU Queue counter is known to have false positives, increase the Wait cycle to 10 minutes. This means the alert will not trigger if the condition occurs only once.

Set the Cancel cycle to 1 so that the alert does not remain active for a long period of time.

The attached alert definition file implements the above changes.
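The effect of the Wait Cycle and Cancel Cycle settings can be illustrated with a small simulation. The Python sketch below is a simplified model, not the Aria Operations implementation. It assumes that the alert becomes active once the condition has held for wait_cycles consecutive collection cycles and clears once the condition has been absent for cancel_cycles consecutive cycles. The function and parameter names are hypothetical, and the example assumes a 5-minute collection interval so that a 10-minute wait corresponds to 2 cycles.

```python
from typing import Iterable, List


def simulate_alert(
    condition_per_cycle: Iterable[bool],
    wait_cycles: int,
    cancel_cycles: int,
) -> List[bool]:
    """Return the alert state (active or not) after each collection cycle.

    Simplified model (an assumption, not the product's logic): the alert
    activates after `wait_cycles` consecutive cycles with the condition met
    and clears after `cancel_cycles` consecutive cycles without it.
    """
    active = False
    true_streak = 0
    false_streak = 0
    states = []
    for met in condition_per_cycle:
        if met:
            true_streak += 1
            false_streak = 0
        else:
            false_streak += 1
            true_streak = 0
        if not active and true_streak >= wait_cycles:
            active = True
        elif active and false_streak >= cancel_cycles:
            active = False
        states.append(active)
    return states


# One isolated spike, then a condition that persists for two cycles.
samples = [False, True, False, True, True, False]

# Assumed 5-minute collection interval: a 10-minute wait is 2 cycles.
with_wait = simulate_alert(samples, wait_cycles=2, cancel_cycles=1)
without_wait = simulate_alert(samples, wait_cycles=1, cancel_cycles=1)
```

In this model, the isolated spike in the second cycle activates the alert only when wait_cycles is 1. With wait_cycles set to 2, the alert activates only once the condition persists, and with cancel_cycles set to 1 it clears in the first cycle after the condition ends.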

This issue is resolved in VCF Operations 9.0.

Additional Information

The alert works together with the VM CPU Utilization alert and VM CPU Contention alert. Use these 3 alerts together to form an analysis.

If the high VM CPU contention alert is also triggered, follow the remediation for that alert.
If the high VM CPU consumption alert is also triggered, follow the remediation for that alert.
If neither of the above alerts is triggered, investigate why the application is creating many threads. Compare the values with other software or code that is part of the larger business application. Also compare the value with the same software in other business applications.
If the software is commercial software from an IT vendor, ask the vendor for a recommendation.
If you do not get an answer, exclude this application from this alert by using an Aria Operations policy.
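The analysis flow above can be summarized as a small decision helper. The Python sketch below only restates the guidance in this section; the function name, parameters, and returned strings are hypothetical and do not correspond to any Aria Operations API.

```python
from typing import List


def triage_steps(contention_alert: bool, utilization_alert: bool) -> List[str]:
    """Suggest next steps when the CPU queue alert has triggered.

    Inputs indicate whether the VM CPU Contention and VM CPU Utilization
    alerts are also active on the same virtual machine.
    """
    steps = []
    if contention_alert:
        steps.append("Follow the remediation for the VM CPU Contention alert.")
    if utilization_alert:
        steps.append("Follow the remediation for the VM CPU Utilization alert.")
    if not steps:
        # Neither companion alert is active: look at the application itself.
        steps.append("Investigate why the application creates many threads.")
        steps.append("If it is commercial software, ask the vendor for a recommendation.")
        steps.append("If no answer is available, exclude the application via policy.")
    return steps


# Example: only the CPU queue alert is active on the VM.
queue_only = triage_steps(contention_alert=False, utilization_alert=False)
```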

The alert tracks if the processes within Windows or Linux are queuing for CPU. The metric measures the number of threads in the processor queue. Unlike Linux, Windows excludes the threads that are running (being executed).

Consider a VM configured with 8 vCPUs. The Guest OS sees 8 threads, so it will schedule up to 8 parallel processes. If there is more demand, it has to queue the additional work. This means the queue needs to be accounted for in Guest OS sizing. Because it reports the queue, this is the primary counter for measuring Guest OS performance. It tells you whether the CPU is struggling to serve the demand.

Windows or Linux utilization may be 100%, but as long as the queue is low, the workload is running as fast as it can. Adding more vCPUs will in fact slow down performance, because there is a higher chance of context switching.

What is a healthy value?

The Windows Performance Monitor UI description is not consistent with the MSDN documentation (based on Windows Server 2016 documentation). The description shown in the Windows UI is “Processor Queue Length is the number of threads in the processor queue. Unlike the disk counters, this counter shows ready threads only, not threads that are running. There is a single queue for processor time even on computers with multiple processors. Therefore, if a computer has multiple processors, you need to divide this value by the number of processors servicing the workload. A sustained processor queue of less than 10 threads per processor is normally acceptable, dependent of the workload.”

The MSDN documentation states that a sustained processor queue of greater than 2 threads generally indicates processor congestion. The SQL Server documentation states 3 as the threshold.

Having said that, a Guest OS may show a high CPU queue without any apparent performance issue.
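To compare a measured queue against the differing guidance values quoted above, the queue can first be divided by the number of processors, as the Windows UI description suggests. The Python sketch below is illustrative only. The names are hypothetical, and applying the per-processor division and a strict greater-than comparison to all three guidance values is a simplifying assumption, since the sources quoted here do not all state whether their value is per processor.

```python
from typing import Dict

# Guidance values quoted in this article (threads in the processor queue).
GUIDANCE_THRESHOLDS = {
    "Windows UI": 10,
    "MSDN": 2,
    "SQL Server": 3,
}


def queue_per_processor(queue_length: float, processor_count: int) -> float:
    """Divide the single system-wide queue by the number of processors."""
    if processor_count <= 0:
        raise ValueError("processor_count must be positive")
    return queue_length / processor_count


def guidance_exceeded(queue_length: float, processor_count: int) -> Dict[str, bool]:
    """Report which guidance values the per-processor queue exceeds.

    Assumption: every guidance value is compared per processor using a
    strict greater-than check.
    """
    per_cpu = queue_per_processor(queue_length, processor_count)
    return {source: per_cpu > limit for source, limit in GUIDANCE_THRESHOLDS.items()}


# Example: a sustained queue of 24 threads on a VM with 8 vCPUs
# gives 3 threads per processor.
result = guidance_exceeded(queue_length=24, processor_count=8)
```

In this example, the same measurement exceeds the MSDN value but not the SQL Server or Windows UI values, which shows why a single fixed threshold may not suit every environment.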

Reference:

The metric is documented in the vSphere Metrics book.
The alert is documented in the VCF Operations Transformation book.

Attachments

Guest OS CPU Queue Alert.xml