Troubleshooting a virtual machine that has stopped responding: VMM and Guest CPU usage comparison


Article ID: 310440


Products

VMware vSphere ESXi

Issue/Introduction

Virtual machines depend on available host resources (CPU, Memory), and the guest operating system consumes those resources. A problem with resource availability or scheduling inside or outside the virtual machine may cause it to become unresponsive.

This article provides steps for using CPU performance metrics to determine whether a guest operating system is actually running, whether the virtual machine monitor (VMM) is running, or whether there is scheduling contention.

Note: This article is part of a series. For more information, see the parent article Troubleshooting a virtual machine that has stopped responding (1007819).

Environment

VMware vSphere ESXi 6.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Resolution

Four virtual machine CPU performance metrics can be used together to gain insight into the responsiveness of a virtual machine or its Guest OS:

Run - Amount of time the virtual machine is consuming CPU resources.
Wait - Amount of time the virtual machine is waiting for a VMkernel resource.
Ready - Amount of time the virtual machine was ready to run, waiting in a queue to be scheduled.
Co-Stop - Amount of time an SMP virtual machine was ready to run, but incurred delay due to co-vCPU scheduling contention.

These performance metrics can be reviewed using the Performance tab in the vSphere Client or using the esxtop or resxtop command-line utilities. Choose the most appropriate method for your environment.

Reviewing performance metrics using the vSphere Client

Connect to vCenter Server or an ESX/ESXi host using the vSphere Client.
Select the target virtual machine in the inventory.
Click the Performance tab.
Click Chart Options to customize the performance chart.
Under the CPU heading, select Real-time.
Under the Chart Type heading, select Line Graph.
Under the Objects list, select the virtual machine by name.
Under the Counters list, select Co-stop, Run, Ready, and Wait.
Optionally, save the chart settings to make re-use easier.
Click OK.
Make note of the four metrics displayed. Each is measured in milliseconds.
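
The vSphere Client reports these counters as millisecond summations per sample interval, whereas esxtop reports percentages. The following minimal sketch converts one form to the other so the two views can be compared. It assumes the real-time chart's 20-second sample interval; the function name is illustrative and is not part of any VMware tool.

```python
REALTIME_INTERVAL_S = 20  # assumption: sample interval of the real-time chart


def summation_ms_to_percent(value_ms: float,
                            interval_s: float = REALTIME_INTERVAL_S) -> float:
    """Convert a CPU summation value (milliseconds) to a percentage of the interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # The interval is given in seconds, so scale it to milliseconds first.
    return value_ms * 100.0 / (interval_s * 1000.0)


# Example: 1000 ms of Ready time in one 20-second sample.
print(summation_ms_to_percent(1000))
```

When the chart object is the whole virtual machine rather than a single vCPU, the summation covers all vCPUs, so the converted value may exceed 100%.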

Reviewing performance metrics using esxtop or resxtop

For more information on using esxtop or resxtop, see the Performance Monitoring Utilities: resxtop and esxtop article from techdocs.broadcom.com.

Identify the host on which the unresponsive virtual machine is running:
1. Open the VMware vSphere Client and connect to your VMware vCenter Server or VirtualCenter server.
2. Select the virtual machine that is not responding.
3. Click the Summary tab, and identify the Host: value, indicating the host that has the running virtual machine registered to it.
Open a console session to the ESX/ESXi host where the virtual machine is running, or to the vMA, or to another location where the VMware Command-Line Interface (vCLI) is installed:
- For VMware ESX, log in to the service console using SSH or directly at its terminal. For more information, see Enable ESXi Shell and SSH Access with the Direct Console User Interface article from techdocs.broadcom.com.
Start the esxtop or resxtop command:
- esxtop
- resxtop --server HostNameOrIPAddress [--username root]
Press c on your keyboard to display the CPU panel.
Press V (uppercase) on your keyboard to display only virtual machines.
Identify the virtual machine by its Name or World ID.
Press f on your keyboard to change the visible fields. Ensure that the CPU State Times are visible:

ID   GID  NAME    NWLD  %USED  %RUN  %SYS  %WAIT   %RDY  %IDLE   %OVRLP  %CSTP  %MLMTD  %SWPWT
186  186  VMName  4     2.11   2.08  0.00  397.64  0.25  197.71  0.20    0.00   0.00    0.00

Note: The esxtop fields in ESXi 5.0 include an additional field called %VMWAIT in the CPU view. VMWAIT is only applicable to the vCPU Worlds of a virtual machine. VMWAIT doesn't sum up the idle time. It only explains how much a virtual machine is in the blocked state.
Make note of the four metrics displayed. Each is measured as a percentage of time: %RUN, %WAIT, %RDY, %CSTP.
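
If the values are copied out of the esxtop screen for later comparison, a small helper can pair each value with its field name. The sketch below is illustrative only. It assumes the header and the row are whitespace-separated and that the virtual machine name contains no spaces; `parse_esxtop_cpu_row` is a hypothetical helper, not part of esxtop.

```python
def parse_esxtop_cpu_row(header: str, row: str) -> dict:
    """Pair esxtop CPU panel field names with the values from one row."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("header and row have a different number of fields")
    parsed = {}
    for name, value in zip(names, values):
        if name.startswith("%"):
            parsed[name] = float(value)   # state times are percentages
        elif name == "NAME":
            parsed[name] = value          # assumption: no spaces in the name
        else:
            parsed[name] = int(value)     # ID, GID, NWLD
    return parsed


header = "ID GID NAME NWLD %USED %RUN %SYS %WAIT %RDY %IDLE %OVRLP %CSTP %MLMTD %SWPWT"
row = "186 186 VMName 4 2.11 2.08 0.00 397.64 0.25 197.71 0.20 0.00 0.00 0.00"
sample = parse_esxtop_cpu_row(header, row)
print({key: sample[key] for key in ("%RUN", "%WAIT", "%RDY", "%CSTP")})
```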

Interpreting CPU performance metrics

Run, %RUN:

This value represents the percentage of absolute time the virtual machine was running on the system.
If the virtual machine is unresponsive, %RUN may indicate that the guest operating system is busy conducting an operation.
When %RUN is near zero and the virtual machine is unresponsive, it means that the virtual machine is idle, blocked on an operation, or is not scheduled due to resource contention. Look at other values (%WAIT, %RDY, and %CSTP) to identify resource contention.
When %RUN is near the number of vCPUs × 100%, it means that all vCPUs in the virtual machine are busy. This is an indicator that the guest operating system may be stuck in an operational loop. To investigate this issue further, you may need to engage the appropriate operating system vendor for assistance in identifying why the guest operating system is using all of the CPU resources.
If you have engaged the guest operating system vendor, and they have determined that the issue is caused by the VMware Tools or the virtual machine hardware, it may be pertinent to suspend the virtual machine to collect additional diagnostic information.
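
The comparison of %RUN against the number of vCPUs can be sketched as follows. The 90% and 5% thresholds and the function names are assumptions chosen for illustration; this article does not define exact cut-off values.

```python
def run_utilization(run_pct: float, num_vcpus: int) -> float:
    """Return %RUN as a fraction of its maximum (number of vCPUs x 100%)."""
    if num_vcpus < 1:
        raise ValueError("num_vcpus must be at least 1")
    return run_pct / (num_vcpus * 100.0)


def describe_run(run_pct: float, num_vcpus: int,
                 busy: float = 0.9, idle: float = 0.05) -> str:
    """Classify %RUN; the busy and idle thresholds are hypothetical."""
    utilization = run_utilization(run_pct, num_vcpus)
    if utilization >= busy:
        return "all vCPUs busy"
    if utilization <= idle:
        return "near zero"
    return "partially busy"


# Example: a virtual machine with 4 vCPUs reporting %RUN of 392.
print(describe_run(392.0, 4))
```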

Wait, %WAIT:

This value represents the percentage of time the virtual machine was waiting for some VMkernel activity to complete (such as I/O) before it can continue.
If the virtual machine is unresponsive and the %WAIT value is proportionally higher than %RUN, %RDY, and %CSTP, then it can indicate that the world is waiting for a VMkernel operation to complete.
You may observe that the %SYS is proportionally higher than %RUN. %SYS represents the percentage of time spent by system services on behalf of the virtual machine.
A high %WAIT value can be a result of a poorly performing storage device where the virtual machine is residing. Storage latency and timeouts may trigger these symptoms across multiple virtual machines residing on the same LUN, volume, or array, depending on the scale of the storage performance issue.
A high %WAIT value can also be triggered by latency to any device in the virtual machine configuration. This can include but is not limited to serial pass-through devices, parallel pass-through devices, and USB devices. If the device suddenly stops functioning or responding, it can result in these symptoms. A common cause of a high %WAIT value is an ISO file that was accidentally left mounted in the virtual machine and has since been deleted or moved to another location. For more information, see Deleting a datastore from the Datastore inventory results in the error: device or resource busy (1015791).
If there does not appear to be any backing storage or networking infrastructure issue, it may be pertinent to crash the virtual machine to collect additional diagnostic information.

Ready, %RDY:

This value represents the percentage of time that the virtual machine is ready to execute commands, but has not yet been scheduled for CPU time due to contention with other virtual machines.
Compare against the Max-Limited, %MLMTD value. This represents the amount of time that the virtual machine was ready to execute, but has not been scheduled for CPU time because the VMkernel deliberately constrained it.
If the virtual machine is unresponsive or very slow and %MLMTD is low, it may indicate that the ESX host has limited CPU time to schedule for this virtual machine.
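
The comparison between %RDY and %MLMTD can be sketched as a simple subtraction. This assumes, as the esxtop documentation describes, that %MLMTD time is included in %RDY time. The function name is illustrative.

```python
def split_ready(rdy_pct: float, mlmtd_pct: float) -> dict:
    """Split %RDY into limit-induced time and time attributed to contention."""
    # Assumption: %RDY already includes %MLMTD, so the remainder is contention.
    contention = max(rdy_pct - mlmtd_pct, 0.0)
    return {"limit_induced": mlmtd_pct, "contention": contention}


# Example: %RDY of 12.5 with %MLMTD of 10.0 leaves 2.5 attributed to contention.
print(split_ready(12.5, 10.0))
```

A result dominated by limit_induced points to a CPU limit on the virtual machine or its resource pool. A result dominated by contention points to the host having limited CPU time to schedule.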

Co-stop, %CSTP:

This value represents the percentage of time that the virtual machine is ready to execute commands but is waiting for multiple CPUs to become available, because the virtual machine is configured to use multiple vCPUs.
If the virtual machine is unresponsive and %CSTP is proportionally high compared to %RUN, it may indicate that the ESX host has limited CPU resources available to simultaneously co-schedule all vCPUs in this virtual machine.
Review the usage of virtual machines running with multiple vCPUs on this host. For example, a virtual machine with four vCPUs may need to schedule 4 pCPUs to do an operation. If there are multiple virtual machines configured in this way, it may lead to CPU contention and resource starvation.
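
The proportional comparisons described in this section can be sketched as a lookup of the largest of the four state times. This is a rough first-pass aid, not a diagnosis: the hint text paraphrases this article, and the function and dictionary names are hypothetical.

```python
HINTS = {
    "%RUN": "guest OS or VMM is busy; check for a runaway process or loop",
    "%WAIT": "blocked on a VMkernel operation; check storage and attached devices",
    "%RDY": "ready but not scheduled; check host CPU contention and %MLMTD",
    "%CSTP": "waiting to co-schedule vCPUs; review multi-vCPU VMs on the host",
}


def dominant_state(metrics: dict) -> tuple:
    """Return the largest of the four state times and a matching hint."""
    states = {name: metrics[name] for name in HINTS}
    name = max(states, key=states.get)
    return name, HINTS[name]


# Example using the values from the esxtop sample shown earlier in this article.
print(dominant_state({"%RUN": 2.08, "%WAIT": 397.64, "%RDY": 0.25, "%CSTP": 0.00}))
```

Because %WAIT includes idle time, a lightly loaded virtual machine also reports %WAIT as the largest value. Treat the result as meaningful only when the virtual machine is actually unresponsive.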

Action Plan

When using performance metrics to troubleshoot any issue, capture samples to a persistent location so they can be referred to later. For more information, see Performance Data Collection using esxtop and resxtop.
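
Samples captured with esxtop in batch mode are written as CSV, with one column per counter. The sketch below pulls the four state times for one virtual machine out of such a file. The column header layout it expects (a path ending in `Group Cpu(ID:VMName)\% Run`, and the counter names themselves) is an assumption; verify it against the header of your own capture before relying on it. The function name is hypothetical.

```python
import csv
import io


def extract_vm_columns(csv_text: str, vm_name: str,
                       counters=("% Run", "% Wait", "% Ready", "% CoStop")) -> list:
    """Return one dict per sample containing the selected counters for vm_name."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = {}
    for index, column in enumerate(header):
        # Assumed layout: \\host\Group Cpu(<ID>:<VMName>)\<counter name>
        if "Group Cpu(" in column and f":{vm_name})" in column:
            counter = column.rsplit("\\", 1)[-1]
            if counter in counters:
                wanted[counter] = index
    samples = []
    for row in reader:
        samples.append({name: float(row[i]) for name, i in wanted.items()})
    return samples
```

To use it, read the captured file into a string and pass the virtual machine name exactly as it appears in the esxtop NAME column.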

Depending on the nature of the performance metrics determined, it may be pertinent to either crash or suspend the virtual machine to collect additional troubleshooting information, or to investigate a resource constraint or other performance issue. For further information, see the Action Plan section of Troubleshooting a virtual machine that has stopped responding (VM hang/freeze).

If %WAIT is relatively high and the virtual machine is unresponsive, but there are no backing storage or networking infrastructure problems, this indicates that the virtual machine may be blocked on some stuck operation. For more information, see Suspending a virtual machine on ESX/ESXi to collect diagnostic information.
If %RUN is relatively high and the virtual machine is unresponsive, this indicates that the guest operating system or virtual machine monitor is running very hot, possibly indicating a runaway process. For more information, see Suspending a virtual machine on ESX/ESXi to collect diagnostic information.
To adjust your virtual machine distribution based on CPU contention, review Impact of virtual machine memory and CPU resource limits.

Additional Information

Enable ESXi Shell and SSH Access with the Direct Console User Interface
Performance Data Collection using esxtop and resxtop
Troubleshooting a virtual machine that has stopped responding (VM hang/freeze)
Impact of virtual machine memory and CPU resource limits
Troubleshooting ESX/ESXi virtual machine performance issues
Suspending a virtual machine on ESX/ESXi to collect diagnostic information

Setting the CPU limit to 0 makes virtual machines unresponsive

If for any reason you set the CPU limit of running virtual machines to 0, this causes CPU starvation and the VMs become unresponsive.

This issue is resolved in the following release:

VMware ESXi 7.0 Update 3v

https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/7-0/release-notes/esxi-update-and-patch-release-notes/vsphere-esxi-70u3v-release-notes.html
