Checking for resource starvation of the ESX Service Console

Products

VMware vSphere ESXi

Issue/Introduction

For troubleshooting purposes, it may be necessary to check if any processes are consuming a substantial amount of resources on the service console. Processes consuming a substantial amount of resources can prevent correct operation of the ESX system. This article provides you with the steps to check for starvation of resources on the ESX host service console.

Symptoms:

High CPU utilization on an ESX host
High memory utilization on an ESX host
Slow response when administering an ESX host

Environment

VMware ESX 4.1.x
VMware ESX 4.0.x
VMware ESX Server 3.5.x
VMware ESX Server 3.0.x

Resolution

Introduction to performance monitoring

If any process uses a substantial amount of CPU or memory on the ESX host service console, it can prevent correct operation of the system. ESX includes the top utility for checking resource utilization on the service console. Use it to view the current values of the statistics and to determine whether the service console is starved of resources.

To check the utilization of the processes on the service console:

Log in to your ESX host service console as root from either an SSH session or directly from the console of the server.
Type top.
To exit top, press Q.
When you have finished reviewing the output, type logout and press Enter to exit the system.

After you type top, a screen appears that shows the resource utilization and running processes on the server.

Checking for CPU Starvation of an ESX host

The statistics you must review are load average and CPU Idle. These statistics provide an overall indication of how busy the ESX host is.

Load average is a measurement of the number of processes that are currently waiting in the run queue plus the number of processes that are being executed, averaged over 1-, 5-, and 15-minute intervals. A load average of 1.00 means that the ESX host machine's physical CPUs are fully utilized, and a load average of 0.5 indicates they are half utilized. A load average of 2.00 indicates that the system is busy. If the load average is over 4.00, the system is heavily utilized and performance is impacted.

A load average similar to this indicates that the ESX Service Console does not have a queue of tasks waiting to process:

load average: 0.14, 0.06, 0.01

A load average similar to this indicates that tasks are waiting in the run queue to be processed:

load average: 2.00, 2.00, 2.00
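The thresholds described above can be sketched as a small Python helper. This is only an illustration: the function names parse_load_average and classify_load are hypothetical, and the handling of values that fall between the thresholds named in this article (1.00, 2.00, and over 4.00) is an assumption made for the example.

```python
import re

def parse_load_average(line):
    """Return the 1-, 5-, and 15-minute load averages found in a top header line."""
    match = re.search(
        r"load average:\s*(\d+\.\d+),\s*(\d+\.\d+),\s*(\d+\.\d+)", line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % line)
    return tuple(float(value) for value in match.groups())

def classify_load(load):
    """Map one load value to the rough bands described in this article.

    How values between the named thresholds are treated is an assumption.
    """
    if load > 4.0:
        return "heavily utilized"
    if load >= 2.0:
        return "busy"
    if load >= 1.0:
        return "fully utilized"
    return "not saturated"

for sample in ("load average: 0.14, 0.06, 0.01",
               "load average: 2.00, 2.00, 2.00"):
    one_minute, five_minute, fifteen_minute = parse_load_average(sample)
    print(sample, "->", classify_load(one_minute))
```

Comparing the 1-minute value with the 15-minute value can also help distinguish a brief spike from sustained load.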

The CPU state counters provide an overview of the CPU utilization in each state on the system. If your screen looks like this, your system has a high CPU idle percentage. A high CPU idle percentage means that the system is not busy:

CPU states: cpu user nice system  irq softirq iowait idle
          total 0.1% 0.0%   0.0% 0.0%    1.3%  12.1% 86.2%

If the CPU idle counter is low, investigate which state is consuming the CPU time. The different states mean:

User is the percentage of the processor time used for running user processes, such as an application.
Nice is the percentage of the processor time used for a user process that is running with an altered scheduling priority.
System is the percentage of the processor time used for a system process, such as kernel or driver calls.
Irq is the percentage of the processor time used for hardware interrupt requests.
Softirq is the percentage of the processor time used for software interrupt requests.
Iowait is the percentage of the processor time waiting on the completion of disk Input/Output.
Idle is the percentage of the processor time that processors are free.

When the CPU idle state is at 0%, it looks like this:

CPU states: cpu user nice system  irq softirq iowait idle
          total 1.1% 0.0%   0.1% 0.0%    0.0%  98.6% 0.0%

In this example, the CPU time is being consumed in the iowait state. When this happens, check the disk subsystem to determine what is delaying the response from storage.
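To make the reading of the CPU states lines concrete, here is a minimal Python sketch that parses the two-line layout shown in this article and reports which non-idle state consumes the most time. The function names parse_cpu_states and busiest_state are hypothetical, and the sketch assumes the column layout matches the samples above, which may differ between versions of top.

```python
def parse_cpu_states(header_line, values_line):
    """Pair each state name from the header with its percentage value."""
    # Header looks like "CPU states: cpu user nice ... idle".
    # Drop the leading "cpu" column label.
    names = header_line.split(":", 1)[1].split()[1:]
    # Values look like "total 1.1% 0.0% ...". Drop the "total" row label.
    values = [float(field.rstrip("%")) for field in values_line.split()[1:]]
    if len(names) != len(values):
        raise ValueError("header and values have different column counts")
    return dict(zip(names, values))

def busiest_state(states):
    """Return the non-idle state with the highest percentage."""
    busy = {name: value for name, value in states.items() if name != "idle"}
    return max(busy, key=busy.get)

header = "CPU states: cpu user nice system  irq softirq iowait idle"
values = "          total 1.1% 0.0%   0.1% 0.0%    0.0%  98.6% 0.0%"

states = parse_cpu_states(header, values)
print("idle:", states["idle"])
print("busiest state:", busiest_state(states))
```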

Note: If the CPU time is being consumed in the user state, you can determine the process that is consuming the CPU from the list of tasks below the statistics. The list of tasks refreshes every few seconds to provide an updated view of the process list. In this example, vmware-hostd is consuming 0.9% of the available CPU:

Checking for Memory Starvation of an ESX host

Memory and swap are the statistics you need to review. These statistics provide an overall indication of how much memory is being used and if there is heavy swapping occurring on the system. This screen shows an example of the expected output:

The example above indicates that there is 268248KB (268MB) of RAM in the system and that 84864KB (85MB) is free. There is 554168KB (554MB) of swap available in the system and 503152KB (503MB) is free. In this case there is substantial RAM available for the service console to use and therefore very little swapping occurs.

Note: This view shows only the amount of RAM that is assigned to the ESX host service console. It does not show the total RAM in the server.
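The arithmetic behind the example figures can be worked through with a short Python sketch. The helper name used_percent is hypothetical. The numbers are the RAM and swap values quoted in this section.

```python
def used_percent(total_kb, free_kb):
    """Return the share of a memory pool that is in use, as a percentage."""
    if total_kb <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total_kb - free_kb) / total_kb

# Values quoted in this section, in KB.
ram_used = used_percent(268248, 84864)
swap_used = used_percent(554168, 503152)

print("RAM in use:  %.1f%%" % ram_used)
print("Swap in use: %.1f%%" % swap_used)
```

A low swap-in-use percentage alongside free RAM is consistent with the conclusion above that very little swapping occurs.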

To troubleshoot an ESX host that shows a low amount of RAM and high amount of swapping:

Disable any third-party services that have been installed for testing. Third-party services may be using up memory resources.
Try increasing the amount of RAM that has been assigned to the ESX host service console. For more information, see Increasing the amount of RAM assigned to the ESX Server service console (1003501).
Check all virtual machine configurations to ensure none of them have an unreasonably high CPU reservation, like 10000MHz.

Note: You can also see the amount of memory and swap currently in use from the /proc/meminfo file.
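As a rough sketch of reading those values programmatically, the following Python example parses text in the "Name: value kB" form used by /proc/meminfo. The sample text is illustrative only and reuses the figures from this article. It is not captured from a real host, and the exact set of fields varies by kernel version. The function names parse_meminfo and read_meminfo are hypothetical.

```python
import re

# Illustrative sample only, built from the figures quoted in this article.
SAMPLE = """\
MemTotal:       268248 kB
MemFree:         84864 kB
SwapTotal:      554168 kB
SwapFree:       503152 kB
"""

def parse_meminfo(text):
    """Collect 'Name: value kB' lines into a dict of integers (in kB)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)\s+kB", line)
        if match:  # lines in any other layout are skipped
            fields[match.group(1)] = int(match.group(2))
    return fields

def read_meminfo(path="/proc/meminfo"):
    """Parse the live file. Only meaningful on a Linux-style system."""
    with open(path) as handle:
        return parse_meminfo(handle.read())

info = parse_meminfo(SAMPLE)
swap_in_use_kb = info["SwapTotal"] - info["SwapFree"]
print("MemFree (kB):", info["MemFree"])
print("Swap in use (kB):", swap_in_use_kb)
```

On the service console itself, read_meminfo() could be called in place of parse_meminfo(SAMPLE), assuming a Python interpreter is available there.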

I/O starvation can be caused by many issues, but it commonly occurs when a LUN is removed and the ESX host is not rescanned. To properly remove LUNs from your ESX host, see Removing a LUN containing a datastore from VMware ESXi/ESX 4.x (1029786).

Additional Information

For more information, see VMware HA configuration fails with a VMWareClusterManager Rule not enabled error (1004495).

Increasing the amount of RAM assigned to the ESX Server service console
VMware HA configuration fails with a VMWareClusterManager Rule not enabled error
Removing a LUN containing a datastore from VMware ESXi/ESX 4.0 and 4.1