Virtual machine CPU usage spikes and remains abnormally high after vMotion in a VMware DRS enabled cluster

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

In a cluster with VMware Distributed Resource Scheduler (DRS) enabled, the CPU usage of a virtual machine may increase significantly after vMotion migrates it. As a result, the performance of the virtual machine may degrade.

Note: This issue is resolved in VirtualCenter 2.5.0 Update 2.

Environment

VMware ESXi 3.5.x Embedded
VMware ESXi 4.1.x Installable
VMware VirtualCenter 2.5.x
VMware ESXi 3.5.x Installable
VMware ESXi 4.1.x Embedded
VMware ESX Server 3.5.x

Resolution

Starting with ESXi/ESX 3.5 and VirtualCenter 2.5, VMware DRS applies a cap to the memory overhead of virtual machines to control the growth rate of this memory. After vMotion migrates a virtual machine, this cap is reset to a computed value specific to that virtual machine. Afterwards, if the virtual machine monitor indicates that the virtual machine requires more overhead memory, VMware DRS raises the cap at a controlled rate (1 MB per minute, by default) until the overhead memory reaches a steady state, provided that sufficient resources are available on the host.
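The intended growth behavior can be illustrated with a little arithmetic. The following is a minimal sketch, not VMware code: the function name and the example figures (a reset cap of 100 MB and a demand of 112 MB) are hypothetical, and it assumes the host always has enough free resources for the cap to grow at the configured rate.

```python
import math


def minutes_to_steady_state(reset_cap_mb, demand_mb, growth_mb_per_min=1.0):
    """Estimate how long the overhead cap needs to reach the VM's demand.

    reset_cap_mb      -- cap applied right after vMotion (hypothetical value)
    demand_mb         -- overhead memory the VM monitor asks for (hypothetical)
    growth_mb_per_min -- controlled growth rate; 1 MB per minute is the
                         default rate described in this article
    """
    shortfall_mb = demand_mb - reset_cap_mb
    if shortfall_mb <= 0:
        # The cap already covers the demand, so no growth is needed.
        return 0
    return math.ceil(shortfall_mb / growth_mb_per_min)


# A 12 MB shortfall at the default rate takes 12 minutes to close.
print(minutes_to_steady_state(100, 112))  # 12
```

With the defect described in this article, the cap does not grow as expected, so the shortfall is not closed in the way this calculation suggests.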

For VirtualCenter 2.5, this cap is not increased to satisfy the virtual machine's steady-state demand as expected. Therefore the virtual machine operates with an overhead memory that is less than its desired size, which in turn may lead to higher observed virtual machine CPU usage and lower virtual machine performance in a VMware DRS-enabled cluster.

Diagnosing the issue

To diagnose the issue:

1. Log into VirtualCenter as an administrator using the Virtual Infrastructure Client.
2. Right-click your cluster from the inventory.
3. Click Edit Settings.
4. Disable VMware DRS.
5. Click OK and wait for 1 minute.
6. In the Virtual Infrastructure Client, note the virtual machine's CPU usage in the Performance tab and the virtual machine's memory overhead in the Summary tab.
7. Right-click your cluster from the inventory.
8. Click Edit Settings.
9. Re-enable VMware DRS.
10. Use vMotion to migrate a problematic virtual machine to another host.
11. Note the virtual machine CPU usage and memory overhead on the new host.
12. Disable VMware DRS on the cluster again as noted above, and wait for 1 minute.
13. Note the virtual machine CPU usage and memory overhead on the new host.

If the CPU usage of the virtual machine increases in step 11 in comparison to step 6, and decreases back to the original state (similar to the behavior in step 6) in step 13 with an observable increase in the overhead memory, this indicates the issue discussed in this article.
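The comparison described above can be written down as a small check. This is only a sketch of the decision rule: the function name, the use of CPU percentages, and the 10-point tolerance for "similar to step 6" are assumptions made for illustration, not values from VMware.

```python
def matches_issue(cpu_step6, cpu_step11, cpu_step13,
                  overhead_step11_mb, overhead_step13_mb,
                  tolerance_pct=10.0):
    """Return True if the readings fit the pattern described in this article.

    cpu_step6  -- CPU usage (%) noted with DRS disabled, before vMotion
    cpu_step11 -- CPU usage (%) noted after vMotion with DRS enabled
    cpu_step13 -- CPU usage (%) noted after DRS is disabled again
    overhead_* -- memory overhead (MB) noted in steps 11 and 13
    tolerance_pct -- assumed margin for treating two CPU readings as similar
    """
    rose_after_vmotion = cpu_step11 > cpu_step6 + tolerance_pct
    returned_to_baseline = abs(cpu_step13 - cpu_step6) <= tolerance_pct
    overhead_grew = overhead_step13_mb > overhead_step11_mb
    return rose_after_vmotion and returned_to_baseline and overhead_grew


# Hypothetical readings: CPU jumps from 20% to 65%, then falls back to 22%
# while the overhead memory grows from 98 MB to 110 MB.
print(matches_issue(20, 65, 22, 98, 110))  # True
```

Adjust the tolerance to suit the normal variation of the workload; a noisy virtual machine may need a wider margin before two readings count as similar.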

You do not need to disable DRS to work around this issue.

Working around the issue prior to VirtualCenter 2.5 Update 1

To work around this issue:

1. Log into VirtualCenter as an administrator using the Virtual Infrastructure Client.
2. Right-click your cluster from the inventory.
3. Click Edit Settings.
4. Ensure that VMware DRS is shown as enabled. If it is not enabled, click the checkbox to enable VMware DRS.
5. Click OK.
6. Click an ESXi/ESX host from the Inventory.
7. Click the Configuration tab.
8. Click Advanced Settings.
9. Click the Mem option.
10. Locate the Mem.VMOverheadGrowthLimit parameter.
11. Change the value of this parameter to 5 and click OK.

Note: By default, this parameter is set to -1.

Fixing multiple ESXi/ESX hosts

If this parameter needs to be changed on several hosts (or if the workaround fails for the individual host), use this procedure to implement the workaround instead of changing every server individually:

1. Log into the VirtualCenter Server Console as an administrator.
2. Make a backup copy of the vpxd.cfg file. This file is typically located in:

   C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\vpxd.cfg
3. In the vpxd.cfg file, add this configuration between the <vpxd> and the </vpxd> tags:

   <cluster>
     <VMOverheadGrowthLimit>5</VMOverheadGrowthLimit>
   </cluster>

   This configuration provides an initial growth margin, in MB, for virtual machine overhead memory. You can increase this value if doing so further improves virtual machine performance.
4. Restart the VMware VirtualCenter Server service.

Note: When you restart the VMware VirtualCenter Server service, the new value for the overhead limit is pushed down to all the clusters in VirtualCenter.

Note: If the new values are not pushed down to the ESXi/ESX hosts within 10 minutes:

1. Log into VirtualCenter as an administrator using the Virtual Infrastructure Client.
2. Right-click your cluster from the inventory.
3. Click Edit Settings.
4. Disable VMware DRS.
5. Click OK. Wait for the DRS-disable task to complete.
6. Right-click your cluster from the inventory.
7. Click Edit Settings.
8. Enable VMware DRS.
9. Click OK.
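The vpxd.cfg change in this section can also be scripted. The sketch below is illustrative only and is not a VMware tool: the function name is hypothetical, it assumes vpxd.cfg is well-formed XML with a <vpxd> element at or directly below the document root, and Python's ElementTree discards XML comments and may alter whitespace when it rewrites the file. Restarting the VMware VirtualCenter Server service remains a manual step.

```python
import os
import shutil
import xml.etree.ElementTree as ET


def set_overhead_growth_limit(cfg_path, limit_mb=5):
    """Back up vpxd.cfg, then set <cluster><VMOverheadGrowthLimit> in <vpxd>."""
    backup_path = cfg_path + ".bak"
    if not os.path.exists(backup_path):
        # Keep the first backup so repeated runs never overwrite the original.
        shutil.copy2(cfg_path, backup_path)

    tree = ET.parse(cfg_path)
    root = tree.getroot()
    # Assumption: <vpxd> is either the root element or a direct child of it.
    vpxd = root if root.tag == "vpxd" else root.find("vpxd")
    if vpxd is None:
        raise ValueError("no <vpxd> element found in " + cfg_path)

    # Reuse existing elements so the setting is never duplicated.
    cluster = vpxd.find("cluster")
    if cluster is None:
        cluster = ET.SubElement(vpxd, "cluster")
    limit = cluster.find("VMOverheadGrowthLimit")
    if limit is None:
        limit = ET.SubElement(cluster, "VMOverheadGrowthLimit")
    limit.text = str(limit_mb)

    tree.write(cfg_path)
```

Try the function on a copy of the file first and compare the result with the original before replacing the live configuration.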

Working around the issue if it persists after upgrading to VirtualCenter 2.5 Update 1

After applying VirtualCenter 2.5 Update 1, it has been reported that under certain circumstances this behavior may persist.

To work around the issue:

Note: The previous steps also work. However, this method is easier to implement and applies to any ESXi/ESX host that is added to the DRS cluster.

1. Log into VirtualCenter as an administrator using the Virtual Infrastructure Client.
2. Right-click your cluster from the inventory.
3. Click Edit Settings.
4. Click VMware DRS (if it is not enabled, enable it).
5. Click Advanced Options.
6. Add MemOverheadGrowth with a value of 4.
7. Click OK to close the Advanced Options.
8. Click OK to close the cluster configuration.

A permanent fix for this behavior is included in VirtualCenter 2.5 Update 2.

Verifying the workaround

To verify that the setting has taken effect:

1. Log into your ESXi/ESX host's service console as root, either via an SSH session or directly from the console of the server.
2. Run the command:

   less /var/log/vmkernel

- If the setting was successfully changed, you see a message similar to:
  
  vmkernel: 1:16:23:57.956 cpu3:1036)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)
  
  No further action is required.
- If changing the setting was unsuccessful, you see a message similar to:
  
  vmkernel: 1:08:05:22.537 cpu2:1036)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)

Note: If you see a message indicating that the limit changed to 5, followed by a message indicating that it changed back to -1, the fix has not been successfully applied. To resolve this:

1. Create a new cluster and move the ESXi/ESX hosts to this cluster.
2. Verify whether the fix has been implemented successfully.
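The log check in this section can be automated with a short script. This is a sketch under assumptions: the function names are hypothetical, the regular expression is modeled on the sample messages shown in this article, and the "changed back to -1" line in the example is constructed by analogy with them rather than copied from a real log.

```python
import re

# Matches lines such as:
#   ... Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)
PATTERN = re.compile(
    r'"?VMOverheadGrowthLimit"\s*=\s*(-?\d+),\s*Old Value:\s*(-?\d+)'
)


def growth_limit_changes(log_lines):
    """Return (old_value, new_value) pairs in the order they were logged."""
    changes = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            changes.append((int(match.group(2)), int(match.group(1))))
    return changes


def workaround_applied(log_lines, expected=5):
    """True only if the most recent logged change left the limit at `expected`."""
    changes = growth_limit_changes(log_lines)
    return bool(changes) and changes[-1][1] == expected


sample = [
    'vmkernel: 1:16:23:57.956 cpu3:1036)Config: 414: '
    '"VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)',
]
print(workaround_applied(sample))  # True
```

To check a real host, pass the lines of /var/log/vmkernel instead of the sample list, for example by opening the file and iterating over it.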
