DRS fails to automatically migrate vGPU-enabled VMs

Article ID: 313479

Products

VMware vCenter Server

Issue/Introduction

When an ESXi host is placed in maintenance mode, DRS may fail to automatically migrate VMs that have vGPU enabled.
Under a cluster's Monitor tab, the "vSphere DRS" -> "Recommendations" section may show a VM migration recommendation for vGPU VMs with the reason "Destination host is selected for fixing policy/rule violation or its healthy state is better".
Manually migrating vGPU-enabled VMs with vMotion works as expected.

Environment

VMware vCenter Server 6.7.x
VMware vCenter Server 7.0.x
VMware vCenter Server 8.0.2

Cause

This is expected behaviour at this time. Per the documentation, DRS performs only the initial placement of a VM with vGPU and does not automatically load balance it:

Using vMotion to Migrate vGPU Virtual Machines (vmware.com)

"DRS supports initial placement of vGPU VMs running vSphere 6.7 Update 1 and later without load balancing support"

Resolution

Starting with vSphere 8.0 U2, DRS can estimate the Stun Time for a given vGPU VM configuration. When the DRS Cluster Advanced Options are set and the Estimated VM Devices Stun Time for a VM is lower than the VM Devices vMotion Stun Time limit, DRS will automate VM migrations.
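
The condition above can be sketched as a small predicate. This is only an illustration of the rule as stated in this article, not DRS's actual implementation: the function and parameter names are hypothetical, the estimate itself is computed by DRS, and the 100-second default is the "vMotion Stun Time Limit" default mentioned later in this article.

```python
def drs_would_automate(estimated_stun_s: float,
                       stun_limit_s: float = 100.0,
                       advanced_options_set: bool = True) -> bool:
    """Illustrative check of the rule described above (hypothetical helper).

    estimated_stun_s     -- Estimated VM Devices Stun Time, in seconds
    stun_limit_s         -- VM Devices vMotion Stun Time limit, in seconds
    advanced_options_set -- whether the DRS Cluster Advanced Options are set
    """
    # DRS automates the migration only when the options are set AND the
    # estimate is strictly lower than the limit.
    return advanced_options_set and estimated_stun_s < stun_limit_s
```

For example, a VM with an estimated stun time of 40 seconds would qualify under the default limit, while one estimated at 100 seconds or more would not.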

To enable this functionality, make sure the infrastructure meets the following requirements:

Healthy vSphere Lifecycle Services (see https://kb.vmware.com/s/article/91891)
The VM's vGPU devices are configured through the vCenter UI only
Healthy vMotion network (e.g. vMotion NICs set up through https://core.vmware.com/resource/cluster-quickstart)

Then add the following DRS Cluster Advanced Options:

Option: PassthroughDrsAutomation
Value: 1

Option: LBMaxVmotionPerHost
Value: 1

For vGPU VMs with Stun Times exceeding the "vMotion Stun Time Limit" (default 100 seconds), a VI Admin can add the following DRS Cluster Advanced Option:

Option: VmDevicesStunTimeTolerated
Value: <number of seconds, greater than any VM's Estimated Stun Time in the Cluster> (Default 100 seconds)

The "vMotion Stun Time Limit" can be modified in the VM's configuration, under the "VM Options" tab -> "Advanced" section.
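
As a planning aid, the option names and values above can be collected in a short Python sketch. The helper name, the string-typed values, and the one-second margin are assumptions made for illustration. The sketch only computes the name/value pairs to enter as DRS Cluster Advanced Options and does not talk to vCenter.

```python
import math


def drs_advanced_options(estimated_stun_times_s=(), default_limit_s=100, margin_s=1):
    """Build the DRS Cluster Advanced Options described above (hypothetical helper).

    estimated_stun_times_s -- Estimated Stun Times (seconds) of the cluster's vGPU VMs
    default_limit_s        -- default stun time limit from this article (100 seconds)
    margin_s               -- assumed whole-second headroom, must be >= 1
    """
    options = {
        "PassthroughDrsAutomation": "1",
        "LBMaxVmotionPerHost": "1",
    }
    worst = max(estimated_stun_times_s, default=0)
    if worst >= default_limit_s:
        # The tolerated value must be greater than any VM's Estimated Stun Time.
        options["VmDevicesStunTimeTolerated"] = str(math.floor(worst) + margin_s)
    return options
```

With estimates of 40 and 130.5 seconds, the sketch adds `VmDevicesStunTimeTolerated` with a value of 131; with all estimates below 100 seconds, only the two base options are returned.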

For older releases, follow these steps to resolve the issue:

For Maintenance Mode evacuations, refer to "vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster".
If VM placement issues arise, reduce "DRS Automation" to "Partially Automated". Refer to "Creating a DRS Cluster" for more information.

Workaround:
To work around this issue, manually migrate vGPU-enabled VMs to another host.
