SDDC Manager upgrade precheck fails because VMs with passthrough NVIDIA GPUs and UVM enabled cannot be live-migrated.

Article ID: 409562

Products

VMware SDDC Manager

Issue/Introduction

When attempting to upgrade hosts via SDDC Manager, the upgrade precheck fails with the following error:

A VM faults were found while performing a dry run enter in maintenance mode validation.
High: Do not perform upgrade without addressing this issue.

Additional details:

Precheck validation reports failure for host <host_fqdn> in cluster <vc_fqdn>.

vCenter Tasks and Events show entries similar to:

Unable to automatically migrate <vm_name> from <host_fqdn>
DRS failed to generate a vMotion recommendation for a virtual machine on a host entering Maintenance Mode.
This condition typically occurs because no other host in the DRS cluster is compatible with the virtual machine

The vCenter vpxd.log (/var/log/vmware/vpxd/vpxd.log) records errors such as:

CompatCheck results: (vim.vm.check.Result) [
vm = 'vim.VirtualMachine:<vmid>',
host = 'vim.HostSystem:<host_fqdn>',
error = (vim.fault.InsufficientResourcesFault) {
faultMessage = [
key = "com.vmware.vim.vpxd.vmcheck.assignHwNotAvailable"
missing = "pciPassthru0"
]
}
]
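
To confirm which device is blocking migration, the fault key and the `missing` field from an excerpt like the one above can be pulled out of a saved copy of vpxd.log. The following is a minimal sketch, not a supported tool: the function name is hypothetical, and it assumes the `missing = "..."` field appears within a few lines of the fault key, as it does in the excerpt.

```python
import re

# Fault key copied from the vpxd.log excerpt in this article.
FAULT_KEY = "com.vmware.vim.vpxd.vmcheck.assignHwNotAvailable"


def find_missing_passthrough_devices(log_text):
    """Return device labels reported as missing next to the fault key.

    Hypothetical helper: assumes the 'missing = "..."' field follows
    the fault key within the next three lines.
    """
    devices = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if FAULT_KEY not in line:
            continue
        for follow in lines[i + 1:i + 4]:
            match = re.search(r'missing\s*=\s*"([^"]+)"', follow)
            if match:
                devices.append(match.group(1))
                break
    return devices


SAMPLE = '''error = (vim.fault.InsufficientResourcesFault) {
faultMessage = [
key = "com.vmware.vim.vpxd.vmcheck.assignHwNotAvailable"
missing = "pciPassthru0"
]
}'''

print(find_missing_passthrough_devices(SAMPLE))
```

A result containing a `pciPassthru` label indicates that the destination host could not provide the passthrough device the VM expects.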

Environment

VMware Cloud Foundation 5.x
VMware vSphere 7.x
VMware vSphere 8.x

Cause

The VM is configured with PCI passthrough (DirectPath I/O) for an NVIDIA GPU.
The VM .vmx file includes the parameter: pciPassthru0.cfg.enable_uvm = 1
This setting enables NVIDIA Unified Virtual Memory (UVM) inside the guest OS, which allows CUDA Unified Memory workloads to share CPU and GPU address spaces.
However, vMotion is not supported when UVM is enabled on passthrough GPUs, because memory state cannot be checkpointed or transferred across hosts.
As a result, DRS fails to migrate the VM during the dry-run Enter Maintenance Mode (EMM) validation, causing the SDDC Manager upgrade precheck to fail.
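
To identify affected VMs ahead of the precheck, the .vmx contents can be scanned for the parameter described above. This is a sketch under assumptions: the function name and the sample .vmx text are illustrative, and the pattern accepts the value with or without quotes and any `pciPassthruN` index, which may be broader than what a given environment uses.

```python
import re

# Matches lines such as: pciPassthru0.cfg.enable_uvm = "1" (quotes optional).
UVM_PATTERN = re.compile(
    r'^\s*(pciPassthru\d+)\.cfg\.enable_uvm\s*=\s*"?([^"\s]+)"?\s*$',
    re.IGNORECASE,
)


def uvm_enabled_devices(vmx_text):
    """Return the pciPassthruN devices whose enable_uvm value is 1."""
    found = []
    for line in vmx_text.splitlines():
        match = UVM_PATTERN.match(line)
        if match and match.group(2) == "1":
            found.append(match.group(1))
    return found


# Illustrative .vmx fragment; the VM name is made up for the example.
SAMPLE_VMX = '''displayName = "gpu-vm-01"
pciPassthru0.present = "TRUE"
pciPassthru0.cfg.enable_uvm = "1"
pciPassthru1.cfg.enable_uvm = "0"
'''

print(uvm_enabled_devices(SAMPLE_VMX))
```

Any VM for which this returns a non-empty list should be expected to fail the dry-run Enter Maintenance Mode validation.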

Resolution

There are two options, depending on workload requirements:

1. Use NVIDIA vGPU instead of passthrough (Recommended)

Deploy the NVIDIA vGPU Manager (vCS) on ESXi.
Assign a supported vGPU profile to the VM instead of raw passthrough.
vGPU abstracts the GPU memory and state, enabling vMotion, DRS, and HA.
This method supports vMotion with NVIDIA GPUs except when CUDA Unified Memory (UVM) is enabled.
Reference: NVIDIA AI Enterprise User Guide

2. Continue using passthrough (DirectPath I/O) or CUDA Unified Memory, and remove the dependency on vMotion

If workloads strictly require full bare-metal GPU access, vMotion cannot be used.
In this case:
- Manually power off the VM(s) using GPU passthrough or CUDA Unified Memory before placing the ESXi host into Maintenance Mode.
- Proceed with the ESXi upgrade.
- Power on the VM(s) after the upgrade completes.
Note: This approach does not provide vMotion, HA restart, or DRS balancing for these VMs. Availability must be managed at the application layer or through alternative recovery mechanisms.
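
The manual sequence above can be written down as an ordered runbook before the maintenance window. The sketch below only builds a list of step descriptions from an inventory you supply; it does not call vCenter or SDDC Manager, and the `VmRecord` type, its fields, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VmRecord:
    # Hypothetical inventory record; populate it from your own tooling.
    name: str
    host: str
    powered_on: bool
    uses_gpu_passthrough_or_uvm: bool


def build_upgrade_runbook(host, vms):
    """Return ordered step descriptions for upgrading one ESXi host."""
    targets = [
        vm.name
        for vm in vms
        if vm.host == host and vm.powered_on and vm.uses_gpu_passthrough_or_uvm
    ]
    steps = [f"power off {name}" for name in targets]
    steps.append(f"place {host} in maintenance mode and upgrade")
    steps.extend(f"power on {name}" for name in targets)
    return steps


inventory = [
    VmRecord("gpu-vm-01", "esx01.example.com", True, True),
    VmRecord("web-vm-01", "esx01.example.com", True, False),
    VmRecord("gpu-vm-02", "esx02.example.com", True, True),
]

for step in build_upgrade_runbook("esx01.example.com", inventory):
    print(step)
```

Only powered-on GPU VMs on the target host are listed, since other VMs can still be migrated by DRS.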

Additional Information

This behavior is expected when using DirectPath I/O with Unified Virtual Memory (UVM) enabled. There are two options for these workloads:

- Migrate to vGPU, which supports vMotion and DRS as long as CUDA Unified Memory (UVM) is not enabled on the vGPU.
- Remain on passthrough, in which case vMotion and DRS are not available for these VMs.

For context, the advanced parameter pciPassthru0.cfg.enable_uvm = 1 enables UVM, which lets the CPU and GPU share an address space for workloads such as AI/ML, HPC, and GPU compute. While this feature is essential for certain applications, it prevents the VM from being migrated with vMotion.
