vGPU VM becomes unresponsive during vMotion

Products

VMware vSphere ESXi

Issue/Introduction

The VM becomes unresponsive or powers off after the vMotion task completes or fails.
- Environment includes Citrix persistent VDI.
The log signature of the issue looks like the following:
vmware.log:
YYYY-MM-DDTHH:MM:SS In(PID) vmx - Vix: [mainDispatch.c:1172]: VMAutomationPowerOff: Powering off.
YYYY-MM-DDTHH:MM:SS Wa(PID) vmx - /vmfs/volumes/datastore/vmname.vmx: Cannot remove symlink /var/run/vmware/X/XXXXXXXXX_XXXXXXX/configFile: No such file or directory
YYYY-MM-DDTHH:MM:SS In(PID) vmx - WORKER: asyncOps=6 maxActiveOps=2 maxPending=2 maxCompleted=2
YYYY-MM-DDTHH:MM:SS In(PID) vmx - Vix: [mainDispatch.c:4211]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
YYYY-MM-DDTHH:MM:SS In(PID) vmx - Msg_Post: Error
YYYY-MM-DDTHH:MM:SS In(PID) vmx - [msg.moduletable.powerOnFailed] Module 'Migrate' power on failed.
YYYY-MM-DDTHH:MM:SS In(PID) vmx - [msg.vmx.poweron.failed] Failed to start the virtual machine.
YYYY-MM-DDTHH:MM:SS Wa(PID) vmx - POST(no connection): msg.vmx.poweron.failed
YYYY-MM-DDTHH:MM:SS Wa(PID)+ vmx - Module 'Migrate' power on failed.
YYYY-MM-DDTHH:MM:SS Wa(PID)+ vmx - Failed to start the virtual machine.
vmkernel.log:
YYYY-MM-DDTHH:MM:SS In(PID) vmkernel: cpu69:#######)VmMemXfer: vm XXXXXXX: XXXX: Evicting VM with path:/vmfs/volumes/datastore/vmname.vmx
YYYY-MM-DDTHH:MM:SS In(PID) vmkernel: cpu69:#######)VmMemXfer: 211: Creating crypto hash
YYYY-MM-DDTHH:MM:SS In(PID) vmkernel: cpu69:#######)VmMemXfer: vm XXXXXXX: XXXX: Could not find MemXferFS region for /vmfs/volumes/datastore/vmname.vmx
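As a rough aid for triage, the signatures above can be checked against collected log text with a short script. This is a minimal sketch, not an official detection tool: the choice of marker substrings and the function names (find_markers, matches_signature) are assumptions made for illustration, and timestamps, PIDs, and paths are ignored because they vary between hosts.

```python
# Substrings copied from the log signatures above. Matching on these particular
# substrings is an assumption of this sketch, not an official detection rule.
VMWARE_LOG_MARKERS = (
    "VMAutomationPowerOff: Powering off.",
    "Module 'Migrate' power on failed.",
)
VMKERNEL_LOG_MARKERS = (
    "Could not find MemXferFS region for",
)


def find_markers(log_text, markers):
    """Return a dict mapping each marker to the log lines that contain it."""
    hits = {marker: [] for marker in markers}
    for line in log_text.splitlines():
        for marker in markers:
            if marker in line:
                hits[marker].append(line)
    return hits


def matches_signature(vmware_log_text, vmkernel_log_text):
    """Return True only if every marker appears at least once in its log."""
    vmx_hits = find_markers(vmware_log_text, VMWARE_LOG_MARKERS)
    vmk_hits = find_markers(vmkernel_log_text, VMKERNEL_LOG_MARKERS)
    # An empty list is falsy, so all() fails when any marker is missing.
    return all(vmx_hits.values()) and all(vmk_hits.values())
```

The two arguments are the full text of vmware.log and vmkernel.log, read from wherever the logs were collected. A match suggests, but does not prove, that the VM hit this issue.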

Environment

VMware vSphere ESXi 8.x

Cause

The NVIDIA GPU/driver causes a timeout and fails to send data. During the switchover phase, the migration process waits for GPU checkpoint data from NVIDIA for the duration of that timeout.

Resolution

Ensure that the latest NVIDIA driver (GRID 18.5 or newer) is installed. For further questions, please reach out to NVIDIA.

Workaround:
- Requirements from the KB vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster:
  - Healthy vSphere Cluster Services (Refer to: vSphere Cluster Services (vCLS) Known Issues/Corner Cases).
  - Healthy GPU and GPU driver (e.g. no Xids or Assertions).
  - Configuration of the VM's vGPU devices through the vCenter UI only.
  - Healthy vMotion network (Example: vMotion NICs set up through Cluster QuickStart).
- Disable Passthrough VM DRS Automation until NVIDIA's driver issue is fixed (e.g. set PassthroughDrsAutomation to 0). This reduces the probability of vMotion issues because vGPU VMs will not attempt to migrate.
  - vMotion issues may occur when a host is evacuated through the Maintenance Mode process. To avoid this, power off vGPU VMs before initiating the Host Maintenance Mode task.
- If the VI Admin wants to continue using Passthrough VM DRS Automation:
  - For Ampere and newer GPUs with GRID 17.x and newer, the VI Admin can set the configuration option 'vmx.plugin.vmiop.waitForMigrateReadCallback="TRUE"' in the host config /etc/vmware/config or in the VMX configuration. This disables parallelized GPU checkpoint on the source, which further reduces the probability of the migration failing.
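To confirm that the option is actually present after editing, the configuration text can be checked with a short script. The following is a minimal sketch that assumes the file uses simple key = "value" lines; the helper names (parse_config, wait_for_callback_enabled) are hypothetical, and the exact-case key match is an assumption of the sketch.

```python
OPTION = "vmx.plugin.vmiop.waitForMigrateReadCallback"


def parse_config(text):
    """Parse key = "value" lines into a dict (assumed file layout)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comment lines, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def wait_for_callback_enabled(config_text):
    """Return True if the option is present and set to TRUE."""
    return parse_config(config_text).get(OPTION, "").upper() == "TRUE"
```

Pass in the contents of /etc/vmware/config or of the VM's .vmx file. A False result means the option is missing or set to another value.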

Additional Information

The vMotion Process Under the Hood

vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster

Plan A, for environments where the concern is a 10-second VDI connection timeout and DRS automation is not needed:
- Disable "Passthrough VM DRS Automation" in the Cluster Advanced Settings (Cluster -> Configure -> vSphere DRS).
  - See KB: vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster for more information.
Plan B, for environments where the concern is not a 10-second VDI connection timeout and DRS automation is not needed for Maintenance Mode evacuations only:
- Disable "Passthrough VM DRS Automation" in the Cluster Advanced Settings (Cluster -> Configure -> vSphere DRS).
- Add the option "VgpuMMAutomationTimeoutSecs" with the value "-1" (both without quotes) in the Cluster Advanced Options.
  - See KB: vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster for more information.
Plan C, for environments where the concern is a 10-second VDI connection timeout and DRS automation only is wanted:
- Enable "Passthrough VM DRS Automation" in the Cluster Advanced Settings (Cluster -> Configure -> vSphere DRS).
- Set "VM Devices Stun Time Limit" to 9 seconds.
  - See KB: vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster for more information.
Plan C enables DRS automation for Load Balancing and Maintenance Mode migrations. A migration occurs only if the VM estimates that it can migrate in under 9 seconds. Assuming your VDI connection timeout is 10 seconds, this should avoid most migrations of VDI VMs exceeding 9 seconds (assuming a healthy cluster with sufficient network bandwidth).
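The three plans and the Plan C timing rule can be summarized in a few lines of code. This is an illustrative sketch only: the estimate itself is produced by vSphere, the function names are hypothetical, and treating "under 9 seconds" as a strict less-than comparison is an assumption.

```python
# Settings per plan, using the UI labels and option names given in the text.
PLANS = {
    "A": {"Passthrough VM DRS Automation": "disabled"},
    "B": {
        "Passthrough VM DRS Automation": "disabled",
        "VgpuMMAutomationTimeoutSecs": -1,
    },
    "C": {
        "Passthrough VM DRS Automation": "enabled",
        "VM Devices Stun Time Limit": 9,
    },
}


def limit_fits_timeout(stun_limit_secs, vdi_timeout_secs):
    """Return True if the stun limit leaves headroom below the VDI timeout."""
    return stun_limit_secs < vdi_timeout_secs


def would_attempt_migration(estimated_secs, stun_limit_secs=9):
    """Return True if the estimate is under the limit (strict, by assumption)."""
    return estimated_secs < stun_limit_secs
```

With a 10-second VDI connection timeout, limit_fits_timeout(9, 10) holds, and a VM with an estimate of 8.5 seconds would be a migration candidate while one with an estimate of 12 seconds would not.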