vGPU VM becomes unresponsive with vmotion

Article ID: 403588

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • The VM becomes unresponsive or powers off after the vMotion task completes or fails.
    • Environment includes Citrix persistent VDI.
  • The log signature of the issue looks like the following:
    vmware.log:
    YYYY-MM-DDTHH:MM:SS In(PID) vmx - Vix: [mainDispatch.c:1172]: VMAutomationPowerOff: Powering off.
    YYYY-MM-DDTHH:MM:SS  Wa(PID) vmx - /vmfs/volumes/datastore/vmname.vmx: Cannot remove symlink /var/run/vmware/X/XXXXXXXXX_XXXXXXX/configFile: No such file or directory
    YYYY-MM-DDTHH:MM:SS In(PID) vmx - WORKER: asyncOps=6 maxActiveOps=2 maxPending=2 maxCompleted=2
    YYYY-MM-DDTHH:MM:SS In(PID) vmx - Vix: [mainDispatch.c:4211]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
    YYYY-MM-DDTHH:MM:SS In(PID) vmx - Msg_Post: Error
    YYYY-MM-DDTHH:MM:SS In(PID) vmx - [msg.moduletable.powerOnFailed] Module 'Migrate' power on failed.
    YYYY-MM-DDTHH:MM:SS In(PID) vmx - [msg.vmx.poweron.failed] Failed to start the virtual machine.
    YYYY-MM-DDTHH:MM:SS Wa(PID) vmx - POST(no connection): msg.vmx.poweron.failed
    YYYY-MM-DDTHH:MM:SS Wa(PID)+ vmx - Module 'Migrate' power on failed.
    YYYY-MM-DDTHH:MM:SS Wa(PID)+ vmx - Failed to start the virtual machine.
    vmkernel.log:
    YYYY-MM-DDTHH:MM:SS In(PID) vmkernel: cpu69:#######)VmMemXfer: vm XXXXXXX: XXXX: Evicting VM with path:/vmfs/volumes/datastore/vmname.vmx
    YYYY-MM-DDTHH:MM:SS In(PID) vmkernel: cpu69:#######)VmMemXfer: 211: Creating crypto hash
    YYYY-MM-DDTHH:MM:SS In(PID) vmkernel: cpu69:#######)VmMemXfer: vm XXXXXXX: XXXX: Could not find MemXferFS region for /vmfs/volumes/datastore/vmname.vmx
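
As a rough aid for triage, the signature above can be checked with a short script. The following is a minimal sketch that assumes the contents of vmware.log and vmkernel.log have already been read into strings. The marker substrings are copied from the log lines above. The function names and the rule that every marker must be present are illustrative assumptions, not part of any VMware tooling.

```python
# Substrings copied from the log signature in this article.
VMWARE_LOG_MARKERS = (
    "VMAutomationPowerOff: Powering off.",
    "[msg.moduletable.powerOnFailed] Module 'Migrate' power on failed.",
    "[msg.vmx.poweron.failed] Failed to start the virtual machine.",
)
VMKERNEL_LOG_MARKERS = (
    "Could not find MemXferFS region for",
)


def find_markers(log_text, markers):
    """Return the markers that occur in log_text, in marker order."""
    return [marker for marker in markers if marker in log_text]


def matches_signature(vmware_log, vmkernel_log):
    """True only if every marker appears in its respective log.

    Requiring all markers is an assumption made for this sketch; a partial
    match may still be worth investigating.
    """
    vmx_hits = find_markers(vmware_log, VMWARE_LOG_MARKERS)
    kernel_hits = find_markers(vmkernel_log, VMKERNEL_LOG_MARKERS)
    return (
        len(vmx_hits) == len(VMWARE_LOG_MARKERS)
        and len(kernel_hits) == len(VMKERNEL_LOG_MARKERS)
    )
```

A match only indicates that the logs resemble this signature. It does not confirm the root cause.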

Environment

VMware vSphere ESXi 8.x

Cause

The NVIDIA GPU or driver times out and fails to send data. During the switchover phase, the migration process waits for GPU checkpoint data from the NVIDIA driver for the duration of that timeout.

Resolution

Ensure that the latest NVIDIA driver (GRID 18.5 or newer) is installed. For further questions, please reach out to NVIDIA.
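
The version requirement can be checked numerically rather than by string comparison, so that a release such as 18.10 is not sorted below 18.5. The following is a minimal sketch that assumes the driver release is available as a dotted numeric string such as "18.5". How that string is obtained from the host is not covered here, and the function names are hypothetical.

```python
MINIMUM_RELEASE = (18, 5)  # "GRID 18.5 or newer", per this article


def parse_release(release):
    """Turn a dotted release string such as '18.5' into a tuple of ints."""
    return tuple(int(part) for part in release.strip().split("."))


def meets_minimum(release, minimum=MINIMUM_RELEASE):
    """True if the release is numerically at or above the minimum."""
    return parse_release(release) >= minimum
```

For example, `meets_minimum("17.6")` is false and `meets_minimum("18.10")` is true.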

  • Workaround:
    • Requirements from "vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster":
      • Healthy vSphere Cluster Services (refer to: vSphere Cluster Services (vCLS) Known Issues/Corner Cases).
      • Healthy GPU and GPU driver (e.g. no Xids or Assertions).
      • Configuration of the VM's vGPU devices through the vCenter UI only.
      • Healthy vMotion network (Example: vMotion NICs setup through Cluster QuickStart).
    • Disable Passthrough VM DRS Automation until NVIDIA's driver issue is fixed (e.g. PassthroughDrsAutomation set to 0). This will reduce the probability of vMotion issues as vGPU VMs will not attempt to migrate.
      • vMotion issues may still occur when a host is evacuated through the Maintenance Mode process. To avoid this, power off vGPU VMs before initiating the Host Maintenance Mode task.
    • If the VI Admin wants to continue using Passthrough VM DRS Automation:
      • For Ampere and newer GPUs with GRID 17.x and newer, the VI Admin can set the configuration option 'vmx.plugin.vmiop.waitForMigrateReadCallback="TRUE"' in the host configuration file /etc/vmware/config or in the VM's VMX configuration. This disables parallelized GPU checkpointing on the source, which further reduces the probability of a migration failure.
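
To confirm that the option above is present after editing, the configuration text can be parsed for it. The following is a minimal sketch that assumes the file uses one `key = "value"` entry per line and that its contents have already been read into a string. The option name is used exactly as written in this article. The helper names and the case-insensitive comparison of the value are assumptions made for illustration.

```python
import re

OPTION = "vmx.plugin.vmiop.waitForMigrateReadCallback"

# Matches lines of the form: key = "value" (spaces around '=' are optional).
# Lines starting with '#' do not match and are skipped.
_LINE_RE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def parse_config(text):
    """Parse key = "value" lines into a dict; later entries override earlier ones."""
    settings = {}
    for line in text.splitlines():
        match = _LINE_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def option_enabled(text, key=OPTION):
    """True if key is present and set to TRUE (compared case-insensitively)."""
    return parse_config(text).get(key, "").upper() == "TRUE"
```

This only reports what the text contains. It does not show whether a running VM has picked up the setting.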

Additional Information

The vMotion Process Under the Hood

vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster

  • Plan A, for customers who are concerned with a 10-second VDI connection timeout and do not need DRS automation:
  • Plan B, for customers who are not concerned with a 10-second VDI connection timeout and do not need DRS automation for Maintenance Mode evacuations only:
  • Plan C, for customers who are concerned with a 10-second VDI connection timeout and want DRS automation only:
  • Plan C enables DRS automation for Load Balancing and Maintenance Mode migrations. A migration occurs only if its estimated duration is under 9 seconds. Assuming your VDI connection timeout is 10 seconds, this should avoid most VDI VM migrations that exceed 9 seconds (assuming a healthy cluster with sufficient network bandwidth).
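
The Plan C rule reduces to a threshold comparison. The following is a minimal sketch of that comparison, using the 9-second threshold and 10-second timeout from this article. It illustrates the rule only and is not how DRS computes or applies its estimate; the function names and the way the estimate is supplied are hypothetical.

```python
def should_auto_migrate(estimated_seconds, threshold_seconds=9.0):
    """Allow migration only when the estimate is strictly under the threshold."""
    if estimated_seconds < 0:
        raise ValueError("estimated_seconds must be non-negative")
    return estimated_seconds < threshold_seconds


def headroom_seconds(vdi_timeout_seconds=10.0, threshold_seconds=9.0):
    """Margin left between the migration threshold and the VDI timeout."""
    return vdi_timeout_seconds - threshold_seconds
```

With the default values, a VM estimated at 8.5 seconds would be migrated and one estimated at 9 seconds would not, which leaves 1 second of headroom before the VDI connection timeout.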