VMs hang and ESXi host becomes unresponsive when attempting to migrate NVIDIA GPU-enabled VMs

Article ID: 412744

Products

VMware vSphere ESXi

Issue/Introduction

- VM migrations within an NVIDIA GPU-enabled ESXi cluster hang at or near 55%.
- The source ESXi host hangs during the migration or after the management agents on the host are restarted.
- Log entries similar to the following appear in the vmware.log of each affected VM:

YYYY-MM-DDTHH:MM:SS In(05) vmx - Vix: [mainDispatch.c:1172]: VMAutomationPowerOff: Powering off.
YYYY-MM-DDTHH:MM:SS Wa(03) vmx - /vmfs/volumes/datastore/vmname.vmx: Cannot remove symlink /var/run/vmware/X/XXXXXXXXX_XXXXXXX/configFile: No such file or directory
YYYY-MM-DDTHH:MM:SS In(05) vmx - WORKER: asyncOps=6 maxActiveOps=2 maxPending=2 maxCompleted=2
YYYY-MM-DDTHH:MM:SS In(05) vmx - Vix: [mainDispatch.c:4211]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
YYYY-MM-DDTHH:MM:SS In(05) vmx - Msg_Post: Error
YYYY-MM-DDTHH:MM:SS In(05) vmx - [msg.moduletable.powerOnFailed] Module 'Migrate' power on failed.
YYYY-MM-DDTHH:MM:SS In(05) vmx - [msg.vmx.poweron.failed] Failed to start the virtual machine.
YYYY-MM-DDTHH:MM:SS Wa(03) vmx - POST(no connection): msg.vmx.poweron.failed
YYYY-MM-DDTHH:MM:SS Wa(03)+ vmx - Module 'Migrate' power on failed.
YYYY-MM-DDTHH:MM:SS Wa(03)+ vmx - Failed to start the virtual machine.
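
When several VMs are involved, it can help to check each vmware.log for these messages programmatically. The following is a minimal sketch, not an official tool: the function name `find_migrate_failures` is hypothetical, and the two signature strings are copied from the excerpt above on the assumption that they appear verbatim in affected logs.

```python
# Signature strings taken from the vmware.log excerpt above.
# Assumption: affected logs contain these strings verbatim.
SIGNATURES = (
    "Module 'Migrate' power on failed.",
    "Failed to start the virtual machine.",
)


def find_migrate_failures(log_text):
    """Return (line_number, line) pairs for lines containing a signature."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(signature in line for signature in SIGNATURES):
            hits.append((number, line.strip()))
    return hits
```

The function takes the log contents as a string, so it can be fed from a file copied off the host, for example with `open(path, errors="replace").read()`.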

Corresponding entries appear in the vmkernel log of the source ESXi host:

YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu69:2104176)VmMemXfer: vm XXXXXXX: XXXX: Evicting VM with path:/vmfs/volumes/datastore/vmname.vmx
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu69:2104176)VmMemXfer: 211: Creating crypto hash
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu69:2104176)VmMemXfer: vm XXXXXXX: XXXX: Could not find MemXferFS region for /vmfs/volumes/datastore/vmname.vmx
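
To see which VMs on a host logged the vmkernel message, the .vmx paths can be pulled out of the log text. This is a rough sketch under stated assumptions: the helper name `affected_vmx_paths` is hypothetical, and the pattern is modeled only on the "Could not find MemXferFS region" line shown above, so the exact wording may differ between ESXi builds.

```python
import re

# Pattern modeled on the vmkernel excerpt above (an assumption, not a
# documented log format). Captures the .vmx path at the end of the message.
MEMXFER_PATTERN = re.compile(
    r"VmMemXfer: .*Could not find MemXferFS region for (.+\.vmx)"
)


def affected_vmx_paths(log_text):
    """Return the distinct .vmx paths named in matching lines, in order."""
    paths = []
    for line in log_text.splitlines():
        match = MEMXFER_PATTERN.search(line)
        if match and match.group(1) not in paths:
            paths.append(match.group(1))
    return paths
```

Each returned path identifies a VM whose vmware.log is worth checking for the 'Migrate' power-on failure messages.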

Environment

VMware vSphere ESXi 8.x

Cause

The NVIDIA GPU driver times out and fails to send data during the migration.
During the switchover phase, the migration process waits for GPU checkpoint data from the NVIDIA driver until that timeout expires.

Resolution

For resolution steps, see the KB article "vGPU VM becomes unresponsive after vMotion task is completed".