- VM migrations within an NVIDIA GPU-enabled ESXi cluster hang at or near 55%
- The source ESXi host hangs during the migration or after restarting management agents on the host
- Log entries similar to the following appear in the vmware.log of each affected VM:
YYYY-MM-DDTHH:MM:SS In(05) vmx - Vix: [mainDispatch.c:1172]: VMAutomationPowerOff: Powering off.
YYYY-MM-DDTHH:MM:SS Wa(03) vmx - /vmfs/volumes/datastore/vmname.vmx: Cannot remove symlink /var/run/vmware/X/XXXXXXXXX_XXXXXXX/configFile: No such file or directory
YYYY-MM-DDTHH:MM:SS In(05) vmx - WORKER: asyncOps=6 maxActiveOps=2 maxPending=2 maxCompleted=2
YYYY-MM-DDTHH:MM:SS In(05) vmx - Vix: [mainDispatch.c:4211]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
YYYY-MM-DDTHH:MM:SS In(05) vmx - Msg_Post: Error
YYYY-MM-DDTHH:MM:SS In(05) vmx - [msg.moduletable.powerOnFailed] Module 'Migrate' power on failed.
YYYY-MM-DDTHH:MM:SS In(05) vmx - [msg.vmx.poweron.failed] Failed to start the virtual machine.
YYYY-MM-DDTHH:MM:SS Wa(03) vmx - POST(no connection): msg.vmx.poweron.failed
YYYY-MM-DDTHH:MM:SS Wa(03)+ vmx - Module 'Migrate' power on failed.
YYYY-MM-DDTHH:MM:SS Wa(03)+ vmx - Failed to start the virtual machine.
Corresponding entries appear in the vmkernel log of the source ESXi host:
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu69:2104176)VmMemXfer: vm XXXXXXX: XXXX: Evicting VM with path:/vmfs/volumes/datastore/vmname.vmx
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu69:2104176)VmMemXfer: 211: Creating crypto hash
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu69:2104176)VmMemXfer: vm XXXXXXX: XXXX: Could not find MemXferFS region for /vmfs/volumes/datastore/vmname.vmx
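The log signatures above can be checked with a small script. The following is a minimal sketch, not an official tool: the function names are hypothetical, the marker strings are copied from the sample entries above, and it assumes the logs have already been copied off the host as plain text. A match only indicates that the logs resemble this issue; it does not confirm the cause.

```python
import re

# Literal substrings copied from the sample vmware.log entries above.
VMX_MARKERS = (
    "[msg.moduletable.powerOnFailed] Module 'Migrate' power on failed.",
    "[msg.vmx.poweron.failed] Failed to start the virtual machine.",
)

# Pattern based on the sample vmkernel entry; captures the .vmx path at end of line.
VMKERNEL_MARKER = re.compile(
    r"VmMemXfer: .*Could not find MemXferFS region for (.+?)\s*$"
)


def vmx_log_matches(lines):
    """Return True if every vmware.log marker appears in at least one line."""
    lines = list(lines)
    return all(any(marker in line for line in lines) for marker in VMX_MARKERS)


def vmkernel_affected_vms(lines):
    """Return the unique .vmx paths named in 'Could not find MemXferFS region' entries."""
    paths = []
    for line in lines:
        match = VMKERNEL_MARKER.search(line)
        if match and match.group(1) not in paths:
            paths.append(match.group(1))
    return paths
```

For example, `vmx_log_matches(open("vmware.log"))` and `vmkernel_affected_vms(open("vmkernel.log"))` could be run against copies of the two logs; the file names and locations here are assumptions and depend on where the logs were saved.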
vSphere ESXi 8.X
The NVIDIA GPU driver times out and fails to send data during the migration.
During the switchover phase, the migration process waits for GPU checkpoint data from the NVIDIA driver for the duration of that timeout.
For resolution steps, see the following KB article: "vGPU VM becomes unresponsive after vMotion task is completed"