vGPU VM vMotion stuck at 70%

Article ID: 320352

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • vMotion of a VM with a vGPU gets stuck at 70%. The issue is intermittent: it does not occur on every vMotion, but it can be reproduced from time to time.
  • The vMotion task eventually times out and the VM continues to run on the source host.

Environment

VMware vSphere ESXi 7.0
VMware vSphere ESXi 8.0

Cause

The source vmware.log shows that the migration exceeded the maximum switchover time of 100 seconds, which caused the vMotion task to fail:

[YYYY-MM-DDTHH:MM:SS] In(05) worker-3476962 - Migrate: Remote Log: Destination waited for 208.42 seconds.
[YYYY-MM-DDTHH:MM:SS] In(05) worker-3476962 - Migrate: Remote Log: Beginning checkpoint restore.
[YYYY-MM-DDTHH:MM:SS] In(05) worker-3476962 - Migrate: Remote Log: Switching to checkpoint state.
[YYYY-MM-DDTHH:MM:SS] In(05) vmx - MigrateWriteHostLog: Writing to log file took 3903 us.
[YYYY-MM-DDTHH:MM:SS] In(05) vmx - MigrateSetStateFinished: type=1 new state=6
[YYYY-MM-DDTHH:MM:SS] In(05) vmx - MigrateSetState: Transitioning from state 5 to 6.
[YYYY-MM-DDTHH:MM:SS] No(00) vmx - ConfigDB: Setting config.readOnly = "FALSE"
[YYYY-MM-DDTHH:MM:SS] In(05) vmx - Migrate_SetFailureMsgList: switching to new log file.
[YYYY-MM-DDTHH:MM:SS] In(05) vmx - Migrate_SetFailureMsgList: Now in new log file.
[YYYY-MM-DDTHH:MM:SS] In(05) vmx - Migrate: Caching migration error message list:
[YYYY-MM-DDTHH:MM:SS] In(05) vmx - [msg.checkpoint.migration.maxSwitchoverTimeExceeded] The migration has exceeded the maximum switchover time of 100 second(s). ESX has preemptively failed the migration to allow the VM to continue running on the source.  To avoid this failure, either increase the maximum allowable switchover time or wait until the VM is performing a less intensive workload.
[YYYY-MM-DDTHH:MM:SS] In(05) vmx - Migrate: Attempting to continue running on the source.
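
To check a source vmware.log for this signature, a minimal Python sketch along these lines may help. It only matches the message text shown in the excerpt above; the function name `summarize_source_log` and the returned field names are illustrative assumptions, not part of any VMware tool.

```python
import re

# Patterns are derived only from the log excerpt in this article.
SWITCHOVER_RE = re.compile(
    r"\[msg\.checkpoint\.migration\.maxSwitchoverTimeExceeded\].*?"
    r"maximum switchover time of (\d+) second"
)
DEST_WAIT_RE = re.compile(r"Destination waited for ([\d.]+) seconds")


def summarize_source_log(lines):
    """Report the switchover limit and destination wait time, if logged.

    Hypothetical helper: returns None for any value not found.
    """
    summary = {"switchover_limit_s": None, "destination_wait_s": None}
    for line in lines:
        match = SWITCHOVER_RE.search(line)
        if match:
            summary["switchover_limit_s"] = int(match.group(1))
        match = DEST_WAIT_RE.search(line)
        if match:
            summary["destination_wait_s"] = float(match.group(1))
    return summary
```

The function accepts any iterable of lines, so an open file handle for the source VM's vmware.log can be passed directly. A non-None `switchover_limit_s` indicates that the source host aborted the migration because the switchover limit was exceeded.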

The destination vmware.log shows that the destination VM started restoring the vGPU state and got stuck at that step:

[YYYY-MM-DDTHH:MM:SS] In(05) vthread-4084785 - vmiop_log: (0x0): Start restoring vGPU state ...
[YYYY-MM-DDTHH:MM:SS] In(05) worker-4084777 - GetHostManifests: Done extracting the manifest file.
[YYYY-MM-DDTHH:MM:SS] In(05) worker-4084777 - Using ToolsMinVersion = 8384
[YYYY-MM-DDTHH:MM:SS] In(05) worker-4084777 - ToolsVersionGetStatusWorkerThread: Tools status 3 derived from environment
[YYYY-MM-DDTHH:MM:SS] In(05) vthread-4084785 - vmiop_log: Migration source host driver version: 470.103.02
[YYYY-MM-DDTHH:MM:SS] Wa(03) vmx - Caught signal 15 -- tid 4083474 (eip 0xcb8bfa2dbc)
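
The destination-side signature can be checked in a similar way. The sketch below is an assumption-laden illustration: it looks only for the three strings visible in the excerpt above (the start of the vGPU state restore, the source host driver version, and a caught signal), and the function name `summarize_destination_log` is hypothetical. It does not know what a successful restore looks like in the log, so it cannot confirm a hang on its own.

```python
import re

# Strings are taken verbatim from the destination log excerpt in this article.
RESTORE_START = "Start restoring vGPU state"
DRIVER_RE = re.compile(r"Migration source host driver version: (\S+)")
SIGNAL_RE = re.compile(r"Caught signal (\d+)")


def summarize_destination_log(lines):
    """Collect the details the vGPU vendor is likely to ask for.

    Hypothetical helper: records whether the vGPU restore started,
    the source host driver version, and the first signal caught
    after the restore began.
    """
    restore_started = False
    driver_version = None
    signal_after_restore = None
    for line in lines:
        if RESTORE_START in line:
            restore_started = True
        match = DRIVER_RE.search(line)
        if match:
            driver_version = match.group(1)
        match = SIGNAL_RE.search(line)
        if match and restore_started and signal_after_restore is None:
            signal_after_restore = int(match.group(1))
    return {
        "restore_started": restore_started,
        "source_driver_version": driver_version,
        "signal_after_restore": signal_after_restore,
    }
```

If `restore_started` is true and a signal is recorded afterwards, the log is consistent with the pattern described in this article, and the driver version is worth including when the vGPU vendor is contacted.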

Resolution

Engage the vGPU vendor for further analysis.

Workaround:
Engage the vGPU vendor for further analysis.

Additional Information

Impact/Risks:
- A VM with a vGPU fails to migrate to another host with vMotion.