vMotion or Storage vMotion of a Virtual Machine Assigned with NVIDIA vGPU Fails with a Timeout

Article ID: 439447

Products

VMware vSphere ESXi

Issue/Introduction

  • When you perform vMotion or Storage vMotion on a virtual machine that has an NVIDIA vGPU profile assigned, the task may fail with a timeout partway through the migration.

  • Reviewing the vmware.log of the target VM reveals messages similar to the following during the "Stop and Copy" phase of the migration task:

Er(02) vcpu-x - vmiop_log: (0x0): Copy sysmem tracking failed, 0x7
Er(02) vcpu-x - vmiop_log: (0x0): CPU RPC async recv response failed: 0x7
Er(02) vcpu-x - vmiop_log: (0x0): Recv MIGRATION Stop and Copy RPC response failed, 0x7
Er(02) vcpu-x - vmiop_log: (0x0): stop and copy failed

Note: In addition to the above messages, the following may also be observed:

Er(02) vthread-xxxxxxx - vmiop_log: (0x0): GSP plugin task crashed. VM shutdown is required.
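
To check a vmware.log for the messages above, a small script can help. The following is a minimal sketch, not an official tool: the function name scan_vmware_log and the signature labels are hypothetical, and the matched strings are copied verbatim from the excerpts in this article. Actual log wording may differ between vGPU driver versions.

```python
# Signature strings copied from the log excerpts in this article.
# The dictionary keys are hypothetical labels chosen for illustration.
SIGNATURES = {
    "sysmem_tracking": "Copy sysmem tracking failed",
    "rpc_recv": "CPU RPC async recv response failed",
    "stop_and_copy_rpc": "Recv MIGRATION Stop and Copy RPC response failed",
    "stop_and_copy": "stop and copy failed",
    "gsp_crash": "GSP plugin task crashed",
}


def scan_vmware_log(lines):
    """Return {label: [1-based line numbers]} for matching vmiop_log lines."""
    hits = {}
    for lineno, line in enumerate(lines, start=1):
        # Only lines written by the vmiop module are of interest here.
        if "vmiop_log" not in line:
            continue
        for label, text in SIGNATURES.items():
            if text in line:
                hits.setdefault(label, []).append(lineno)
    return hits
```

The function accepts any iterable of lines, so an open file object for the VM's vmware.log can be passed directly. An empty result only means that none of these exact strings were found. It does not rule out other causes of a migration timeout.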

Environment

VMware vSphere ESXi

Cause

During the migration phase of vMotion or Storage vMotion, the vmiop module must continuously synchronize the frame buffer and memory state. When the NVIDIA GSP plugin is not functioning normally, this synchronization cannot be performed. As a result, the Stop and Copy phase times out and the vMotion or Storage vMotion task fails.

Note: Cold migration (migration of a powered-off virtual machine) does not require memory tracking by vmiop, so this issue does not occur with cold migration.

Resolution

This issue is caused by the behavior of the vGPU module provided by NVIDIA, so no configuration change on the vSphere side resolves it permanently.
If the issue continues to occur, contact NVIDIA support.

Additional Information

Japanese version: vMotion or Storage vMotion of a Virtual Machine Assigned with NVIDIA vGPU Fails with a Timeout (439450)