Virtual Machines utilizing NVIDIA vGPU become unresponsive after vMotion during active video playback

Article ID: 438193

Products

VMware vSphere ESXi
VMware vSphere ESX 8.x

Issue/Introduction

  • Windows Virtual Machines (VMs) that use NVIDIA vGPU profiles may complete a vMotion migration successfully in vCenter and then become unresponsive immediately afterward. The guest operating system eventually crashes with a Blue Screen of Death (BSOD), bug check 0x116 (VIDEO_TDR_FAILURE, listed as VIDEO_TDR_ERROR in older documentation), and the VM requests a hard reset.
  • This issue typically stems from a resource initialization failure during the "switchover" phase of the migration.
  • The following timeline, reconstructed from the impacted VM's vmware.log file (/vmfs/volumes/datastore_name/vm_name/vmware.log), outlines the sequence of events from the migration resume phase through to the final VM reset:

Migration Completion and Resume Initiation

The destination host successfully creates IOMMU mappings and prepares to restore the vGPU state.

vmx - MigrateSetInfo: state=MIGRATE_FROM_VMX_INIT srcIp=[IP_ADDRESS] dstIp=[IP_ADDRESS] 
vcpu-0 - PCIPassthru: successfully created the IOMMU mappings
vcpu-0 - VMIOP: notifying plugin state unstun

vGPU Resource Allocation Failures

The NVIDIA vmiop plugin attempts to load the guest driver and allocate GPU memory/resources on the destination hardware but fails with error code 0x40.

vthread-18277215 - vmiop_log: (0x0): Guest driver loaded.
vthread-18277215 - vmiop_log: (0x0): NvRmAlloc failed with error 0x40
vthread-18277215 - vmiop_log: (0x0): VGPU message 6 failed, result code: 0x40
vthread-18277215 - vmiop_log: (0x0): VGPU message 9 failed, result code: 0xff000005

Migration Resume Failure

Because the vGPU state cannot be restored, the migration resume task fails, leading to an asynchronous RPC timeout.

vthread-18277215 - vmiop_log: (0x0): RmControl 0xb06f0112 failed with error 0x57
vcpu-0 - vmiop_log: (0x0): CPU RPC async recv response failed: 0x1
vcpu-0 - vmiop_log: (0x0): Recv MIGRATION Resume response failed, 0x1

Guest OS Crash (BSOD 0x116)

Windows Timeout Detection and Recovery (TDR) detects that the graphics hardware is unresponsive, and the guest writes the crash information to the synthetic MSRs.

vcpu-0 - WinBSOD: Synthetic MSR[0x40000100] 0x116
vcpu-0 - WinBSOD: Synthetic MSR[0x40000101] 0xffff8f0bac062010
vcpu-0 - WinBSOD: Synthetic MSR[0x40000102] 0xfffff80361539980
vcpu-0 - WinBSOD: Synthetic MSR[0x40000103] 0xffffffffc000009a
vcpu-0 - WinBSOD: Synthetic MSR[0x40000104] 0x4
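
The WinBSOD lines above can be decoded with a short script. The following is a minimal sketch, not a supported tool. It assumes the Hyper-V guest crash MSR convention, in which MSR 0x40000100 carries the bug check code and MSRs 0x40000101 through 0x40000104 carry bug check parameters 1 through 4, and it assumes Microsoft's documented layout for bug check 0x116, in which parameter 3 is the NTSTATUS of the last failed operation. The function name `decode_winbsod` and the lookup tables are hypothetical and cover only the values shown in this article.

```python
import re

# Assumption: Hyper-V guest crash MSRs. 0x40000100 holds the bug check code,
# 0x40000101-0x40000104 hold bug check parameters 1-4.
CRASH_MSR_BASE = 0x40000100

# Lookup tables limited to the values that appear in this article.
BUGCHECK_NAMES = {0x116: "VIDEO_TDR_FAILURE"}
NTSTATUS_NAMES = {0xC000009A: "STATUS_INSUFFICIENT_RESOURCES"}

MSR_LINE = re.compile(
    r"WinBSOD: Synthetic MSR\[(0x[0-9a-fA-F]+)\]\s+(0x[0-9a-fA-F]+)"
)


def decode_winbsod(log_text):
    """Return the bug check code and parameters found in WinBSOD log lines."""
    values = {}
    for msr, value in MSR_LINE.findall(log_text):
        index = int(msr, 16) - CRASH_MSR_BASE
        if 0 <= index <= 4:
            values[index] = int(value, 16)
    if 0 not in values:
        return None  # no bug check code was logged
    params = [values.get(i) for i in range(1, 5)]
    # Parameter 3 is logged sign-extended to 64 bits; keep the low 32 bits
    # so it can be compared against NTSTATUS values.
    status = params[2] & 0xFFFFFFFF if params[2] is not None else None
    return {
        "code": values[0],
        "name": BUGCHECK_NAMES.get(values[0], "unknown"),
        "params": params,
        "status": status,
        "status_name": NTSTATUS_NAMES.get(status, "unknown"),
    }


SAMPLE = """\
vcpu-0 - WinBSOD: Synthetic MSR[0x40000100] 0x116
vcpu-0 - WinBSOD: Synthetic MSR[0x40000101] 0xffff8f0bac062010
vcpu-0 - WinBSOD: Synthetic MSR[0x40000102] 0xfffff80361539980
vcpu-0 - WinBSOD: Synthetic MSR[0x40000103] 0xffffffffc000009a
vcpu-0 - WinBSOD: Synthetic MSR[0x40000104] 0x4
"""

result = decode_winbsod(SAMPLE)
print(result["name"], hex(result["status"]), result["status_name"])
# prints: VIDEO_TDR_FAILURE 0xc000009a STATUS_INSUFFICIENT_RESOURCES
```

Under these assumptions, the third parameter decodes to STATUS_INSUFFICIENT_RESOURCES, which is consistent with the vGPU resource allocation failure logged earlier in the timeline.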

Automatic VM Recovery (Hard Reset)

The guest OS triggers a hard reset, and the SVGA device is disabled as the VM reboots.

svga - SVGA disabling SVGA
vcpu-0 - Chipset: The guest has requested that the virtual machine be hard reset.
vcpu-0 - vmiop_log: (0x0): Copy sysmem tracking failed, 0x7
vmx - Vix: [vmxCommands.c:686]: VMAutomation_Reset. Trying hard reset
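
To check whether a given vmware.log matches this timeline, the key messages can be searched for in order. The sketch below is illustrative only. The marker strings are copied from the excerpts in this article, while the marker names and the functions `match_signature` and `is_affected` are hypothetical. It uses substring matching, so it should tolerate the timestamp and log-level prefixes present in a real log, but that has not been verified against every ESXi build.

```python
# Marker strings taken from the log excerpts in this article, in the order
# they are expected to appear. The short names are illustrative only.
SIGNATURE = [
    ("vgpu_alloc_failed", "NvRmAlloc failed with error 0x40"),
    ("resume_failed", "Recv MIGRATION Resume response failed"),
    ("guest_bsod_0x116", "WinBSOD: Synthetic MSR[0x40000100] 0x116"),
    ("hard_reset", "The guest has requested that the virtual machine be hard reset"),
]


def match_signature(lines):
    """Return the names of the markers found, requiring them to appear in order."""
    found = []
    pending = list(SIGNATURE)
    for line in lines:
        # Only the next expected marker is considered, which enforces ordering.
        if pending and pending[0][1] in line:
            found.append(pending.pop(0)[0])
    return found


def is_affected(lines):
    """True when every marker in SIGNATURE was seen in the expected order."""
    return len(match_signature(lines)) == len(SIGNATURE)


SAMPLE_LOG = [
    "vcpu-0 - VMIOP: notifying plugin state unstun",
    "vthread-18277215 - vmiop_log: (0x0): NvRmAlloc failed with error 0x40",
    "vcpu-0 - vmiop_log: (0x0): Recv MIGRATION Resume response failed, 0x1",
    "vcpu-0 - WinBSOD: Synthetic MSR[0x40000100] 0x116",
    "vcpu-0 - Chipset: The guest has requested that the virtual machine be hard reset.",
]

print(match_signature(SAMPLE_LOG))
print(is_affected(SAMPLE_LOG))
```

For a real log, the same functions accept an open file object, for example `with open(path, errors="replace") as f: is_affected(f)`, where `path` points to the VM's vmware.log.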

Environment

VMware vSphere ESXi

Cause

The primary cause is the concurrent enablement of VMware Native 3D Support (SVGA 3D) and an NVIDIA vGPU profile.

When both are active, the destination ESXi host encounters a resource conflict during vMotion that prevents the NVIDIA driver from successfully mapping the required vGPU memory segments.

This results in the NvRmAlloc failure (0x40) and subsequent Windows TDR BSOD.
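
The conflicting configuration can also be spotted by inspecting a copy of the VM's .vmx file. The sketch below assumes that SVGA 3D support is recorded as `mks.enable3d = "TRUE"` and that a vGPU profile is recorded under a `pciPassthruN.vgpu` key. Both key names are assumptions that should be confirmed against the actual .vmx file of an impacted VM, and the profile value in the sample is an example only. The functions `parse_vmx` and `has_3d_vgpu_conflict` are hypothetical helpers.

```python
import re

# Matches simple .vmx entries of the form: key = "value"
VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Parse .vmx text into a dict with lower-cased keys."""
    settings = {}
    for line in text.splitlines():
        match = VMX_LINE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def has_3d_vgpu_conflict(settings):
    """True when SVGA 3D is enabled and at least one vGPU profile is assigned."""
    # Assumed key name for the "Enable 3D Support" checkbox.
    svga_3d = settings.get("mks.enable3d", "").upper() == "TRUE"
    # Assumed key pattern for an assigned vGPU profile.
    vgpu = any(
        re.fullmatch(r"pcipassthru\d+\.vgpu", key) and value
        for key, value in settings.items()
    )
    return svga_3d and vgpu


# Example content only; the profile name is a placeholder value.
SAMPLE_VMX = '''
mks.enable3d = "TRUE"
pciPassthru0.present = "TRUE"
pciPassthru0.vgpu = "grid_p40-2q"
'''

print(has_3d_vgpu_conflict(parse_vmx(SAMPLE_VMX)))
```

A result of `True` indicates the combination described in this section. The check reads the file only and does not change the VM configuration.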

Resolution

To resolve this issue, the VMware native 3D acceleration must be disabled, as the NVIDIA vGPU profile provides all necessary 3D capabilities.

  1. Log in to the vSphere Client for the vCenter Server that manages the impacted Virtual Machine.

  2. Locate the impacted Virtual Machine by navigating through the Hosts and Clusters inventory tree, or by using the global search bar at the top of the client.

  3. Right-click the Virtual Machine, navigate to Power, and select Power Off.

  4. Right-click the Virtual Machine, select Edit Settings, and open the Virtual Hardware tab.

  5. Locate the Video Card and expand its properties.

  6. Uncheck the "Enable 3D Support" checkbox.

  7. Save the settings and power on the VM.

  8. Verify vMotion functionality by migrating the VM during active video playback and confirming that it remains responsive.

Additional Information

  • NVIDIA Bug 2369683: Documentation on GPU resource availability conflicts with Instant Clones.