Passthrough NVIDIA GPU becomes unresponsive and disappears inside the virtual machine


Article ID: 418517



Products

VMware vSphere ESXi

Issue/Introduction

A virtual machine using an NVIDIA passthrough GPU suddenly lost access to the GPU device. Inside the guest OS, the GPU no longer appeared in the output of the nvidia-smi command, and the following error was displayed:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. 
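The symptom above can also be detected programmatically inside the guest OS, for example from a monitoring job. The following is a minimal Python sketch, not part of any NVIDIA or VMware tooling. The function names are hypothetical, and it assumes that nvidia-smi is on the PATH and that the failure text contains the message quoted above.

```python
import subprocess

# Substring of the error quoted in this article. Matching on it is an
# assumption about how the failure is reported, not a documented interface.
FAILURE_MARKER = "couldn't communicate with the NVIDIA driver"


def driver_comm_failed(output: str) -> bool:
    """Return True if the text contains the nvidia-smi communication error."""
    return FAILURE_MARKER in output


def check_gpu(timeout_s: float = 30.0) -> bool:
    """Run nvidia-smi and return True if the GPU appears reachable."""
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # nvidia-smi is missing or hung: treat the GPU as not reachable.
        return False
    combined = result.stdout + result.stderr
    return result.returncode == 0 and not driver_comm_failed(combined)
```

Calling check_gpu() periodically and alerting when it returns False gives an early signal that the passthrough device has dropped. It does not distinguish a driver problem from a hardware problem, so the vmkernel log still needs to be checked.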

The following error was observed in the vmkernel log on the ESXi host:

Skipping device reset on <PCIe ID> because PCIe link to the device is down.
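To find which devices logged this message, the log can be scanned with a short script. The following Python sketch is illustrative only: the function names are hypothetical, and the device identifier is captured as any run of non-whitespace characters because its exact format is not specified in this article.

```python
import re
from typing import Iterable, List

# Pattern built from the vmkernel message quoted in this article. The device
# identifier is matched loosely (assumption: it contains no whitespace).
LINK_DOWN_RE = re.compile(
    r"Skipping device reset on (\S+) because PCIe link to the device is down"
)


def find_link_down_devices(lines: Iterable[str]) -> List[str]:
    """Return device identifiers from matching log lines, in first-seen order."""
    seen: List[str] = []
    for line in lines:
        match = LINK_DOWN_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def scan_log(path: str) -> List[str]:
    """Scan a log file (path supplied by the caller) for link-down messages."""
    with open(path, "r", errors="replace") as handle:
        return find_link_down_devices(handle)
```

scan_log can be pointed at a copy of the vmkernel log, for example one extracted from a support bundle. The log location (commonly /var/log/vmkernel.log on the host) is an assumption to verify for the environment in question.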

Rebooting the virtual machine did not resolve the issue.

Environment

  • ESXi host with PCI passthrough enabled

  • Virtual machine using an NVIDIA GPU via passthrough

  • NVIDIA driver installed inside the guest OS

  • vSphere 8.x / 9.x

  • No recent configuration changes on vSphere or in the VM

 

Cause

The NVIDIA passthrough GPU became unresponsive at the hardware or firmware level. When the physical GPU stops responding, the VM cannot access the device, and the driver within the guest OS fails to communicate with the GPU. A guest OS reboot cannot recover the GPU because the hardware remains in a stuck state.

This behavior indicates a potential GPU hardware or firmware stability issue.

Resolution

Reboot the ESXi host so that the PCI device is fully reinitialized during host startup. This restores access to the NVIDIA passthrough GPU.

Because the GPU dropped unexpectedly during normal VM operation, engage the server hardware vendor to check GPU hardware and firmware health, including:

  • GPU card stability

  • PCIe slot health

  • GPU firmware/BIOS version

  • Thermal and power conditions

  • Known GPU reset or firmware lock-up issues
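Some of the items above (firmware version, temperature, power draw, PCIe link state) can be recorded from inside the guest while the GPU is still reachable, which gives the hardware vendor a baseline to compare against. The following Python sketch is one possible way to do that, not a vendor-endorsed procedure. The function names are hypothetical, and the query fields should be confirmed against `nvidia-smi --help-query-gpu` for the installed driver.

```python
import csv
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi --query-gpu. Availability can vary by GPU
# and driver version (assumption: all six are supported on the target system).
FIELDS = [
    "name",
    "temperature.gpu",
    "power.draw",
    "vbios_version",
    "pcie.link.gen.current",
    "pcie.link.width.current",
]


def parse_query_output(text: str) -> List[Dict[str, str]]:
    """Parse csv,noheader output into one dict per GPU (values kept as strings)."""
    rows = csv.reader(text.strip().splitlines(), skipinitialspace=True)
    return [dict(zip(FIELDS, (v.strip() for v in row))) for row in rows if row]


def collect_gpu_snapshot() -> List[Dict[str, str]]:
    """Run nvidia-smi inside the guest and return the parsed snapshot."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if nvidia-smi fails
    )
    return parse_query_output(result.stdout)
```

Values are kept as strings because nvidia-smi may report unsupported fields as text rather than numbers. Once the GPU has dropped, collect_gpu_snapshot raises an exception instead of returning data, so the snapshot is only useful when taken on a schedule beforehand.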

Additional Information

 

  • Once a passthrough GPU becomes unresponsive, ESXi and the VM cannot reset the device without a full host reboot.

  • Similar symptoms may occur if the GPU firmware locks up, overheats, or loses PCIe link stability.