A virtual machine using an NVIDIA passthrough GPU suddenly lost access to the GPU device. Inside the guest OS, the GPU no longer appeared in the output of the nvidia-smi command, and the following error was displayed:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
The following error is observed in the vmkernel log on the ESXi host:
Skipping device reset on <PCIe ID> because PCIe link to the device is down.
Rebooting the virtual machine did not resolve the issue.
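To confirm the symptom from both sides, the nvidia-smi command can be run in the guest and the vmkernel log can be searched on the ESXi host. This is a minimal sketch; the log path and grep usage are standard on ESXi, but the exact wording of the link-down message may vary between builds.

# Inside the guest OS: the driver reports it cannot reach the device
nvidia-smi

# On the ESXi host: search the vmkernel log for the link-down message
grep -i "PCIe link" /var/log/vmkernel.log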
ESXi host with PCI passthrough enabled
Virtual machine using an NVIDIA GPU via passthrough
NVIDIA driver installed inside the guest OS
vSphere 8.x / 9.x
No recent configuration changes on vSphere or in the VM
The NVIDIA passthrough GPU became unresponsive at the hardware or firmware level. When the physical GPU stops responding, the VM cannot access the device, and the driver inside the guest OS can no longer communicate with the GPU. A guest OS reboot cannot recover the GPU because the hardware remains in a stuck state.
This behavior indicates a potential GPU hardware or firmware stability issue.
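As an optional host-side check (a sketch, not an official diagnostic procedure), the passthrough device can be listed from the ESXi shell to confirm whether it is still enumerated and what state it reports:

# List PCI devices on the host and filter for NVIDIA entries
lspci | grep -i nvidia

# Show full details for all PCI devices; locate the GPU by its PCIe address
# and review its reported state and current owner (VM passthrough)
esxcli hardware pci list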
Reboot the ESXi host to restore access to the NVIDIA passthrough GPU so that the PCI device can be fully reinitialized during host startup.
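A typical sequence, assuming any other VMs on the host have first been shut down or migrated, is to enter maintenance mode and then reboot; both commands below are standard esxcli operations:

# Enter maintenance mode once all VMs on the host are powered off or migrated
esxcli system maintenanceMode set --enable true

# Reboot the host so the PCI device is reinitialized during startup
esxcli system shutdown reboot --reason "Reinitialize unresponsive passthrough GPU"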
Since the GPU dropped unexpectedly during normal VM operation, it is recommended to engage the server hardware vendor to check the GPU hardware and firmware health (see the support-bundle example after this list), including:
GPU card stability
PCIe slot health
GPU firmware/BIOS version
Thermal and power conditions
Known GPU reset or firmware lock-up issues
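Before engaging the vendor, it is helpful to capture the host state while the issue is present. A minimal sketch, assuming shell access to the ESXi host; vm-support writes a compressed bundle whose name and location are chosen by the tool:

# Generate an ESXi support bundle (includes vmkernel logs and PCI/hardware data)
# to share with the hardware vendor or the support case
vm-support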
Once a passthrough GPU becomes unresponsive, ESXi and the VM cannot reset the device without a full host reboot.
Similar symptoms may occur if the GPU firmware locks up, the card overheats, or the PCIe link becomes unstable.