Error while powering on the VMs: "The 'libnvidia-vgx.so' plugin for vGPU could not be initialized. Failed to start the virtual machine. Module DevicePowerOn power on failed."

Article ID: 386583

Products

VMware vSphere ESXi

Issue/Introduction

1. Virtual machines cannot access the vGPU, even though the ESXi host shows it in its configuration.

2. iDRAC and the ESXi configuration both report the correct amount of memory installed on the NVIDIA graphics cards, but virtual machines are not allocated any of that memory.

3. Before the ESXi host is rebooted, the NVIDIA graphics memory is reported as 0.

4. After the ESXi host is rebooted, the NVIDIA graphics memory shows the correct value.

5. As a result, powering a virtual machine off (or shutting it down) and then powering it back on fails with the following message:

"The 'libnvidia-vgx.so' plugin for vGPU could not be initialized. Failed to start the virtual machine. Module DevicePowerOn power on failed."

Environment

VMware ESXi 8.0.2 Build: 22380479

Cause

1. Multiple Xid errors are reported in the vmkernel logs at the time of the issue:

vmkernel.4:3757:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu26:2108263)NVRM: Xid (PCI:0000:21:00): 31, pid=XXXXXXX, name=, Ch 0000042c

vmkernel.5:52557:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu64:2108277)NVRM: Xid (PCI:0000:e2:00): 44, pid=XXXXXXX, name=, 0420 0000c797 00000000 00000000

vmkernel.5:52558:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu64:2108277)NVRM: Xid (PCI:0000:e2:00): 31, pid=XXXXXXX, name=, Ch 00000420

vmkernel.6:18323:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu111:2108852)NVRM: Xid (PCI:0000:e2:00): 44, pid=XXXXXXX, name=, 0023 0000c797 00000000 00000000

vmkernel.6:18324:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu111:2108852)NVRM: Xid (PCI:0000:e2:00): 31, pid=XXXXXXX, name=, Ch 00000023

2. Depending on timing, Xids can disrupt GPU resource tracking because communication with the NVIDIA management layers fails.

3. These Xids are workload specific and need to be triaged by NVIDIA. For example:

* Xid 31 GPU memory page fault
* Xid 44 Graphics Engine fault during context switch


4. The VM power-on fails even though hostd is reporting the vGPU devices:

hostd.log:2769:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102806]: [Originator@6876 sub=Libs] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400

hostd.log:5733:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102845]: [Originator@6876 sub=Libs] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400

hostd.log:7152:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102847]: [Originator@6876 sub=Libs] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400

hostd.log:7388:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102856]: [Originator@6876 sub=Libs] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400

hostd.log:7434:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102834]: [Originator@6876 sub=Libs opID=XXXXXXX-XXXXXXX-XXXX-X:XXXXXXXXX-XX-X-X sid=XXXXXXXX user=vpxuser:XXXX] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400
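
When collecting evidence for a support case, it can help to summarize which physical GPU raised which Xid and how often. The following is a minimal sketch, not a VMware or NVIDIA tool. It assumes the vmkernel log lines have already been copied off the host and follow the same "NVRM: Xid (PCI:...): <code>," layout as the excerpts above. The names tally_xids, format_report, XID_NOTES and SAMPLE are hypothetical, and the descriptions cover only the two Xid codes named in this article.

```python
import re
from collections import Counter

# Xid descriptions as given in this article; other codes are not described here.
XID_NOTES = {
    31: "GPU memory page fault",
    44: "Graphics Engine fault during context switch",
}

# Matches the fragment "NVRM: Xid (PCI:0000:e2:00): 31," in a vmkernel log line.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")

# Sample lines taken from the log excerpts in this article (redacted fields kept as-is).
SAMPLE = [
    "XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu26:2108263)NVRM: Xid (PCI:0000:21:00): 31, pid=XXXXXXX, name=, Ch 0000042c",
    "XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu64:2108277)NVRM: Xid (PCI:0000:e2:00): 44, pid=XXXXXXX, name=, 0420 0000c797 00000000 00000000",
    "XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu64:2108277)NVRM: Xid (PCI:0000:e2:00): 31, pid=XXXXXXX, name=, Ch 00000420",
    "XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu111:2108852)NVRM: Xid (PCI:0000:e2:00): 44, pid=XXXXXXX, name=, 0023 0000c797 00000000 00000000",
    "XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu111:2108852)NVRM: Xid (PCI:0000:e2:00): 31, pid=XXXXXXX, name=, Ch 00000023",
]


def tally_xids(lines):
    """Count Xid events per (PCI address, Xid code); non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def format_report(counts):
    """Return one summary row per (PCI address, Xid code), sorted by address then code."""
    rows = []
    for (pci, xid), n in sorted(counts.items()):
        note = XID_NOTES.get(xid, "not described in this article")
        rows.append(f"{pci}  Xid {xid}  count={n}  {note}")
    return rows


if __name__ == "__main__":
    for row in format_report(tally_xids(SAMPLE)):
        print(row)
```

To run it against real logs, pass the lines of a vmkernel log file to tally_xids instead of SAMPLE. The summary is only a convenience for the NVIDIA case; it does not interpret the Xids.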

Resolution

1. The errors were reported against the physical GPUs, so this is a hardware-related concern.

2. Contact NVIDIA to validate what is causing these Xid errors.

3. The workload may be too heavy for this system. Try reducing the VM CPU configuration, which helps reduce latency when switching between VMs.