2. Although iDRAC and ESXi report the correct amount of memory installed on the NVIDIA graphics cards, virtual machines do not receive allocations from this memory.
3. Before ESXi is rebooted, the NVIDIA graphics memory shows 0.
4. After ESXi is rebooted, the NVIDIA graphics memory shows the correct value.
5. As a result, powering a virtual machine back on after it has been powered off or shut down fails with the following message:
"The 'libnvidia-vgx.so' plugin for vGPU could not be initialized. Failed to start the virtual machine. Module DevicePowerOn power on failed."
VMware ESXi 8.0.2 Build: 22380479
1. Multiple Xids are reported in the vmkernel logs at the time of the issue:
vmkernel.4:3757:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu26:2108263)NVRM: Xid (PCI:0000:21:00): 31, pid=XXXXXXX, name=, Ch 0000042c
vmkernel.5:52557:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu64:2108277)NVRM: Xid (PCI:0000:e2:00): 44, pid=XXXXXXX, name=, 0420 0000c797 00000000 00000000
vmkernel.5:52558:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu64:2108277)NVRM: Xid (PCI:0000:e2:00): 31, pid=XXXXXXX, name=, Ch 00000420
vmkernel.6:18323:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu111:2108852)NVRM: Xid (PCI:0000:e2:00): 44, pid=XXXXXXX, name=, 0023 0000c797 00000000 00000000
vmkernel.6:18324:XXXX-XX-XXTXX:XX:XX.XXXZ In(182) vmkernel: cpu111:2108852)NVRM: Xid (PCI:0000:e2:00): 31, pid=XXXXXXX, name=, Ch 00000023
2. Depending on timing, Xids may cause issues with GPU resource tracking due to failed communication with the NVIDIA management layers.
3. NVIDIA needs to triage the Xids, which are workload specific.
For example:
* Xid 31: GPU memory page fault
* Xid 44: Graphics Engine fault during context switch
4. The VM power-on fails even though hostd reports the vGPU devices:
hostd.log:2769:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102806]: [Originator@6876 sub=Libs] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400
hostd.log:5733:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102845]: [Originator@6876 sub=Libs] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400
hostd.log:7152:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102847]: [Originator@6876 sub=Libs] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400
hostd.log:7388:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102856]: [Originator@6876 sub=Libs] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400
hostd.log:7434:XXXX-XX-XXTXX:XX:XX.XXXZ In(166) Hostd[2102834]: [Originator@6876 sub=Libs opID=XXXXXXX-XXXXXXX-XXXX-X:XXXXXXXXX-XX-X-X sid=XXXXXXXX user=vpxuser:XXXX] GraphicsInfo: RefreshHostGraphics vsgaDevices: 0, vgpuDevices 3, dvmDevices 0, numDevices 99, vimopDevCptSaveRate 6400, vmiopDevCptRestoreRate 6400
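When many vmkernel log files are involved, it can help to tally the Xids per GPU before handing the data to NVIDIA. The following is a minimal sketch, not a VMware or NVIDIA tool: it assumes the log lines follow the `NVRM: Xid (PCI:...): <code>,` layout seen in the excerpts above, and the names `count_xids` and `summarize` are hypothetical helpers introduced for illustration. Only the two Xid descriptions listed above are labeled.

```python
import re
from collections import Counter

# Assumption: Xid records look like the vmkernel excerpts above, e.g.
# "NVRM: Xid (PCI:0000:e2:00): 44, pid=..., name=, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")

# Descriptions taken from the list above; any other code is left unlabeled.
XID_NAMES = {
    31: "GPU memory page fault",
    44: "Graphics Engine fault during context switch",
}


def count_xids(lines):
    """Return a Counter keyed by (PCI address, Xid code)."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def summarize(counts):
    """Return one readable row per (PCI address, Xid code), sorted."""
    rows = []
    for (pci, xid), n in sorted(counts.items()):
        label = XID_NAMES.get(xid, "unlabeled")
        rows.append(f"{pci} Xid {xid} ({label}): {n}")
    return rows
```

To use it, pass any iterable of log lines, for example an open file handle for a vmkernel log, to `count_xids` and print the rows returned by `summarize`. Grouping by PCI address shows whether the faults are concentrated on one physical GPU or spread across several.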
1. Because the errors appear on the physical GPUs, this is a hardware-related concern.
2. Contact NVIDIA to validate what is causing these Xid errors.
3. The workload may be too heavy for this system. Try reducing the VM CPU configuration, which helps reduce latency when switching between VMs.