VM with an NVIDIA GPU in passthrough mode fails to power on with the error: "Module 'DevicePowerOn' power on failed".

Article ID: 384765

Products

VMware vSphere ESXi 7.0
VMware vSphere ESXi 8.0

Issue/Introduction

  • VM fails to power on when an NVIDIA Tesla GPU is used in passthrough mode on ESXi.
  • In the VM's vmware.log, you see entries similar to:

[YYYY-MM-DDTHH:MM:SS] In(05) vmx - Module 'DevicePowerOn' power on failed

  • In /var/run/log/vmkernel.log, you see entries similar to:

[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 4636: pcipDevInfo(0x431f80001610) allocated for xxxx:xx:xx.x
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCI: 1363: Skipping device reset on xxxx:xx:xx.x; no reset method found
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 5333: xxxx:xx:xx.x :Reset for device failed with Not supported
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 1039: pcipdevInfo: 0x431f80001610 (xxxx:xx:xx.x), state 0, destroyed
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 4636: pcipDevInfo(0x431f80001610) allocated for xxxx:xx:xx.x
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCI: 1363: Skipping device reset on xxxx:xx:xx.x; no reset method found
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 5333: xxxx:xx:xx.x :Reset for device failed with Not supported
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 1039: pcipdevInfo: 0x431f80001610 (xxxx:xx:xx.x), state 0, destroyed
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 4636: pcipDevInfo(0x431f80001610) allocated for xxxx:xx:xx.x
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCI: 1363: Skipping device reset on xxxx:xx:xx.x; no reset method found
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 5333: xxxx:xx:xx.x :Reset for device failed with Not supported
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 1039: pcipdevInfo: 0x431f80001610 (xxxx:xx:xx.x), state 0, destroyed
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCIPassthru: 4636: pcipDevInfo(0x431f80001610) allocated for xxxx:xx:xx.x
[YYYY-MM-DDTHH:MM:SS] In(182) vmkernel: cpu5:2104019)PCI: 1363: Skipping device reset on xxxx:xx:xx.x; no reset method found
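
To check a host for this signature, the reset-failure entries shown above can be counted with grep. The following is a minimal sketch, not a VMware-provided tool: the function name count_reset_failures is hypothetical, and the default path is the vmkernel.log location named above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESXi): counts log lines that contain the
# passthrough reset-failure message quoted in this article.
# Argument 1 (optional): log file to scan; defaults to the vmkernel.log path above.
count_reset_failures() {
    log_file="${1:-/var/run/log/vmkernel.log}"
    # -F matches the message as a fixed string; -c prints only the match count.
    grep -c -F 'Reset for device failed with Not supported' "$log_file"
}
```

Running count_reset_failures with no arguments on the ESXi host prints the number of matching entries. A non-zero count, together with the vmware.log entry above, is consistent with the issue described here.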

Cause

  • During VM power-on, ESXi resets the passthrough GPU, which triggers hot-plug interrupts on the PCIe slot above the GPU.
  • A hot-remove interrupt raised during the GPU reset causes the PCIe device to disappear, so the power-on fails.

Resolution

This is a known issue. Currently there is no resolution.

To work around this issue, follow the steps below.

NOTE: Setting enablePCIEHotplug=FALSE prevents ESXi from enabling PCIe hot-plug during server boot, even if the hardware supports it.

  1. Disable PCIe Hot-Plug by running the following command on the ESXi host:

    esxcli system settings kernel set -s "enablePCIEHotplug" -v "FALSE"

  2. Modify the passthrough configuration:

    • Back up the file:
      cp /etc/vmware/passthru.map /etc/vmware/passthru.map.bak

    • Edit /etc/vmware/passthru.map using vi.
    • Locate the line:

      # NVIDIA (FLR issue on Ampere and Hopper GPUs)
      10de ffff bridge false

    • Change it to:
       
      # NVIDIA (FLR issue on Ampere and Hopper GPUs)
      10de ffff default false

  3. Reboot the ESXi host to apply the changes.
  4. Verify that PCIe device hot-plug is disabled by entering the command:

    esxcli system settings kernel list -o enablePCIEHotplug

    The value FALSE should be displayed under the Configured column:

    Name               Type  Configured  Runtime  Default  Description
    -----------------  ----  ----------  -------  -------  -----------------------------------
    enablePCIEHotplug  Bool  FALSE       FALSE    TRUE     Enable PCI-E Native Hotplug support
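
Steps 2 and 4 above can also be scripted. The following is a minimal sketch under stated assumptions, not a VMware-provided procedure: the function names patch_passthru_map and configured_value are hypothetical, the NVIDIA entry is assumed to appear in passthru.map exactly as shown in step 2 (any whitespace between fields), and the esxcli output is assumed to use the column order shown in step 4.

```shell
#!/bin/sh
# Hypothetical helper: backs up the map file, then rewrites the NVIDIA entry
# "10de ffff bridge false" to "10de ffff default false".
# Argument 1 (optional): map file; defaults to the path used in step 2.
patch_passthru_map() {
    map_file="${1:-/etc/vmware/passthru.map}"
    cp "$map_file" "$map_file.bak" || return 1
    # Read from the backup and write the edited copy over the original.
    sed 's/^\([[:space:]]*10de[[:space:]]\{1,\}ffff[[:space:]]\{1,\}\)bridge\([[:space:]]\{1,\}false\)/\1default\2/' \
        "$map_file.bak" > "$map_file" || return 1
    # Print the edited entry; returns non-zero if it is not present.
    grep -E '^[[:space:]]*10de[[:space:]]+ffff[[:space:]]+default[[:space:]]+false' "$map_file"
}

# Hypothetical helper: reads the step 4 esxcli output on stdin and prints the
# third column (Configured) of the enablePCIEHotplug row.
configured_value() {
    awk '$1 == "enablePCIEHotplug" { print $3 }'
}
```

On the host, this could be used as `patch_passthru_map` before the reboot in step 3, and as `esxcli system settings kernel list -o enablePCIEHotplug | configured_value` after it; the second command is expected to print FALSE once the workaround is in place.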