vCenter fails to display or enable NVIDIA H100 NVSwitch devices with "GetDeviceID failed" error



Article ID: 437683



Products

VMware vSphere ESXi

Issue/Introduction

  • In vCenter Server, NVIDIA GH100 [H100 NVSwitch] devices are not displayed correctly in the PCI Device interface or appear inconsistently.

  • NVIDIA GH100 (H100 SXM5 80GB) and H100 NVSwitch devices are enumerated correctly on the ESXi command line (via lspci) but do not appear correctly in the vCenter Server hardware inventory.

  • When attempting to toggle passthrough for NVSwitch devices in the vSphere Client, the operation fails with: 

    "Operation failed! An error occurred during host configuration: Operation failed, diagnostics report: GetDeviceID failed."



  • The /var/run/log/hostd.log file on the affected host contains entries similar to the following:

    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: [Originator@6876 sub=Hostsvc opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] Config wants to enable passthrough for  ####:##:##:#
    YYYY-MM-DDTHH:MM:SS Wa(164) Hostd[#####]: [Originator@6876 sub=Libs opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] VmkCtl: GetDeviceID failed for ####:##:##:# Device-ID not found!
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: [Originator@6876 sub=Hostsvc opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] Re-fetching all PCI devices since SR-IOV configuration has been updated
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: [Originator@6876 sub=Hostsvc.AssignableHardwareProvider opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] AH Generating NVIDIA VDG device complex spec
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: [Originator@6876 sub=Libs opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] NvidiaVgpuInfo: Failed to open nvidia library
    YYYY-MM-DDTHH:MM:SS Wa(164) Hostd[#####]: [Originator@6876 sub=Libs opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] NvidiaDeviceGroupInfo: vgpuInfo not available.
    YYYY-MM-DDTHH:MM:SS Wa(164) Hostd[#####]: [Originator@6876 sub=Hostsvc.AssignableHardwareProvider opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] AH Device complex generator directory path /usr/lib/vmware/vdg/bin doesn't exist or is not a directory
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: [Originator@6876 sub=Hostsvc.AssignableHardwareProvider opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] AH Dtree deviceGroup identical (devices/types): pci: 8/1 QAT: 0/0 VDG: 0/0
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: [Originator@6876 sub=Hostsvc opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] Populating NUMA PCI ids ...
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: [Originator@6876 sub=AdapterServer opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] AdapterServer caught exception; <<52####d9-a##0-1##9-1##d-02###61d15, <TCP '127.0.0.1 : 8307'>, <TCP '127.0.0.1 : 51549'>>, ha-pcipassthrusystem, vim.host.PciPassthruSystem.updatePassthruConfig,<vim.version.v8_0 internal, 8.0.3.0>, [N11HostdCommon18VmomiAdapterServer19ActivationResponderE:0x00000090abc6a3f8]>, N7Hostsvc21HaPlatformConfigFault9ExceptionE(Fault cause: vim.fault.PlatformConfigFault

    YYYY-MM-DDTHH:MM:SS Db(167) Hostd[#####]: [Originator@6876 sub=Solo.Vmomi opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] Arg config:
    YYYY-MM-DDTHH:MM:SS Db(167) Hostd[#####]: --> (vim.host.PciPassthruConfig) [
    YYYY-MM-DDTHH:MM:SS Db(167) Hostd[#####]: -->    (vim.host.PciPassthruConfig) {
    YYYY-MM-DDTHH:MM:SS Db(167) Hostd[#####]: -->       id = "####:##:##:#",
    YYYY-MM-DDTHH:MM:SS Db(167) Hostd[#####]: -->       passthruEnabled = true,
    YYYY-MM-DDTHH:MM:SS Db(167) Hostd[#####]: -->    }
    YYYY-MM-DDTHH:MM:SS Db(167) Hostd[#####]: --> ]
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: [Originator@6876 sub=Solo.Vmomi opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] Throw vim.fault.PlatformConfigFault
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: [Originator@6876 sub=Solo.Vmomi opID=mj###ujq-20###481-auto-##r7-h5:720####46-82-f9-c##d sid=52####2d9 user=vpxuser:DOMAIN\USER] Result:
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: --> (vim.fault.PlatformConfigFault) {
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->    faultMessage = (vmodl.LocalizableMessage) [
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->       (vmodl.LocalizableMessage) {
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->          key = "com.vmware.esx.hostctl.default",
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->          arg = (vmodl.KeyAnyValue) [
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->             (vmodl.KeyAnyValue) {
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->                key = "reason",
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->                value = "GetDeviceID failed."
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->             }
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->          ],
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->       }
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->    ],
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->    text = "",
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: -->    msg = ""
    YYYY-MM-DDTHH:MM:SS In(166) Hostd[#####]: --> }
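To check quickly whether a host shows this signature, the log can be searched for the VmkCtl warning quoted above. The following is a minimal sketch, not a supported tool: the function name count_getdeviceid_failures is hypothetical, and the search string is taken from the sample log lines in this article.

```shell
# Hypothetical helper: count "GetDeviceID failed for" warnings in a hostd log.
# Defaults to the log path named in this article; pass another path as $1.
count_getdeviceid_failures() {
    log_file="${1:-/var/run/log/hostd.log}"
    # grep -c prints the number of matching lines (0 when none match).
    grep -c "GetDeviceID failed for" "$log_file"
}

# Usage on the affected host (not executed here):
#   count_getdeviceid_failures
#   count_getdeviceid_failures /var/run/log/hostd.log
```

A non-zero count only indicates that the warning was logged; compare the surrounding entries with the excerpt above before concluding that the host is affected.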

     

Cause

This issue occurs because certain enterprise server PCIe slots are incorrectly configured by firmware for hotplug support without having the necessary hardware for out-of-band presence detection. When vCenter attempts to initialize or reset the NVSwitch during passthrough configuration, the link reset triggers a false "device removed" hotplug event. This causes the host to lose track of the device ID during the configuration task, resulting in the "GetDeviceID failed" error.

Resolution

To resolve this issue, disable the native PCIe hotplug interrupt and adjust the passthrough mapping for NVIDIA Hopper-series GPUs.

  1. Disable PCIe Hot-Plug: Log in to the ESXi host via SSH and run the following command to disable PCIe hotplug globally:

     esxcli system settings kernel set -s "enablePCIEHotplug" -v "FALSE"

  2. Host Reboot: Reboot the ESXi host to apply the kernel settings and the new passthrough configuration.
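After the reboot, it can be useful to confirm that the kernel setting took effect. The sketch below assumes that the output of "esxcli system settings kernel list -o enablePCIEHotplug" is a table whose columns are Name, Type, Configured, Runtime, and Default, in that order; verify this layout on your build before relying on it. The function name hotplug_setting_state is hypothetical.

```shell
# Hypothetical helper: read the kernel-settings table on stdin and report
# the configured and runtime values for enablePCIEHotplug.
# Assumed column order: Name Type Configured Runtime Default.
hotplug_setting_state() {
    awk '$1 == "enablePCIEHotplug" { print "configured=" $3, "runtime=" $4 }'
}

# Usage on the ESXi host (not executed here):
#   esxcli system settings kernel list -o enablePCIEHotplug | hotplug_setting_state
```

If the runtime value still differs from the configured value, the host has most likely not been rebooted since the setting was changed.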