VM Fails to Power On and NVSWITCH Device Disappears from Host After PCI Passthrough Attempt

Article ID: 391724

Updated On: 06-10-2025

Products

VMware vSphere ESXi

Issue/Introduction

When attempting to pass through NVSWITCH PCI devices to a virtual machine, the VM may consistently fail to power on, and the NVSWITCH device may disappear from the ESXi host's PCI device list after each failed attempt. This forces administrators to reboot the host to restore visibility of the NVSWITCH device. The issue particularly impacts environments where NVIDIA GPUs and NVSWITCH devices are used together for high-performance computing applications.

Typically, after each VM power-on attempt, users observe the following (a quick device-visibility check is sketched after this list):
- The VM fails to power on with a "Module DevicePowerOn power on failed" error
- The NVSWITCH device that was assigned for passthrough is no longer visible in the host's hardware list
- Subsequent attempts to power on the VM fail with the same pattern, with more devices potentially disappearing
- A host reboot is required to restore visibility of the NVSWITCH devices
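
A quick way to check device visibility without rebooting is to query the host's PCI inventory from an SSH session. This is a minimal sketch; the grep filter is illustrative and assumes the devices report NVIDIA as the vendor string:

       # List the PCI devices the host currently sees and filter for NVIDIA entries
       lspci | grep -i nvidia

       # Cross-check against the detailed PCI inventory maintained by the VMkernel
       esxcli hardware pci list

If NVSWITCH entries that were present before the power-on attempt are missing from this output, the host has processed an unexpected hotplug removal (see Cause below).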

Other errors and log messages may include (a log search sketch follows this list):

  • "PCIPassthru: Failed to get NumaNode for sbdf [device ID]"
  • "PCIPassthru: Selected device [device ID] is outside of the NUMA configuration"
  • "Failed to find a suitable device for pciPassthru0"

Environment

  • ESXi 8.0 and later
  • NVIDIA GPUs (H100, H200 or similar models) configured for passthrough
  • NVIDIA NVSwitch devices (if present)
  • Systems with hot-plug capable PCIe slots (particularly common in enterprise servers with PCIe switches)

Cause

The issue occurs because some PCIe slots are incorrectly configured for hotplug by the firmware, despite not having proper hardware support for out-of-band device presence detection. Instead, they use in-band presence detection, which is not officially supported by VMware or recommended by PCI-SIG.

During VM power-on, the PCI passthrough process attempts to reset the device, which triggers a link reset. When the link goes down during this reset, the hotplug-enabled slot incorrectly reports that the device has been removed. This results in a hotplug event on the host, causing the device to be removed from the system unexpectedly.

Resolution

The following workaround resolves this issue until a permanent fix is available in a future ESXi release:

Disable PCIe hotplug system-wide:

  1. Connect to your ESXi host via SSH

  2. Run the following command:
       esxcli system settings kernel set -s "enablePCIEHotplug" -v "FALSE"
  3. Reboot the ESXi host for the changes to take effect

  4. After the host reboots, attempt to power on the VM with the passthrough devices again

Note: This workaround will disable hot-plug capability for all PCIe devices on the host, meaning you won't be able to add or remove PCIe devices without rebooting the host. However, for environments affected by this issue, this limitation is typically acceptable as it resolves the critical problem with NVSWITCH devices disappearing.
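
To confirm that the kernel option was applied after the reboot, the same esxcli namespace can be queried. A minimal verification sketch:

       # Show the configured and runtime values of the PCIe hotplug kernel option
       esxcli system settings kernel list -o enablePCIEHotplug

The Configured and Runtime values should both report FALSE once the change has taken effect.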

Additional Information

NVSwitch and NVLink Passthrough Requirements

Critical Requirement: When using GPU passthrough mode (DirectPath I/O), ALL GPUs that are physically connected to each other through NVLink must be assigned to the same VM, regardless of whether you intend to use NVLink functionality or not. This is a hardware and driver limitation from NVIDIA, not a VMware restriction.

Why This Requirement Exists: If only a subset of NVLink-connected GPUs is passed through to a VM, an unrecoverable XID 74 error occurs when the VM boots. This error corrupts the NVLink state on the physical GPUs and renders the NVLink bridge unusable, requiring a GPU reset or host reboot to restore functionality.

For Systems with NVSwitch Technology:

  • vSphere 8 Update 1 and later supports up to 8 GPUs per host connected via NVSwitch
  • Up to 8 GPUs can be assigned to the same VM for high-bandwidth communication (up to 900GB/s bidirectional with H100 GPUs)
  • Device Groups automatically present sets of NVLink-connected GPUs as unified units for easier management
  • Example: In a configuration with 8 H100 GPUs and 4 NVSwitches, all 8 GPUs must be passed through to a single VM in passthrough mode, even if you only intend to use them as independent compute devices

Alternative Configuration Options

VMware supports GPU passthrough as documented in this KB article. For other GPU virtualization methods and their specific configuration requirements, please refer to NVIDIA's official documentation and support channels, as these configurations fall under NVIDIA's support domain.

NVLink Topology and Configuration Guidance

Before configuring GPU passthrough:

  • Use nvidia-smi topo -m on the ESXi host to view the NVLink topology (see the sketch after this list)
  • Verify which GPUs are interconnected via NVLink/NVSwitch
  • Ensure the NVSwitch and GPU configuration matches your workload requirements
  • Check BIOS settings that may affect PCIe slot configuration and NVLink connectivity
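
A minimal sketch of the topology check above, assuming the NVIDIA host driver (vGPU Manager VIB) is installed so that nvidia-smi is available in the ESXi shell and the GPUs are not yet claimed for passthrough; otherwise the same command can be run inside a guest that already owns all of the GPUs:

       # Display the GPU-to-GPU connection matrix, including NVLink/NVSwitch links
       nvidia-smi topo -m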

Required VM Configuration Parameters

When using NVSWITCH devices for NVLink functionality, add these advanced parameters to your VM configuration:

  • pciPassthru.allowP2P = "TRUE"
  • pciPassthru.use64bitMMIO = "TRUE"
  • pciPassthru.64bitMMIOSizeGB = "256" (increase if needed for multiple high-memory GPUs)

For H200 GPUs with large video memory (141GB each) and NVSwitches, you may need to set a larger MMIO space. For example, with 8 H200 GPUs (141GB × 8 = 1,128GB), set pciPassthru.64bitMMIOSizeGB = "2048" (next power of 2).
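
As an illustration, the advanced settings for a hypothetical VM with 8 H200 GPUs and their associated NVSwitch devices in passthrough mode might look like the following .vmx excerpt; the 2048 value follows the sizing rule above, and the per-device pciPassthruN entries created by the vSphere Client when the devices are added are omitted here:

       pciPassthru.allowP2P = "TRUE"
       pciPassthru.use64bitMMIO = "TRUE"
       pciPassthru.64bitMMIOSizeGB = "2048"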

Troubleshooting

If problems persist after implementing the hotplug workaround:

  • Verify that all NVLink-connected GPUs are assigned to the same VM
  • Check BIOS settings to disable hot-plug capability for the PCIe slots in question
  • Consider switching to vGPU mode rather than direct PCI passthrough for better stability
  • Use nvidia-smi nvlink --status to verify NVLink connectivity status (a minimal in-guest check is sketched below)
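
For the last check, a minimal in-guest sketch, assuming the NVIDIA driver is installed inside the guest operating system after all NVLink-connected GPUs have been passed through:

       # Inside the VM: confirm all passed-through GPUs are visible
       nvidia-smi

       # Report per-link NVLink status for each GPU
       nvidia-smi nvlink --status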