VM Fails to Power On with Multi vGPU Profile nvidia_h100xm-80c – Error: “Multiple vGPU devices must be within a device group on a NVSwitch system”

Article ID: 398372

Products

VMware vSphere ESXi

Issue/Introduction

In vSphere 7.x and 8.x environments, virtual machines configured with multiple vGPUs using the nvidia_h100xm-80c profile may fail to power on with the following error:
Multiple vGPU devices must be within a device group on a NVSwitch system.

This occurs when assigning more than one full-size time-sliced vGPU (e.g., nvidia_h100xm-80c) to a single VM without using a supported NVLink-enabled device group.

Example Error in vmware.log:
Power on failure messages: Multiple vGPU devices must be within a device group on a NVSwitch system.
Module 'PCIPluginLate' power on failed.
Failed to start the virtual machine.

Example syslog messages:
logrotate[xxxxx]: GPU manager activated FM partition id 7 for GPU 0000:0a:00.0
logrotate[xxxxx]: GPU manager activated FM partition id 8 for GPU 0000:18:00.0
...
GPU manager deactivate FM partition id 8 failed: -4
...
NodeId 0 partition id 8 is deactivated.
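
These messages can be confirmed directly on an affected host. The commands below are a minimal sketch, assuming SSH access to the ESXi host; the datastore and VM directory names are placeholders to be replaced with actual values:

# Search the VM's vmware.log for the device-group power-on failure
grep -i "device group" /vmfs/volumes/<datastore>/<vm-name>/vmware.log

# Search the host syslog for GPU manager / Fabric Manager partition activity
grep -i "FM partition" /var/log/syslog.log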

Environment

  •     VMware vSphere 7.x
  •     VMware vSphere 8.x
  •     NVIDIA H100 with NVSwitch topology
  •     VMs configured with multiple vGPUs (time-sliced profiles)

Cause

This issue is caused by attempting to assign multiple individual full-size vGPUs (nvidia_h100xm-80c) to a single VM without grouping them into an NVSwitch-enabled device group.

nvidia_h100xm-80c is a full-size time-sliced vGPU profile that supports NVLink. When two or more such devices are attached to a VM, they are expected to be within a device group so that NVLink communication can be established through NVIDIA Fabric Manager. If the devices are not grouped, topology validation fails and the VM does not power on.

Resolution

Option 1: Use NVSwitch-enabled Device Groups

To successfully assign multiple full-size vGPUs with NVLink support, configure the VM using an H100 device group (e.g., 2x, 4x, or 8x profiles) instead of assigning individual vGPUs.

    Device groups ensure the assigned vGPU devices share a common NVLink fabric partition, enabling high-bandwidth peer communication.
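
Before configuring the device group, it can help to confirm that the host sees all of the physical GPUs behind the NVSwitch fabric. The following is a minimal sketch, assuming SSH access to the ESXi host and that the NVIDIA vGPU host driver is installed; output will vary by host:

# List the GPUs known to the ESXi graphics subsystem
esxcli graphics device list

# Confirm the NVIDIA host driver detects all H100 GPUs
nvidia-smi

# Print detailed per-GPU information for further troubleshooting
nvidia-smi -q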

Option 2: Disable Device Group Requirement (If NVLink is not needed)

If NVLink is not required for the VM workload, and you are using vGPU driver version 17.0 or newer, you can bypass the device group requirement by adding the following advanced setting to the VM’s .vmx configuration:

vmiop.nvswitch.deviceGroupRequired = "FALSE"

    This allows each vGPU device to be placed in its own Fabric Manager partition, disabling NVLink between the devices.

Steps:

  •     Power off the VM.
  •     Edit the VM’s .vmx file or use advanced configuration via the vSphere UI (an ESXi shell alternative is sketched after these steps).
  •     Add the above key-value pair.
  •     Save and power on the VM.
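
As an alternative to the vSphere UI, the same change can be made from the ESXi shell. This is a minimal sketch, assuming SSH access to the host and a powered-off VM; the datastore, VM name, and VM ID below are placeholders:

# Find the VM ID and the path to its .vmx file
vim-cmd vmsvc/getallvms | grep <vm-name>

# Append the advanced setting to the VM's .vmx file
echo 'vmiop.nvswitch.deviceGroupRequired = "FALSE"' >> /vmfs/volumes/<datastore>/<vm-name>/<vm-name>.vmx

# Reload the VM configuration so the new setting is picked up, then power on
vim-cmd vmsvc/reload <vmid>
vim-cmd vmsvc/power.on <vmid>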

Additional Information

MIG vs Time-Sliced Profiles:
    The key difference between nvidia_h100xm_7-80c (MIG profile) and nvidia_h100xm-80c (time-sliced profile) is that only the full-size time-sliced profile supports NVLink connections. MIG profiles do not.

Device Grouping Requirement:
    Multiple vGPUs with NVLink support must belong to a single device group on an NVSwitch-enabled system. Failing to do so results in GPU manager partition failures and power-on errors.

Reference Configuration (Failing Example):

pciPassthru0.vgpu = "nvidia_h100xm-80c"
pciPassthru0.pgpu = "233016C100000003"
pciPassthru1.vgpu = "nvidia_h100xm-80c"
pciPassthru1.pgpu = "233016C100000003"

Log Evidence:
    The log shows that the VM creates two Fabric Manager partitions (ID 7 and ID 8) for the two vGPUs, which violates the expected single-group behavior and causes the failure.