In vSphere 7.x and 8.x environments, virtual machines configured with multiple vGPUs using the nvidia_h100xm-80c profile may fail to power on with the following error:
Multiple vGPU devices must be within a device group on a NVSwitch system.
This occurs when assigning more than one full-size time-sliced vGPU (e.g., nvidia_h100xm-80c) to a single VM without using a supported NVLink-enabled device group.
Example error messages in vmware.log:
Power on failure messages: Multiple vGPU devices must be within a device group on a NVSwitch system.
Module 'PCIPluginLate' power on failed.
Failed to start the virtual machine.
Example syslog messages:
logrotate[xxxxx]: GPU manager activated FM partition id 7 for GPU 0000:0a:00.0
logrotate[xxxxx]: GPU manager activated FM partition id 8 for GPU 0000:18:00.0
...
GPU manager deactivate FM partition id 8 failed: -4
...
NodeId 0 partition id 8 is deactivated.
This issue is caused by attempting to assign multiple individual full-size vGPUs (nvidia_h100xm-80c) to a single VM without grouping them into an NVSwitch-enabled device group.
The nvidia_h100xm-80c profile is a full-size time-sliced vGPU profile that supports NVLink. When two or more such devices are attached to a VM, they are expected to be within a single device group so that NVIDIA Fabric Manager can place them in a common NVLink fabric partition. If they are not grouped, the system fails to validate the topology and the VM power-on fails.
Option 1: Use NVSwitch-enabled Device Groups
To successfully assign multiple full-size vGPUs with NVLink support, configure the VM using an H100 device group (e.g., 2x, 4x, or 8x profiles) instead of assigning individual vGPUs.
Device groups ensure the assigned vGPU devices share a common NVLink fabric partition, enabling high-bandwidth peer communication.
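As an optional sanity check (not required for the configuration itself), once a VM backed by a device group has powered on and the NVIDIA guest driver is installed, you can confirm NVLink connectivity between the assigned vGPUs from inside the guest with standard NVIDIA tooling, for example:
nvidia-smi topo -m
nvidia-smi nvlink --status
The first command prints the GPU connectivity matrix (NVLink links should appear between the devices); the second reports the state of each NVLink link. Exact output varies by driver release.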
Option 2: Disable Device Group Requirement (If NVLink is not needed)
If NVLink is not required for the VM workload, and you are using vGPU driver version 17.0 or newer, you can bypass the device group requirement by adding the following advanced setting to the VM’s .vmx configuration:
vmiop.nvswitch.deviceGroupRequired = "FALSE"
This allows each vGPU device to be placed in a separate Fabric Manager partition, disabling NVLink between them.
Steps:
1. Power off the virtual machine.
2. In the vSphere Client, right-click the VM and select Edit Settings > VM Options > Advanced > Edit Configuration Parameters.
3. Add the parameter vmiop.nvswitch.deviceGroupRequired with the value FALSE (or append the line above directly to the VM's .vmx file).
4. Power on the VM.
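For illustration, a VM that keeps the two individually assigned vGPUs from the reference configuration below and applies this override would contain .vmx entries similar to the following (the pgpu identifiers are host-specific placeholders):
pciPassthru0.vgpu = "nvidia_h100xm-80c"
pciPassthru0.pgpu = "<host-specific pgpu ID>"
pciPassthru1.vgpu = "nvidia_h100xm-80c"
pciPassthru1.pgpu = "<host-specific pgpu ID>"
vmiop.nvswitch.deviceGroupRequired = "FALSE"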
MIG vs Time-Sliced Profiles:
The key difference between nvidia_h100xm_7-80c (MIG profile) and nvidia_h100xm-80c (time-sliced profile) is that only the full-size time-sliced profile supports NVLink connections; MIG profiles do not.
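If it is unclear which profile family a physical GPU is currently offering, the host's NVIDIA tools can help; for example (exact output format varies by vGPU software release, and MIG mode must be disabled on the GPU for the full-size time-sliced profile to be available):
nvidia-smi vgpu -s
nvidia-smi --query-gpu=name,mig.mode.current --format=csv
The first command lists the vGPU types each GPU supports; the second shows whether MIG mode is currently enabled or disabled.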
Device Grouping Requirement:
Multiple vGPUs with NVLink support must belong to a single device group on an NVSwitch-enabled system. Failing to do so results in GPU manager partition failures and power-on errors.
Reference Configuration (Failing Example):
pciPassthru0.vgpu = "nvidia_h100xm-80c"
pciPassthru0.pgpu = "233016C100000003"
pciPassthru1.vgpu = "nvidia_h100xm-80c"
pciPassthru1.pgpu = "233016C100000003"
Log Evidence:
The syslog output above shows that the VM activates two separate Fabric Manager partitions (id 7 and id 8), one per vGPU, instead of the single shared partition a device group would provide; this topology violation causes the power-on failure.