When managing VMs that use NVIDIA vGPUs through Private AI Foundation (PAIF), one or more of the following symptoms are observed:
dcli com vmware vcenter namespacemanagement supervisors zones bindings update --supervisor <supervisor UUID> --resource-allocation-vm-reservations '<reserved vm class configuration>' --zone <domain-c#>
messages:
- severity: INFO
details:
error reconciling Zone reservation; the failed operation will be retried: The cpu limit value specified in the config spec ('-1') is invalid
messages:
- severity: INFO
details:
error reconciling Zone reservation; the failed operation will be retried: Insufficient resources
vim.VirtualMachine.powerOn: :vim.fault.NoCompatibleHost
dcli com vmware vcenter namespacemanagement supervisors zones bindings list --supervisor <supervisor UUID>
In the vSphere web UI, the Cluster -> Monitor -> DirectPath Profiles Utilization page shows that all resources are prohibited and/or consumed, where the number of prohibited or consumed resources matches the total number of reservations available in the environment:
For example, the screenshot below shows all 4 resources as prohibited.
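When triaging, it can help to scan collected command output or logs for the error strings quoted above. The following is a minimal sketch, not a VMware tool: the function name, the labels, and the idea of saving output to text are all assumptions made for illustration, while the signature strings themselves are copied from the symptoms in this article.

```python
from __future__ import annotations

# Signature strings copied from the symptoms listed in this article.
# The labels on the left are hypothetical names chosen for this sketch.
SIGNATURES = {
    "invalid_cpu_limit": "The cpu limit value specified in the config spec ('-1') is invalid",
    "insufficient_resources": "the failed operation will be retried: Insufficient resources",
    "no_compatible_host": "vim.fault.NoCompatibleHost",
}


def find_symptoms(text: str) -> list[str]:
    """Return the labels of every known signature that appears in text."""
    # Collapse runs of whitespace so messages wrapped across lines still match.
    normalized = " ".join(text.split())
    return [label for label, needle in SIGNATURES.items() if needle in normalized]


if __name__ == "__main__":
    sample = (
        "messages:\n"
        "- severity: INFO\n"
        "details:\n"
        "error reconciling Zone reservation; the failed operation will be\n"
        "retried: Insufficient resources\n"
    )
    print(find_symptoms(sample))
```

A match only indicates that a symptom string is present; it does not by itself confirm the reservation discrepancy described in this article.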
VMware Private AI Foundation (PAIF)
VMware Cloud Foundation Automation (VCFA)
vSphere Supervisor 9
There is a discrepancy between the VMClass reservation recognized by the Supervisor and the actual reservations in use.
Follow the steps below to verify that there is a reservation discrepancy in the environment:
/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
VCDB=# select vpxv_vms.vmid, vpxv_vms.name, vpxv_vms.hostid, vpxv_hosts.name from vpxv_vms join vpxv_hosts on vpxv_vms.hostid = vpxv_hosts.hostid where vpxv_vms.name = 'vcsa';
VCDB=# select * from vpx_resource_profile_slot where resourcepool_id = <resource pool ID>;
Work with VMware by Broadcom Technical Support to correct the reservation discrepancy.
This involves editing the environment's database, which can be destructive and should only be performed with Technical Support.
Provide information on the following:
dcli com vmware vcenter namespacemanagement supervisors zones bindings list --supervisor <supervisor UUID>
There is a known issue in vCenter 9.0: when multiple custom VM classes are created with the same resource configuration, the system avoids duplication by pointing those VM classes to the same internal resource object. After a restart of VPXD, this pointer can be directed to the wrong VM resource profile, and vGPU VMs then fail to provision.
This known issue will be fixed in an upcoming version of vCenter 9.0, which will also include automatic cleanup of unused VM resource profiles.
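Because the known issue is triggered by custom VM classes that share one resource configuration, it may be useful to check an inventory of VM classes for such duplicates. The sketch below is illustrative only: the function name, the class names, and the configuration keys (`cpus`, `memory_mb`, `vgpu_profile`) are hypothetical and are not taken from any vCenter API. It assumes the VM class definitions have already been gathered into a plain dictionary by some other means.

```python
from collections import defaultdict


def find_duplicate_configs(vm_classes):
    """Group VM class names that share an identical resource configuration.

    vm_classes maps a class name to a dict of resource settings. The keys of
    that dict are hypothetical; values must be hashable (numbers or strings).
    """
    groups = defaultdict(list)
    for name, config in vm_classes.items():
        # A sorted tuple of items is hashable and independent of key order.
        groups[tuple(sorted(config.items()))].append(name)
    # Keep only configurations used by more than one class.
    return [sorted(names) for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    classes = {
        "gpu-small-a": {"cpus": 8, "memory_mb": 65536, "vgpu_profile": "profile-x"},
        "gpu-small-b": {"cpus": 8, "memory_mb": 65536, "vgpu_profile": "profile-x"},
        "gpu-large": {"cpus": 16, "memory_mb": 131072, "vgpu_profile": "profile-y"},
    }
    print(find_duplicate_configs(classes))
```

Any group reported this way identifies classes that could be affected by the shared-pointer behavior described above; it does not show whether the pointer is currently wrong.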