Unable to Create or Use Custom vmclass Reservations in Private AI Foundation (PAIF)

Article ID: 418632

Products

VMware vSphere Kubernetes Service

VCF Private AI Services

Issue/Introduction

When managing VMs that use NVIDIA vGPUs through Private AI Foundation (PAIF), one or more of the following symptoms are observed:

  • Editing, updating, or creating a custom VM class reservation for Private AI Foundation (PAIF), for example with the command below, returns one of the following errors:
    dcli com vmware vcenter namespacemanagement supervisors zones bindings update --supervisor <supervisor UUID> --resource-allocation-vm-reservations '<reserved vm class configuration>' --zone <domain-c#>
    
    messages:
       - severity: INFO
           details:
         error reconciling Zone reservation; the failed operation will be retried: The cpu limit value specified in the config spec ('-1') is invalid
    
    messages:
       - severity: INFO
           details:
         error reconciling Zone reservation; the failed operation will be retried: Insufficient resources

     

  • Powering on a VM that uses a reserved VM class for a GPU-enabled Kubernetes cluster fails. The vCenter logs, such as the VPXD logs, report the following error:
    vim.VirtualMachine.powerOn: :vim.fault.NoCompatibleHost

     

  • The dedicated Zone reports a READY state with available VMClass reservations in the output of the command below, but the VM still fails to power on with insufficient resources and NoCompatibleHost errors:
    dcli com vmware vcenter namespacemanagement supervisors zones bindings list --supervisor <supervisor UUID>

In the vSphere web UI, the Cluster -> Monitor -> DirectPath Profiles Utilization page shows all resources as prohibited and/or consumed: the number of prohibited or consumed resources matches the total number of reservations available in the environment.

For example, the page may show all 4 resources as prohibited.
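The condition described above can be written as a small check. This is a minimal Python sketch, not part of any VMware tooling: the function name and its parameters are hypothetical, and the counts are assumed to be read by hand from the DirectPath Profiles Utilization page.

```python
def reservations_exhausted(prohibited: int, consumed: int, total_reservations: int) -> bool:
    """Return True when prohibited plus consumed resources account for
    every reservation, which matches the symptom described above."""
    if min(prohibited, consumed, total_reservations) < 0:
        raise ValueError("counts must be non-negative")
    if total_reservations == 0:
        return False  # nothing is reserved, so nothing can be exhausted
    return prohibited + consumed >= total_reservations


# Example from the article: all 4 resources are prohibited.
print(reservations_exhausted(prohibited=4, consumed=0, total_reservations=4))  # prints True
```

A result of True only indicates that the symptom is present; it does not by itself confirm the cause.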

Environment

VMware Private AI Foundation (PAIF)

VMware Cloud Foundation Automation (VCFA)

vSphere Supervisor 9

Cause

There is a discrepancy between the VMClass reservations recognized by the Supervisor and the reservations actually in use.

Resolution

Follow the steps below to verify that there is a reservation discrepancy in the environment:

  1. Connect to the vCenter Server Database (VCDB):
    /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres

     

  2. Run the following query to look up a VM and the ESXi host it resides on, by host ID, according to the VCDB. The example filters on the VM name 'vcsa'; substitute the name of each virtual GPU VM to check:
    VCDB=# select vpxv_vms.vmid, vpxv_vms.name, vpxv_vms.hostid, vpxv_hosts.name from vpxv_vms join vpxv_hosts on vpxv_vms.hostid = vpxv_hosts.hostid where vpxv_vms.name = 'vcsa';

     

  3. For any resource pool(s) associated with the affected Zone, check actual reservation usage:
    VCDB=# select * from vpx_resource_profile_slot where resourcepool_id = <resource pool ID>;
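Once both sides have been collected, the comparison itself is straightforward. The following Python sketch is a hypothetical helper, not a VMware tool. It assumes that you have summarized by hand, per VM class, the number of reservations the Supervisor reports and the number of slots found in vpx_resource_profile_slot. The dictionary layout, the class names, and the function name are illustrative assumptions, not a documented schema.

```python
def find_reservation_discrepancies(supervisor_counts, database_counts):
    """Return {vm_class: (supervisor, database)} for every class whose
    counts differ. A class missing on one side is counted as 0."""
    discrepancies = {}
    for vm_class in sorted(set(supervisor_counts) | set(database_counts)):
        reported = supervisor_counts.get(vm_class, 0)
        in_database = database_counts.get(vm_class, 0)
        if reported != in_database:
            discrepancies[vm_class] = (reported, in_database)
    return discrepancies


# Hypothetical, hand-collected numbers for illustration only.
supervisor = {"gpu-class-a": 4, "gpu-class-b": 2}
database = {"gpu-class-a": 4, "gpu-class-b": 3, "gpu-class-c": 1}
print(find_reservation_discrepancies(supervisor, database))
```

The helper only reads values you supply; it does not connect to or modify the VCDB.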

 

Work with VMware by Broadcom Technical Support to correct the reservation discrepancy.

This involves editing the environment's database, which can be destructive and should only be performed with Technical Support.

Provide the following information to Technical Support:

  • A list of all DirectPath Profiles (DPPs) and their types in the environment

  • All reservation vmclasses in the environment

  • All ESXi hosts configured to use virtual GPUs in the environment

  • The affected VM UID and ESXi host UIDs

  • The IDs of all resource pools associated with the affected Zone

  • The output of the following command when run from the vCenter Server Appliance (VCSA):
    dcli com vmware vcenter namespacemanagement supervisors zones bindings list --supervisor <supervisor UUID>


  • The status of each DirectPathProfile (DPP) from the DirectPath Profile Chart:
    • Gray ("Prohibited"): the associated resources are prohibited or constrained. This may indicate that the DPP has reserved existing resources or that other GPU virtual devices are using the corresponding physical resources.

    • Orange ("Consumed"): the virtual GPU resources for this DPP are currently in use.

    • Blue ("Remaining"): virtual GPU resources for this DPP are available for reservation.
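To keep track of what has been gathered before opening a support case, the list above can be treated as a checklist. The sketch below is a hypothetical Python aid: the item labels paraphrase the list above, and the function name and the dictionary of collected notes are assumptions made for illustration.

```python
REQUIRED_ITEMS = [
    "DirectPath Profiles and their types",
    "Reservation vmclasses",
    "ESXi hosts configured for virtual GPUs",
    "Affected VM UID and ESXi host UIDs",
    "Resource pool IDs for the affected Zone",
    "Output of the dcli zones bindings list command",
    "Status of each DirectPath Profile",
]


def missing_items(collected):
    """Return the required items that are absent or empty in `collected`,
    a dict mapping an item label to the text gathered for it."""
    missing = []
    for item in REQUIRED_ITEMS:
        value = collected.get(item)
        if value is None or not str(value).strip():
            missing.append(item)
    return missing


# Hypothetical, partially filled checklist.
collected = {
    "Reservation vmclasses": "<list of vmclasses>",
    "Resource pool IDs for the affected Zone": "<resource pool IDs>",
}
for item in missing_items(collected):
    print("Still needed:", item)
```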

Additional Information

There is a known issue in vCenter 9.0: when multiple custom vmclasses are created with the same resource configuration, the system avoids duplication by pointing them to the same internal resource object. After a restart of VPXD, this pointer can be directed to the wrong VM resource profile, and vGPU VMs then fail to provision.

To avoid or recover from this issue:

  1. Do not create multiple custom vmclasses that have the same configuration/spec.

  2. Reach out to VMware by Broadcom Technical Support for assistance on cleaning up duplicate vm resource profiles for custom vmclasses.
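One way to spot custom vmclasses that share a configuration before they cause problems is to group them by spec. The following is a minimal Python sketch under stated assumptions: the class names and spec fields (cpus, memory_gb, vgpu_profile) are hypothetical, and the specs are assumed to have been exported by hand into a dictionary. It does not query vCenter.

```python
import json
from collections import defaultdict


def find_duplicate_specs(vm_classes):
    """Group VM class names by their spec and return only the groups
    that contain more than one name."""
    groups = defaultdict(list)
    for name, spec in vm_classes.items():
        # Serialize with sorted keys so key order does not affect equality.
        fingerprint = json.dumps(spec, sort_keys=True)
        groups[fingerprint].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical custom vmclasses; field names are illustrative only.
vm_classes = {
    "gpu-small": {"cpus": 8, "memory_gb": 64, "vgpu_profile": "<profile A>"},
    "gpu-small-copy": {"memory_gb": 64, "cpus": 8, "vgpu_profile": "<profile A>"},
    "gpu-large": {"cpus": 16, "memory_gb": 128, "vgpu_profile": "<profile B>"},
}
print(find_duplicate_specs(vm_classes))  # prints [['gpu-small', 'gpu-small-copy']]
```

Any group it reports is a candidate for consolidation; cleaning up existing duplicate VM resource profiles still requires Technical Support.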

This known issue will be fixed in an upcoming version of vCenter 9.0, which will also include automatic cleanup of unused VM resource profiles.