VCF 9.x Private AI Foundation (PAIF) deployment fails to apply Supervisor Namespace Class configuration

Article ID: 429262

Updated On:

Products

VCF Private AI Services

Issue/Introduction

  • During the deployment of VMware Private AI Foundation (PAIF) with NVIDIA on VMware Cloud Foundation (VCF) 9.x, users may encounter a failure when creating or applying a custom SupervisorNamespaceClass and SupervisorNamespaceClassConfig.

  • In particular, applying the SupervisorNamespaceClassConfig fails with the following error:

    Result:
    {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {},
      "status": "Failure",
      "message": "supervisornamespaceclassconfig.infrastructure.cci.vmware.com is forbidden: User \"<user-name>\" cannot create resource \"supervisornamespaceclassconfig\" in API group \"infrastructure.cci.vmware.com\" at the cluster scope",
      "reason": "Forbidden",
      "details": {
        "group": "infrastructure.cci.vmware.com",
        "kind": "supervisornamespaceclassconfig"
      },
      "code": 403
    }

  • As a result, applying the configuration via the API fails with a permissions error, and the system cannot resolve the vGPU VM classes to the designated vSphere Zone.
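
For scripted deployments, a response like the one above can be recognized programmatically. The following is a minimal sketch, assuming the API returns a Kubernetes-style Status object with the fields shown in the error snippet. The function name describe_forbidden and the shortened sample message are hypothetical and not part of any VMware tooling.

```python
import json
from typing import Optional


def describe_forbidden(response_body: str) -> Optional[str]:
    """Summarize a 403 Forbidden Status response, or return None for anything else."""
    try:
        status = json.loads(response_body)
    except json.JSONDecodeError:
        return None
    if not isinstance(status, dict):
        return None
    if status.get("kind") != "Status" or status.get("code") != 403:
        return None
    details = status.get("details") or {}
    kind = details.get("kind", "unknown")
    group = details.get("group", "unknown")
    return f"Forbidden: cannot create {kind} in API group {group}"


# Sample body mirroring the fields of the error shown above (message shortened).
sample_body = json.dumps({
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {},
    "status": "Failure",
    "message": "supervisornamespaceclassconfig.infrastructure.cci.vmware.com is forbidden",
    "reason": "Forbidden",
    "details": {
        "group": "infrastructure.cci.vmware.com",
        "kind": "supervisornamespaceclassconfig",
    },
    "code": 403,
})

print(describe_forbidden(sample_body))
```

A non-None result indicates the failure described in this article, in which case the vSphere Zone name is worth checking before the request is retried.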

Environment

VMware Cloud Foundation 9.x
VMware Private AI Foundation

Cause

The Supervisor service cannot locate the specified VM classes even though they are defined in the region. The primary cause is a configuration mismatch in the vSphere Zone binding. In VCF 9.x PAIF deployments, the workflow expects the vSphere Zone name to match the Cluster-ID of the workload cluster. If the zone was explicitly given a name other than the Cluster-ID, the Supervisor cannot reconcile the VM class reservations with the physical compute resources, and the API rejects the request at the cluster scope.
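
The mismatch described above can be detected from a JSON listing of the zones. The sketch below is illustrative only: it assumes the zone list was exported with kubectl's standard `-o json` output option, which wraps objects in an `items` array, and that the expected Cluster-ID is already known. The function name, the zone name `gpu-zone-1`, and the Cluster-ID `domain-c10` are hypothetical placeholders.

```python
import json
from typing import List


def mismatched_zone_names(kubectl_json: str, cluster_id: str) -> List[str]:
    """Return the names of zones that differ from the expected Cluster-ID."""
    listing = json.loads(kubectl_json)
    names = [item["metadata"]["name"] for item in listing.get("items", [])]
    return [name for name in names if name != cluster_id]


# Hypothetical JSON listing containing one zone with a custom name.
sample_listing = json.dumps({
    "apiVersion": "v1",
    "kind": "List",
    "items": [{"metadata": {"name": "gpu-zone-1"}}],
})

# A non-empty result means at least one zone carries a custom name.
print(mismatched_zone_names(sample_listing, "domain-c10"))
```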

Resolution

The vSphere Zone must be configured as a single zone whose name is left to default to the Cluster-ID.

  1. Validate the zone naming: log in to the Supervisor cluster and check the existing zones by running the command below.

    kubectl get vspherezones.topology.tanzu.vmware.com -A

  2. If the output shows a custom name, it must be corrected. Reconfigure the vSphere Zone and ensure that its name matches the Cluster-ID.

  3. Next, correct the SupervisorNamespaceClassConfig. Ensure that the zones section of your JSON payload references the correct Cluster-ID.

    "zones": [
        {
            "cpuLimit": "",
            "cpuReservation": "",
            "memoryReservation": "",
            "name": "<CLUSTER_ID>", 
            "vmClassReservation": [
                {
                    "count": 1,
                    "vmClassName": "<vm-class-name>"
                }
            ]
        }
    ]

  4. Retry the POST request to the CCI endpoint.
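
Before retrying the request, the zones section of the payload can be checked locally. This is a minimal sketch, assuming the zones list has the shape shown in step 3. It takes the list itself as input, because the article shows only that fragment of the payload. The function name, the Cluster-ID `domain-c10`, and the VM class name `example-vgpu-class` are hypothetical placeholders.

```python
from typing import Any, Dict, List


def zone_name_problems(zones: List[Dict[str, Any]], cluster_id: str) -> List[str]:
    """List every zone entry whose name does not equal the expected Cluster-ID."""
    problems = []
    for index, zone in enumerate(zones):
        name = zone.get("name")
        if name != cluster_id:
            problems.append(f"zones[{index}].name is {name!r}, expected {cluster_id!r}")
    return problems


# Zones fragment in the same shape as the payload excerpt in step 3.
zones = [
    {
        "cpuLimit": "",
        "cpuReservation": "",
        "memoryReservation": "",
        "name": "domain-c10",
        "vmClassReservation": [
            {"count": 1, "vmClassName": "example-vgpu-class"},
        ],
    }
]

# An empty list means every zone entry references the expected Cluster-ID.
print(zone_name_problems(zones, "domain-c10"))
```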

Additional Information

All the recommendations above are in line with the official technical documentation: Configure vGPU-Based VM Classes for AI Workloads for VMware Private AI Foundation with NVIDIA.