Unable to create or modify Supervisor Namespaces when Reserved GPU VM Class Slots are exhausted

Article ID: 435591

Products

VCF Private AI Services

Issue/Introduction

In a Workload Control Plane (WCP) environment with GPU-reserved Namespaces, you may encounter the following two problems:

  • New Namespace creation fails with the error:

    'Requested reserved VM Class count 2 is more than available reserved VM Class count 0'

    This occurs when all reserved GPU VM Class Slots in a zone are already allocated to an existing Namespace, leaving no slots available for a new Namespace.

  • Namespace Class downsizing (modification) fails after the Namespace has already been created. Attempting to edit the 'SupervisorNamespaceClassConfig' object results in:

    'supervisornamespaceclassconfigs.infrastructure.cci.vmware.com "<name>" is invalid'

    Even when you attempt to reduce the resource requirements (e.g., reduce 'guaranteed-2xlarge4gpu-h200' VMClass count from 2 to 1), the modification is rejected.

Note: This article applies to environments running VCF 9.0 with GPU-reserved Supervisor Namespaces managed via CCI (Cloud Consumption Interface).

Environment

  • VCF Private AI Services

Cause

  • New Namespace creation failure:

    This error occurs because all reserved GPU VM Class slots in the backing cluster are already allocated to an existing Namespace. Slots that are unused (not linked to any running VM) remain reserved for that Namespace and are unavailable to other Namespaces. The current error message does not specify which VM Class is exhausted or which zone or cluster is affected, which makes diagnosis difficult.

  • Namespace Class modification failure:

    Modifying a Namespace Class after a Namespace has been created from it is not supported in the current release (VCF 9.0). The WCP/CCI layer rejects any modification to a Namespace Class that is already in use, including a reduction in resources. Even if the request reached the DRS layer, it would still fail: DRS does not support runtime modification of GPU Class reservations after a Namespace has been created, and returns a `NotFailure` exception to the WCP caller.
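
The slot accounting described above can be sketched as a small model. This is an illustration only, not the actual WCP or DRS implementation: the `ZonePool` class, its `reserve` method, the Namespace names, and the pool size of 2 are all assumptions made for the example. Only the error text is taken from this article.

```python
from dataclasses import dataclass, field


class ReservationError(Exception):
    """Raised when a request exceeds the slots still available in the pool."""


@dataclass
class ZonePool:
    """Hypothetical model of the reserved slots for one VM Class in one zone."""

    vm_class: str
    total_slots: int  # assumed pool size, not a value from the article
    reserved: dict = field(default_factory=dict)  # namespace -> reserved count

    def available(self) -> int:
        # Reserved slots count against the pool whether or not a VM uses them.
        return self.total_slots - sum(self.reserved.values())

    def reserve(self, namespace: str, count: int) -> None:
        free = self.available()
        if count > free:
            raise ReservationError(
                f"Requested reserved VM Class count {count} is more than "
                f"available reserved VM Class count {free}"
            )
        self.reserved[namespace] = self.reserved.get(namespace, 0) + count


pool = ZonePool(vm_class="guaranteed-2xlarge4gpu-h200", total_slots=2)
pool.reserve("ns-a", 2)  # ns-a holds both slots, even if it runs no VMs

message = ""
try:
    pool.reserve("ns-b", 2)  # no slots are left for a second Namespace
except ReservationError as err:
    message = str(err)
    print(message)
```

In this model the second request fails even though `ns-a` may not be running any VM, which mirrors why an idle but oversized Namespace blocks new Namespace creation.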

Resolution

Since Namespace Class modification is not supported after a Namespace has been created, the recommended workaround is to recreate the Namespace with a properly sized Namespace Class.

Note: The steps below require powering off all running VMs in the affected Namespaces, so plan a maintenance window accordingly.

Steps:

  1. Power off all VMs running inside the affected Namespace(s).
  2. Delete the existing Namespace(s) that were created from the oversized Namespace Class.
  3. Create a new Namespace Class with the desired (downsized) resource specification. For example, reduce the 'guaranteed-2xlarge4gpu-h200' VM Class count from 2 to 1.

    Note: Modifying an existing Namespace Class is not supported; a new Namespace Class must be created.

  4. Recreate the Namespace(s) using the new, correctly sized Namespace Class.
  5. Confirm the result: the GPU slots freed by the smaller reservation are now available to other Namespaces and can be distributed across multiple Namespaces as intended.
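
The effect of the resizing in the steps above can be checked with simple arithmetic. The sketch below assumes a pool of 2 reserved slots for the VM Class; that number and the helper name `namespaces_that_fit` are illustrative and do not come from any product API.

```python
def namespaces_that_fit(total_slots: int, per_namespace: int) -> int:
    """Return how many Namespaces a pool can hold at a given reservation size."""
    if per_namespace <= 0:
        raise ValueError("per_namespace must be positive")
    return total_slots // per_namespace


TOTAL_SLOTS = 2  # assumed pool size for 'guaranteed-2xlarge4gpu-h200'

# Original Namespace Class: each Namespace reserves 2 slots.
before = namespaces_that_fit(TOTAL_SLOTS, per_namespace=2)

# New Namespace Class: each Namespace reserves 1 slot.
after = namespaces_that_fit(TOTAL_SLOTS, per_namespace=1)

print(f"Namespaces supported before: {before}, after: {after}")
```

Running the same calculation with the real slot count for the zone gives a quick estimate of how small the new Namespace Class must be before the Namespaces are recreated.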