Nvidia vGPU operator fails to install successfully on Kubernetes clusters with vGPU functionality activated in Cloud Director Container Service Extension

Article ID: 325549

Updated On:

Products

VMware Cloud Director

Issue/Introduction

Symptoms:
  • A Kubernetes cluster has been created using Cloud Director Container Service Extension and the Activate GPU option has been enabled.
  • Attempting Nvidia vGPU operator installation on the Kubernetes cluster fails.
  • When reviewing the status of the pods in the gpu-operator namespace on the Kubernetes cluster, many of the pods fail to reach a healthy Running state, as seen when verifying with a command such as:
kubectl get pods -n gpu-operator
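As a quick way to isolate the failing pods, the output of the command above can be filtered for anything not in a Running or Completed state. The sample output below is hypothetical and for illustration only; pod names and states on an affected cluster will differ.

```shell
# Hypothetical sample of `kubectl get pods -n gpu-operator` output on an
# affected cluster (illustrative only; real names and states will vary).
sample='NAME                                 READY   STATUS                  RESTARTS   AGE
gpu-operator-abc                     1/1     Running                 0          5m
nvidia-driver-daemonset-xyz          0/1     Init:CrashLoopBackOff   4          5m
nvidia-device-plugin-daemonset-123   0/1     Init:0/1                0          5m'

# Keep only pods whose STATUS column is not Running or Completed.
# Against a live cluster, pipe `kubectl get pods -n gpu-operator --no-headers`
# into the same awk filter instead of using the sample text.
unhealthy=$(echo "$sample" | awk 'NR>1 && $3 != "Running" && $3 != "Completed" {print $1, $3}')
echo "$unhealthy"
```

For any pod listed, `kubectl describe pod <name> -n gpu-operator` shows the events explaining why it is stuck.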


Environment

VMware Cloud Director 10.x

Cause

This is a known issue in Cluster API Provider for VMware Cloud Director (CAPVCD) v1.1.0 and earlier when creating Kubernetes clusters in Container Service Extension.

Resolution

To resolve this issue, use Cluster API Provider for VMware Cloud Director (CAPVCD) v1.1.1 or later when creating Kubernetes clusters in Container Service Extension.
CAPVCD v1.1.1 introduced changes to resolve the issue, as outlined in the release notes:

Cluster API Provider for VMware Cloud Director Release v1.1.1

To ensure that Container Service Extension is using CAPVCD v1.1.1 or later when creating Kubernetes clusters, take the following steps:

  1. Log into the Cloud Director Provider portal as a System Administrator.
  2. Open the Kubernetes Container Clusters plugin and open the CSE Management, Server Details tab.
  3. Under Component Versions confirm that CAPVCD is set to v1.1.1 or higher.
  4. To change the CAPVCD version, follow the steps outlined in the Container Service Extension documentation:
Update Server Configuration
    Example steps would be as follows:
    1. Log into the Cloud Director Provider portal as a System Administrator.
    2. Open the Kubernetes Container Clusters plugin and open the CSE Management, Server Details tab.
    3. Click Update Server, select the Update Configuration option, and click Next.
    4. Under CSE Server Components increase the CAPVCD Version to the desired supported version, for example by typing v1.1.1.
    5. Click Submit Changes to apply the change and click Back to return to Server Details.
    6. Restart the existing VMware Cloud Director Container Service Extension Server vApp containing the Container Service Extension servers to apply the updated configuration.
    7. Deploy a new Kubernetes cluster and confirm that it is using CAPVCD v1.1.1 by opening it in the Kubernetes Container Clusters plugin and viewing the Overview, Kubernetes Resources section.
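When confirming the version, note that any CAPVCD release at or above v1.1.1 contains the fix. A minimal sketch of that version comparison is shown below; the version string assigned to capvcd_version is a placeholder, and the real value should be taken from the Kubernetes Container Clusters plugin as described in the steps above.

```shell
# Placeholder value; substitute the CAPVCD version reported by the
# Kubernetes Container Clusters plugin for the cluster being checked.
capvcd_version="v1.1.1"
required="v1.1.1"

# sort -V orders version strings numerically; if the required version is the
# lowest of the two, the reported version meets or exceeds the minimum.
lowest=$(printf '%s\n%s\n' "${capvcd_version#v}" "${required#v}" | sort -V | head -n1)
if [ "$lowest" = "${required#v}" ]; then
  result="ok"
  echo "CAPVCD $capvcd_version meets the $required minimum"
else
  result="affected"
  echo "CAPVCD $capvcd_version is affected; upgrade to $required or later"
fi
```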


Additional Information

Cluster API Provider for VMware Cloud Director Release v1.1.1
Cloud Director Container Service Extension, Update Server Configuration
Cloud Director Container Service Extension, Configuring vGPU on Tanzu Kubernetes Grid Clusters to allow Artificial Intelligence and Machine Learning Workloads