Unable to deploy Nvidia operator on the TKC: Failed to install CRD crds/nvidia.com_clusterpolicies.yaml

Article ID: 394839

Updated On:

Products

VMware vSphere Kubernetes Service
vSphere with Tanzu

Issue/Introduction

Deploying the NVIDIA Operator to run AI/ML workloads on a TKG Service cluster fails with the following error:

INSTALLATION FAILED: failed to install CRD crds/nvidia.com_clusterpolicies.yaml: Post "https://10.1x.x.x:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions?fieldManager=helm": http2: client connection lost

You may also see the following error on some NVIDIA Operator pods:

Error: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured.

Environment

vSphere with Tanzu 8.x

Cause

The NVIDIA Operator has multiple components that require additional Custom Resource Definitions (CRDs) for proper operation.
One of these components is the NVIDIA Container Toolkit, which provides the nvidia-container-runtime needed for the containers to run.

Deploying the NVIDIA Operator requires access to the online repositories from which the required images and CRDs are pulled.
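
As a rough way to check the repository access described above, the following Python sketch attempts a TCP connection to each repository endpoint and reports which ones respond. The host names in DEFAULT_ENDPOINTS (helm.ngc.nvidia.com and nvcr.io) and the function name check_endpoints are assumptions made for illustration and are not taken from this article; substitute the repositories your deployment actually uses. The sketch only tests TCP reachability from the machine it runs on, so it does not validate proxies, TLS, or credentials.

```python
import socket

# Assumed endpoints, not taken from this article: replace them with the
# repositories that your NVIDIA Operator deployment actually pulls from.
DEFAULT_ENDPOINTS = [
    ("helm.ngc.nvidia.com", 443),  # assumption: Helm chart repository
    ("nvcr.io", 443),              # assumption: container image registry
]


def check_endpoints(endpoints, timeout=5.0, connect=socket.create_connection):
    """Return a dict mapping "host:port" to True if a TCP connection succeeded.

    `connect` is injectable so the function can be exercised without a network.
    """
    results = {}
    for host, port in endpoints:
        key = f"{host}:{port}"
        try:
            conn = connect((host, port), timeout=timeout)
            conn.close()
            results[key] = True
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            results[key] = False
    return results
```

Calling check_endpoints(DEFAULT_ENDPOINTS) returns a dictionary keyed by "host:port". An entry with the value False suggests that the machine running the check cannot reach that repository. Image pulls happen on the cluster nodes, so a result from a workstation does not necessarily reflect what the nodes can reach.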

Resolution

Additional Information