NVIDIA gpu-operator pods not starting in an air-gapped environment with VKr 1.33 (or higher), showing the error "failed to get sandbox image "registry.k8s.io/pause:3.10": failed to pull image "registry.k8s.io/pause:3.10""

Article ID: 429604

Products

VMware vSphere Kubernetes Service

Issue/Introduction

NVIDIA gpu-operator pods do not start in an air-gapped environment with VKr 1.33 or higher. Several pods remain stuck in the Init state.

Note: The same setup works in VKr 1.32.

kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-####### 0/1 Init:0/1 0 53m
gpu-operator-cc6f6f8c7-####### 1/1 Running 0 60m
gpu-operator-node-feature-discovery-gc-5858556f6f-####### 1/1 Running 0 60m
gpu-operator-node-feature-discovery-master-7854d87bcf-####### 1/1 Running 0 60m
gpu-operator-node-feature-discovery-worker-####### 1/1 Running 0 60m
gpu-operator-node-feature-discovery-worker-####### 1/1 Running 0 60m
gpu-operator-node-feature-discovery-worker-####### 1/1 Running 0 60m
gpu-operator-node-feature-discovery-worker-####### 1/1 Running 0 60m
gpu-operator-node-feature-discovery-worker-####### 1/1 Running 0 60m
nvidia-container-toolkit-daemonset-####### 1/1 Running 0 59m
nvidia-dcgm-exporter-####### 0/1 Init:0/1 0 52m
nvidia-device-plugin-daemonset-####### 0/1 Init:0/1 0 52m
nvidia-driver-daemonset-####### 1/1 Running 0 60m
nvidia-operator-validator-####### 0/1 Init:0/4 0 52m

kubectl describe pod gpu-feature-discovery-####### -n gpu-operator

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 53m default-scheduler Successfully assigned gpu-operator/gpu-feature-discovery-xm629 to #######-gpu-worker-7bnrh-scg2g-lbfdr
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "#######": failed to get sandbox image "registry.k8s.io/pause:3.10": failed to pull image "registry.k8s.io/pause:3.10": failed to pull and unpack image "registry.k8s.io/pause:3.10": failed to resolve reference "registry.k8s.io/pause:3.10": failed to do request: Head "https://registry.k8s.io/v2/pause/manifests/3.10": dial tcp: lookup registry.k8s.io on #######:53: server misbehaving
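
To narrow down which pods are affected, the pod listing can be filtered with a small shell helper. The following is a minimal sketch, assuming a POSIX shell with awk and kubectl access to the cluster; the function names stuck_init_pods and sandbox_failures are hypothetical and are not part of kubectl or the GPU operator.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# names of pods whose STATUS column (third field) starts with "Init:".
stuck_init_pods() {
  awk 'NR > 1 && $3 ~ /^Init:/ { print $1 }'
}

# Hypothetical helper: list sandbox-creation failure events in a namespace.
# Defined only; call it from a machine with access to the cluster.
sandbox_failures() {
  kubectl get events -n "$1" --field-selector reason=FailedCreatePodSandBox
}

# Usage:
#   kubectl get pods -n gpu-operator | stuck_init_pods
#   sandbox_failures gpu-operator
```

If the events returned reference registry.k8s.io/pause:3.10, the node is most likely affected by the issue described in this article.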

Environment

vSphere 8.0U3

Cause

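When containerd merges configuration files of version 2 and version 3, the sandbox_image field from the version 2 configuration is not converted and is therefore ignored. As a result, containerd falls back to its default pause image, registry.k8s.io/pause:3.10, which cannot be pulled in an air-gapped environment.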
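
To see which configuration version a given file declares on an affected node, the file can be inspected directly. The sketch below assumes shell access to the worker node and a POSIX shell; the helper names are hypothetical, and the location of any additional config files that containerd merges is not stated in this article.

```shell
# Hypothetical helper: print the first top-level "version = N" value found
# in a containerd config file.
containerd_config_version() {
  sed -n 's/^[[:space:]]*version[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: print (with line numbers) every line that mentions
# the sandbox image, whatever the exact key name is.
sandbox_image_lines() {
  grep -n 'sandbox' "$1"
}

# Usage on the node:
#   containerd_config_version /etc/containerd/config.toml
#   sandbox_image_lines /etc/containerd/config.toml
```

Running both helpers against each config file in use shows which file declares which version and where the sandbox image is set.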

Resolution

This issue will be fixed in a future VKr release.

Workaround for VKr 1.33 and higher (apply before installing the GPU operator):

Set the RUNTIME_CONFIG_SOURCE environment variable for the NVIDIA container toolkit to file=/etc/containerd/config.toml. This resolves the config version mismatch and stops containerd from trying to pull the default registry.k8s.io/pause:3.10 image.

Pass the following parameters to helm install to apply this change:

helm install [RELEASE_NAME] [CHART] \
--set 'toolkit.env[0].name=RUNTIME_CONFIG_SOURCE' \
--set 'toolkit.env[0].value=file=/etc/containerd/config.toml'
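
The same settings can also be kept in a values file, which avoids shell quoting issues with the square brackets in --set. This is a sketch, assuming the chart reads toolkit.env as a list of name/value pairs (which is what the --set paths above imply). The file name gpu-operator-values.yaml and the helper toolkit_runtime_config_source are hypothetical, and the daemonset name is inferred from the pod listing in this article.

```shell
# Write a values file equivalent to the two --set flags.
# The file name is an arbitrary choice for this example.
cat > gpu-operator-values.yaml <<'EOF'
toolkit:
  env:
    - name: RUNTIME_CONFIG_SOURCE
      value: "file=/etc/containerd/config.toml"
EOF

# Hypothetical helper: after installation, print the value of
# RUNTIME_CONFIG_SOURCE set on the toolkit daemonset.
# Defined only; calling it requires access to the cluster.
toolkit_runtime_config_source() {
  kubectl get daemonset nvidia-container-toolkit-daemonset -n gpu-operator \
    -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="RUNTIME_CONFIG_SOURCE")].value}'
}

# Usage:
#   helm install [RELEASE_NAME] [CHART] -f gpu-operator-values.yaml
#   toolkit_runtime_config_source
```

If the helper prints file=/etc/containerd/config.toml, the workaround has been applied to the toolkit daemonset.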
