NVIDIA gpu-operator pods not starting in air-gapped environment with VKr 1.33 (or higher) showing error "failed to get sandbox image "registry.k8s.io/pause:3.10": failed to pull image "registry.k8s.io/pause:3.10": "

Article ID: 429604


Products

VMware vSphere Kubernetes Service

Issue/Introduction

NVIDIA gpu-operator pods do not start in an air-gapped environment with VKr 1.33 or higher.

Note: This was working in VKr 1.32.

 

kubectl get pods -n gpu-operator
NAME                                                            READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-#######                                   0/1     Init:0/1   0          53m
gpu-operator-cc6f6f8c7-#######                                  1/1     Running    0          60m
gpu-operator-node-feature-discovery-gc-5858556f6f-#######       1/1     Running    0          60m
gpu-operator-node-feature-discovery-master-7854d87bcf-#######   1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
nvidia-container-toolkit-daemonset-#######                      1/1     Running    0          59m
nvidia-dcgm-exporter-#######                                    0/1     Init:0/1   0          52m
nvidia-device-plugin-daemonset-#######                          0/1     Init:0/1   0          52m
nvidia-driver-daemonset-#######                                 1/1     Running    0          60m
nvidia-operator-validator-#######                               0/1     Init:0/4   0          52m

 

kubectl describe pod gpu-feature-discovery-####### -n gpu-operator

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 53m default-scheduler Successfully assigned gpu-operator/gpu-feature-discovery-xm629 to #######-gpu-worker-7bnrh-scg2g-lbfdr
  Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "#######": failed to get sandbox image "registry.k8s.io/pause:3.10": failed to pull image "registry.k8s.io/pause:3.10": failed to pull and unpack image "registry.k8s.io/pause:3.10": failed to resolve reference "registry.k8s.io/pause:3.10": failed to do request: Head "https://registry.k8s.io/v2/pause/manifests/3.10": dial tcp: lookup registry.k8s.io on #######:53: server misbehaving

Environment

vSphere 8.0U3

Cause

When containerd merges a version 2 config with a version 3 config, the sandbox_image field from the version 2 config is not converted and is therefore ignored, causing containerd to fall back to the default pause image (registry.k8s.io/pause:3.10), which cannot be pulled in an air-gapped environment.
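As an illustration of the mismatch, the sandbox image is declared in different places in the two config schema versions, so a value set in the version 2 layout is not carried over when the configs are merged. This is a sketch only; the registry shown is a hypothetical air-gapped mirror, and the exact stanzas can vary by containerd release:

```toml
# containerd config version 2 (/etc/containerd/config.toml)
# sandbox_image lives under the CRI plugin section:
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # hypothetical local mirror; replace with your own registry
  sandbox_image = "registry.example.internal/pause:3.10"

# In the version 3 schema the CRI plugin was split, and the
# equivalent setting moved (shown commented out for contrast):
#
# version = 3
#
# [plugins."io.containerd.cri.v1.images".pinned_images]
#   sandbox = "registry.example.internal/pause:3.10"
```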

Resolution

This issue will be fixed in a future VKr release.

 

Workaround for VKr 1.33 and higher (apply before installing the GPU operator):

 

Configure the environment variable RUNTIME_CONFIG_SOURCE on the NVIDIA container toolkit so that it reads /etc/containerd/config.toml directly. This works around the config version mismatch and effectively overrides the default pull of the registry.k8s.io/pause:3.10 image.

 

Use the following parameters to implement this change.

helm install [RELEASE_NAME] [CHART] \
  --set "toolkit.env[0].name=RUNTIME_CONFIG_SOURCE" \
  --set "toolkit.env[0].value=file=/etc/containerd/config.toml"
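Equivalently, the same setting can be carried in a values file and passed with -f, which is easier to keep under version control. This is a sketch; [RELEASE_NAME] and [CHART] remain placeholders as above:

```yaml
# values.yaml -- equivalent to the --set flags above
toolkit:
  env:
    - name: RUNTIME_CONFIG_SOURCE
      value: file=/etc/containerd/config.toml
```

Then install with: helm install [RELEASE_NAME] [CHART] -f values.yaml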