NVIDIA gpu-operator pods not starting in air-gapped environment with VKr 1.33
Note: The same deployment worked in VKr 1.32.
kubectl get pods -n gpu-operator
NAME                                                            READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-#######                                   0/1     Init:0/1   0          53m
gpu-operator-cc6f6f8c7-#######                                  1/1     Running    0          60m
gpu-operator-node-feature-discovery-gc-5858556f6f-#######       1/1     Running    0          60m
gpu-operator-node-feature-discovery-master-7854d87bcf-#######   1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
gpu-operator-node-feature-discovery-worker-#######              1/1     Running    0          60m
nvidia-container-toolkit-daemonset-#######                      1/1     Running    0          59m
nvidia-dcgm-exporter-#######                                    0/1     Init:0/1   0          52m
nvidia-device-plugin-daemonset-#######                          0/1     Init:0/1   0          52m
nvidia-driver-daemonset-#######                                 1/1     Running    0          60m
nvidia-operator-validator-#######                               0/1     Init:0/4   0          52m
kubectl describe pod gpu-feature-discovery-####### -n gpu-operator
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               53m   default-scheduler  Successfully assigned gpu-operator/gpu-feature-discovery-xm629 to #######-gpu-worker-7bnrh-scg2g-lbfdr
  Warning  FailedCreatePodSandBox  53m   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "#######": failed to get sandbox image "registry.k8s.io/pause:3.10": failed to pull image "registry.k8s.io/pause:3.10": failed to pull and unpack image "registry.k8s.io/pause:3.10": failed to resolve reference "registry.k8s.io/pause:3.10": failed to do request: Head "https://registry.k8s.io/v2/pause/manifests/3.10": dial tcp: lookup registry.k8s.io on #######:53: server misbehaving
vSphere 8.0U3
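When containerd merges the version 2 and version 3 configuration files, the sandbox_image field from the version 2 config is not converted and is therefore ignored. As a result, containerd falls back to the default pause image, registry.k8s.io/pause:3.10, which cannot be pulled in an air-gapped environment.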
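To confirm this on an affected node, it can help to compare the sandbox image written in the config file with the one containerd actually uses. The following is a minimal sketch, not an official procedure: the helper name sandbox_image_from_config is hypothetical, and it assumes the file uses the version 2 style `sandbox_image = "..."` line and that you have shell access to the worker node.

```shell
# Hypothetical helper: print every sandbox_image value found in a
# containerd config file (version 2 style key, assumed here).
sandbox_image_from_config() {
  sed -n 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1"
}

# Example usage on the worker node (commented out; needs node access):
#   sandbox_image_from_config /etc/containerd/config.toml
#   containerd config dump | grep -i sandbox
```

If the first command prints your private registry's pause image while the merged configuration shown by `containerd config dump` still points at registry.k8s.io/pause:3.10, the node is most likely hitting the behaviour described above.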
This issue will be fixed in a future VKr release.
Workaround for VKr 1.33 and higher (apply before installing the GPU operator):
Set the environment variable RUNTIME_CONFIG_SOURCE for the container toolkit so that it points to /etc/containerd/config.toml. This resolves the config version mismatch and stops containerd from falling back to downloading the default registry.k8s.io/pause:3.10 image.
Pass the following parameters to helm install to apply this change.
helm install [RELEASE_NAME] [CHART] \
  --set toolkit.env[0].name=RUNTIME_CONFIG_SOURCE \
  --set toolkit.env[0].value=file=/etc/containerd/config.toml
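After the install, one way to check whether the workaround took effect is to list any pods in the gpu-operator namespace that are still not Running or Completed. This is a minimal sketch under stated assumptions: the helper name not_ready_pods is hypothetical, and it only parses the default `kubectl get pods` column layout (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the name and status of every pod that is not Running or Completed.
not_ready_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example usage (commented out; needs cluster access):
#   kubectl get pods -n gpu-operator | not_ready_pods
```

An empty result suggests that the pods previously stuck in Init have started. If pods are still listed, `kubectl describe pod` on one of them should show whether the FailedCreatePodSandBox event for registry.k8s.io/pause:3.10 is still occurring.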