Deploying a TKG 2.5.4 management cluster: control plane's kubelet is stuck activating and containerd is unable to PullImage with error 'failed to pull and unpack image "<imageregistry>/etcd:v3.5.16_vmware.3": failed to resolve reference'

Article ID: 422256

Updated On:

Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

When deploying a TKG 2.5.4 management cluster, the control plane node is up but kubelet is stuck activating and containerd is unable to PullImage, failing with: failed to pull and unpack image "<imageregistry>/etcd:v3.5.16_vmware.3": failed to resolve reference "<imageregistry>/etcd:v3.5.16_vmware.3".

After SSHing to the stuck management cluster control plane node, running 'crictl ps -a' shows no containers, and none in a Running state. Running 'crictl images' shows that the image registry's images are not present.
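For example, assuming the default capv node user on a vSphere deployment (the node IP below is a placeholder), the container runtime state can be inspected directly on the node:

ssh capv@<CONTROL_PLANE_NODE_IP>
sudo crictl ps -a
sudo crictl images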

The following entries are observed in the kubelet and containerd journalctl logs.
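These can be viewed on the node with, for example:

journalctl -xeu kubelet
journalctl -xeu containerd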

kubelet

Dec 12 09:27:20 mgmt-cluster-controlplane-##### kubelet[1747]: Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
Dec 12 09:27:20 mgmt-cluster-controlplane-##### kubelet[1747]: I1212 09:27:20.444226    1747 server.go:209] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the>
Dec 12 09:27:20 mgmt-cluster-controlplane-##### kubelet[1747]: E1212 09:27:20.444335    1747 run.go:74] "command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet>
Dec 12 09:27:20 mgmt-cluster-controlplane-##### systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE


containerd 

Dec 12 10:11:07 mgmt-cluster-controlplane-##### containerd[1088]: time="2025-12-12T10:11:07.500046759Z" level=error msg="PullImage \"<imageregistry>/etcd:v3.5.16_vmware.3\" failed" error="failed to pull and unpack image \"<imageregistry>/etcd:v3.5.16_vmware.3\": failed to resolve reference \"<imageregistry>/etcd:v3.5.16_vmware.3\": failed to do request: Head \"https://<imageregistry>/etcd/manifests/v3.5.16_vmware.3\": dial tcp: lookup <imageregistry> on 127.0.0.53:53: read udp 127.0.0.1:32982->127.0.0.53:53: i/o timeout"
Dec 12 10:11:07 mgmt-cluster-controlplane-##### containerd[1088]: time="2025-12-12T10:11:07.500114475Z" level=info msg="stop pulling image <imageregistry>/etcd:v3.5.16_vmware.3: active requests=0, bytes read=0"


When running 'nslookup <imageregistry> <DNS_IP>', a message similar to the following is seen:

;; communications error to ###.##.###.###53: timed out
;; no servers could be reached


When running the nslookup command against each of the DNS IPs configured (as reported by resolvectl), the timed out message is observed for both.
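For example, the configured DNS servers can be listed and tested individually (the registry FQDN and DNS IPs below are placeholders):

resolvectl status
nslookup <imageregistry> <DNS_IP_1>
nslookup <imageregistry> <DNS_IP_2>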

 

Environment

TKGm 2.5.4
AVI

Cause

The management cluster control plane is stuck creating because it is unable to pull images from the configured image registry. As a result, kubelet cannot be brought to a running state and the etcd image cannot be obtained to configure etcd.

The kubelet is in an activating state instead of active and running. The /var/lib/kubelet/config.yaml file cannot be obtained and therefore cannot be loaded. Other required images, such as etcd, also cannot be pulled or brought to a Running state.

PullImage fails because the configured DNS IPs are unable to resolve the image registry's FQDN, so the images required for the management cluster's control plane nodes to complete their bootstrap cannot be pulled.

Resolution

Have your network team investigate the communication issue between the management cluster's network and the configured DNS servers, and allow this traffic.
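Once DNS is reachable, name resolution and image pulls can be verified from the control plane node, for example (the registry FQDN and DNS IP are placeholders; the image tag is the one from the error above):

nslookup <imageregistry> <DNS_IP>
sudo crictl pull <imageregistry>/etcd:v3.5.16_vmware.3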

Then, if containerd on the management cluster control plane node still does not progress the PullImage, delete and recreate the management cluster.
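For example, assuming the management cluster was deployed with the Tanzu CLI, it can be removed and redeployed with commands similar to the following (the cluster name and configuration file path are placeholders):

tanzu management-cluster delete <MGMT_CLUSTER_NAME>
tanzu management-cluster create --file <path-to-cluster-config>.yaml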