Tanzu Kubernetes Grid management cluster creation fails with cert-manager pods stuck in a pending state

Article ID: 317455


Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:
  • Tanzu Kubernetes Grid (TKG) 1.2.1 deployment is stuck at step 4/8
  • You are deploying TKG in an air-gapped environment
  • When you run kubectl -n cert-manager get po you see that all pods are in a pending state
  • When you run kubectl -n cert-manager describe po <pod name> you see messages similar to the following:
0/X nodes are available: X node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate
  • You see messages similar to the following in the output from the tkg init command:

I0120 12:17:46.445757 cert_manager.go:453] Waiting for cert-manager to be available...
I0120 12:17:46.456910 cert_manager.go:419] Updating Namespace="cert-manager-test"
I0120 12:17:46.603389 cert_manager.go:411] Creating Issuer="test-selfsigned" Namespace="cert-manager-test" 

  • When you log on to one of the control plane nodes and check the states of the nodes in the cluster (kubectl --kubeconfig /etc/kubernetes/admin.conf get no -o wide), you see that there are no worker nodes joined yet.
  • When you log on to one of the worker nodes, you see messages similar to the following in journalctl output, indicating an issue with name resolution.
Feb 09 07:22:15 tkg-kind-c0772pe440qsvjjovfbg-control-plane kubelet[733]: E0125 07:22:15.698512     733 remote_image.go:113] PullImage "registry.domain.local/newapp/cert-manager/cert-manager-controller:v0.16.1_vmware.1" from image service failed: rpc error: code = Unknown desc = failed to pull and unpack image "registry.domain.local/newapp/cert-manager/cert-manager-controller:v0.16.1_vmware.1": failed to resolve reference
  • When you log on to one of the worker nodes and check the contents of the /etc/resolv.conf file, you don't see the proper nameserver entries. When you check the contents of the same file on a control plane node, it has the proper nameserver entries.
  • You have used a ytt overlay file at ~/.tkg/providers/infrastructure-vsphere/ytt/nameserver.yml to specify the nameserver entries in the /etc/resolv.conf file.
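To confirm this symptom, compare the resolver configuration on a control plane node and a worker node. The following is a minimal sketch; the check_resolv helper and the sample file are illustrative only (on a real node you would point it at /etc/resolv.conf):

```shell
# check_resolv: report whether a resolv.conf file contains any nameserver entries.
# Run it against /etc/resolv.conf on each node to compare control plane vs. worker.
check_resolv() {
  if grep -q '^nameserver ' "$1"; then
    echo "ok: nameserver entries present in $1"
  else
    echo "missing: no nameserver entries in $1"
  fi
}

# Illustration against a sample file with the entries expected on a healthy node:
printf 'nameserver 192.168.63.10\ndomain billdesk.local\n' > /tmp/resolv.sample
check_resolv /tmp/resolv.sample
```

A worker node affected by this issue reports the "missing" case, which is why image pulls from the private registry fail with name-resolution errors.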


Environment

VMware Tanzu Kubernetes Grid Plus 1.x
VMware Tanzu Kubernetes Grid 1.x

Resolution

You can use a ytt overlay file similar to the following to ensure that the proper nameserver entries are added to the /etc/resolv.conf file on all nodes, including the worker nodes:

~/.tkg/providers/infrastructure-vsphere/ytt/nameserver.yml

#@overlay/match by=overlay.subset({"kind":"KubeadmControlPlane"})
---
spec:
  kubeadmConfigSpec:
    preKubeadmCommands:
    #! Add nameserver to all k8s nodes
    #@overlay/append
    - echo 'nameserver 192.168.63.10' >> /usr/lib/systemd/resolv.conf
    #! Add domain to all k8s nodes
    #@overlay/append
    - echo 'domain billdesk.local' >> /usr/lib/systemd/resolv.conf
    #! Remove the resolv.conf from all the K8s nodes
    #@overlay/append
    - rm /etc/resolv.conf
    #! Create a new link file
    #@overlay/append
    - ln -s /usr/lib/systemd/resolv.conf /etc/resolv.conf
#@overlay/match by=overlay.subset({"kind":"KubeadmConfigTemplate"})
---
spec:
  template:
    spec:
      preKubeadmCommands:
      #! Add nameserver to all k8s nodes
      #@overlay/append
      - echo 'nameserver 192.168.63.10' >> /usr/lib/systemd/resolv.conf
      #! Add domain to all k8s nodes
      #@overlay/append
      - echo 'domain billdesk.local' >> /usr/lib/systemd/resolv.conf
      #! Remove the resolv.conf from all the K8s nodes
      #@overlay/append
      - rm /etc/resolv.conf
      #! Create a new link file
      #@overlay/append
      - ln -s /usr/lib/systemd/resolv.conf /etc/resolv.conf
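The preKubeadmCommands above run on each node before kubeadm starts. As a sanity check, their net effect can be replayed in a scratch directory; in this sketch the paths are relocated under a temporary directory so nothing on the local machine is touched, and the nameserver/domain values are the examples used in this article:

```shell
# Replay the overlay's preKubeadmCommands in a sandbox to inspect the result.
# On a real node these commands act on /usr/lib/systemd/resolv.conf and /etc/resolv.conf.
TMP=$(mktemp -d)
mkdir -p "$TMP/usr/lib/systemd" "$TMP/etc"
echo 'nameserver 127.0.0.53' > "$TMP/etc/resolv.conf"   # stand-in for the node's original file

# The overlay's commands, with paths relocated under $TMP:
echo 'nameserver 192.168.63.10' >> "$TMP/usr/lib/systemd/resolv.conf"
echo 'domain billdesk.local' >> "$TMP/usr/lib/systemd/resolv.conf"
rm "$TMP/etc/resolv.conf"
ln -s "$TMP/usr/lib/systemd/resolv.conf" "$TMP/etc/resolv.conf"

# /etc/resolv.conf is now a symlink containing the desired entries:
cat "$TMP/etc/resolv.conf"
```

Because the overlay file sits under ~/.tkg/providers/infrastructure-vsphere/ytt/, it is picked up automatically the next time you run tkg init against vSphere; no extra flag is needed.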