Symptoms:
DNS resolution is suddenly timed out from the pod in TKGi, output from inside a pod:
pod-test:~# nslookup api.test.cluster.example.com <---------Replace with your cluster API server FQDN
;; communications error to 1#.1##.2##.2#53: timed out pod-test:~# digapi.test.cluster.example.com <---------Replace with your cluster API server FQDN
;; communications error to 1#.1##.2##.2#53: timed out pod-test:~# nslookup www.google.com <---------FQDN of reachable website
;; communications error to 1#.1##.2##.2#53: timed out
coredns service and pods are used for the DNS resolution as per resolv.conf inside the pod
pod-test:~# cat /etc/resolv.conf search default.svc.cluster.local svc.cluster.local cluster.local nameserver ##.###.###.# <-------IP of DNS server options ndots:5
Even if the DNS servers configured are up and running and reachable from the pod, a pod will try to use the coredns service first
Review the health of coredns pods and service.
1) Check if the coredns service exists
$ kubectl get svc --namespace=kube-system NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE antrea ClusterIP 1#.1##.2##.# <none> 443/TCP 6h24m kube-dns ClusterIP 1#.1##.2##.# <none> 53/UDP,53/TCP,9153/TCP 6h16m metrics-server ClusterIP 1#.1##.2##.1## <none> 443/TCP 6h16m
2) Check if the coredns pods are running
$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns NAME READY STATUS RESTARTS AGE coredns-6f5c7f675f-qbh8t 1/1 Running 0 29m
3) Check the logs inside coredns pods
$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns .:53 [INFO] plugin/reload: Running configuration MD5 = 1d534941ad8884bb215680f48f8f5d2c CoreDNS-1.8.6 linux/amd64, go1.19.5, v1.8.6+vmware.17
If the coredns pods are missing for a specifc TKGi cluster, you can re-push them by running the errand apply-addon with Bosh, where UUID can be retrieved from tkgi clusters command
bosh -d service-instance_UUID run-errand apply-addons