[Troubleshooting] Containers in a TKGI cluster cannot resolve internal BOSH DNS FQDNs of services colocated in the environment

Article ID: 298723


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

This article covers how to debug Kubernetes pods whose container images cannot resolve the internal BOSH DNS FQDNs of services running in a TKGI-based environment.

In an environment where TKGI is installed alongside external service tiles (for example, VMware Tanzu RabbitMQ for VMs), some Kubernetes pods cannot resolve internal BOSH DNS FQDNs (for example, q-s0.rabbitmq-server.service-network.service-instance-<ID>.bosh).

The steps in this article help establish that BOSH DNS is not the cause of the DNS resolution failures.


Symptoms:

The following terminal output shows one scenario in which nslookup, run from a pod that uses the busybox container image, fails to resolve the internal BOSH FQDN of a RabbitMQ server.

> kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh
If you don't see a command prompt, try pressing enter.

/ # nslookup q-s0.rabbitmq-server.service-network.service-instance-<ID>.bosh
Server: 10.100.200.10
Address: 10.100.200.10:53

*** Can't find q-s0.rabbitmq-server.service-network.service-instance-<ID>.bosh: No answer

*** Can't find q-s0.rabbitmq-server.service-network.service-instance-<ID>.bosh: No answer 

 

Environment

Product Version: 1.11
OS: Linux

Cause

Based on the details in the Symptoms and Checklist sections, the issue reported to the support team was traced to the nslookup utility packaged with the busybox container image, not to BOSH DNS. Other container images may exhibit similar behavior.

Resolution

The example in this article shows that the nslookup utility packaged with the busybox image was the problem: other utilities in the same image, such as ping, resolved the internal BOSH DNS FQDNs successfully.

Container images must be validated before use. It is outside the scope of support to validate/test the behavior of packaged utilities in the public container images. 

If you see behavior different from what is described in this article, open a new case with the Tanzu support team.


Checklist:

Make sure the worker node(s) in the TKGI cluster can resolve the internal BOSH FQDN(s). For example, from a worker node:

$ nslookup q-s0.rabbitmq-server.service-network.service-instance-<ID>.bosh
Server:         169.254.0.2
Address:        169.254.0.2#53

Name:   q-s0.rabbitmq-server.service-network.service-instance-<ID>.bosh
Address: 10.47.110.187 

 

Test with different container images to check whether the behavior is the same across all of them. For example, when this issue was reported to the support team, we followed the instructions in the open-source Kubernetes documentation for debugging DNS resolution, which also describes how to set up the dnsutils pod:
 

* kubectl exec -i -t dnsutils -- nslookup kubernetes.default
* kubectl exec -i -t dnsutils -- nslookup q-s0.rabbitmq-server.service-network.service-instance-<ID>.bosh 
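
The dnsutils pod used in the commands above comes from the upstream Kubernetes DNS debugging guide. As a minimal sketch, assuming the example manifest is still published at the URL below (taken from the Kubernetes documentation, not from this article) and that kubectl is pointed at the affected cluster, the setup could be wrapped as follows. The function name setup_dnsutils and the variable DNSUTILS_MANIFEST are hypothetical.

```shell
# Assumption: manifest URL from the upstream Kubernetes DNS debugging guide.
DNSUTILS_MANIFEST="https://k8s.io/examples/admin/dns/dnsutils.yaml"

# Create the dnsutils pod, wait until it is Ready, then print the
# resolver configuration the pod actually received.
setup_dnsutils() {
  kubectl apply -f "$DNSUTILS_MANIFEST" &&
  kubectl wait --for=condition=Ready pod/dnsutils --timeout=120s &&
  kubectl exec dnsutils -- cat /etc/resolv.conf
}
```

Running setup_dnsutils once before the nslookup commands above should leave a pod named dnsutils in the default namespace. The nameserver printed from /etc/resolv.conf is the address the pod's lookups are sent to, which is useful to compare with the Server line in the nslookup output.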

 

Use a different utility (for example, ping) in the problematic container image to see whether it can resolve the internal BOSH DNS FQDNs. When this issue was reported to the support team, we found that, in the busybox container image, ping resolved the FQDN successfully while nslookup did not. Replace <POD_NAME> with the name of the pod that runs the problematic image:

kubectl exec -i -t <POD_NAME> -- ping q-s0.rabbitmq-server.service-network.service-instance-<ID>.bosh

 

You can also check whether public hostnames can be resolved from the container image. Try to identify one utility that fails and one that succeeds for the same name (for example, nslookup and ping). If at least one utility resolves the name, the DNS servers are answering and the failure is specific to the other utility.
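
The comparison described above can be sketched as a small POSIX shell helper that is pasted into a shell inside the container. This is an illustration rather than a supported tool: the function names check_tool and compare_resolvers are hypothetical, and it relies on each utility's exit status, which varies between implementations (some nslookup builds may exit 0 even when they print "Can't find"), so read the utility's own output before drawing conclusions. A failing ping can also mean that ICMP is blocked rather than that resolution failed.

```shell
# Run one utility quietly and report whether it succeeded.
# $1 = label to print, remaining arguments = the command to run.
check_tool() {
  label=$1
  shift
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$label: not installed"
  elif "$@" >/dev/null 2>&1; then
    echo "$label: resolved"
  else
    echo "$label: FAILED"
  fi
}

# Try the same name with several utilities so the results can be compared.
# $1 = hostname to resolve.
compare_resolvers() {
  name=$1
  check_tool nslookup nslookup "$name"
  check_tool getent getent hosts "$name"
  # One echo request with a 2 second timeout; ping must resolve the name first.
  check_tool ping ping -c 1 -W 2 "$name"
}
```

For example, running compare_resolvers 'q-s0.rabbitmq-server.service-network.service-instance-<ID>.bosh' (quoted, with the real instance ID substituted), then again with kubernetes.default and with a public hostname, gives a quick view of which utilities fail for which kind of name.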