Unable to Create cluster in PKS: pre-start scripts failed related to DNS resolution.

Article ID: 316829

Products

VMware Cloud PKS

Issue/Introduction

Symptoms:
  • BOSH cluster creation tasks fail with pre-start script messages similar to the following:
Task 1622 | 17:15:17 | Updating instance master: master/61c97315-7d6a-40fb-9ff9-69cb76bb776e (0) (canary) (00:02:10)
                     L Error: Action Failed get_task: Task d22e5fef-ee12-4951-58fa-bdbf205a42e2 result: 2 of 7 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm, pks-nsx-t-ncp. Successful Jobs: etcd, bpm, bosh-dns, syslog_forwarder, ncp.
Task 1622 | 17:17:27 | Error: Action Failed get_task: Task d22e5fef-ee12-4951-58fa-bdbf205a42e2 result: 2 of 7 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm, pks-nsx-t-ncp. Successful Jobs: etcd, bpm, bosh-dns, syslog_forwarder, ncp.
 
  • When accessing the cluster, you see that a master or worker has failed to start.
  • After gathering the BOSH deployment logs, you see messages similar to the following in /<service-instanceID>/pks-nsx-t-prepare-master-vm/pre-start.stdout.log:
Registering client certificate Get https://<NSX-MANAGER-FQDN>/api/v1/trust-management/principal-identities: dial tcp: lookup <NSX-MANAGER-FQDN> on <DNS-SERVER-IP>:53: read udp <BOSH-VM-IP>:59503-><DNS-SERVER-IP>:53: i/o timeout
  • You see messages similar to the following in the deployment logs under /<service-instanceID>/bosh-dns/bosh_dns.stdout.log:
[FailoverRecursor] 2020/08/21 19:19:27 INFO - shifting recursor preference: <DNS-SERVER-IP>
[ForwardHandler] 2020/08/21 19:19:27 DEBUG - error recursing to "<DNS-SERVER-IP>:53": read udp <BOSH-VM-IP>:59882-><DNS-SERVER-IP>:53: i/o timeout
[FailoverRecursor] 2020/08/21 19:19:27 INFO - shifting recursor preference: <DNS-SERVER-IP>:53
[ForwardHandler] 2020/08/21 19:19:29 DEBUG - error recursing to "<DNS-SERVER-IP>": read udp <BOSH-VM-IP>:54326-><DNS-SERVER-IP>: i/o timeout
[ForwardHandler] 2020/08/21 19:19:29 INFO - handlers.ForwardHandler Request [1] [<DNS-SERVER-FQDN>.] 2 [no response from recursors] 4000914000ns
[ForwardHandler] 2020/08/21 19:19:29 DEBUG - error recursing to "<DNS-SERVER-IP>": read udp <BOSH-VM-IP>:43965-><DNS-SERVER-IP>: i/o timeout
[ForwardHandler] 2020/08/21 19:19:29 INFO - handlers.ForwardHandler Request [28] [<DNS-SERVER-FQDN>.] 2 [no response from recursors] 4001378000ns

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will differ in your environment.
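
To gauge how often each recursor is timing out, the bosh_dns.stdout.log messages shown above can be tallied. The following is a minimal sketch, not a supported tool: the function name summarize_bosh_dns_log is hypothetical, the patterns are based only on the example lines in this article, and the sample addresses (192.0.2.x, nsx.example.com) are placeholders.

```python
import re
from collections import Counter

# Matches ForwardHandler lines such as:
#   ... DEBUG - error recursing to "<ip>:53": read udp ...: i/o timeout
# The pattern is an assumption derived from the excerpts in this article.
RECURSE_ERROR = re.compile(r'error recursing to "([^"]+)":.*i/o timeout')
NO_RESPONSE = "no response from recursors"


def summarize_bosh_dns_log(lines):
    """Return (timeouts per recursor, number of requests no recursor answered)."""
    timeouts = Counter()
    unanswered = 0
    for line in lines:
        match = RECURSE_ERROR.search(line)
        if match:
            recursor = match.group(1)
            # The log writes the same recursor with and without ":53".
            if recursor.endswith(":53"):
                recursor = recursor[:-3]
            timeouts[recursor] += 1
        elif NO_RESPONSE in line:
            unanswered += 1
    return timeouts, unanswered


# Placeholder lines modelled on the excerpt above.
sample = [
    '[ForwardHandler] 2020/08/21 19:19:27 DEBUG - error recursing to "192.0.2.53:53": '
    'read udp 192.0.2.10:59882->192.0.2.53:53: i/o timeout',
    '[ForwardHandler] 2020/08/21 19:19:29 DEBUG - error recursing to "192.0.2.53": '
    'read udp 192.0.2.10:54326->192.0.2.53: i/o timeout',
    '[ForwardHandler] 2020/08/21 19:19:29 INFO - handlers.ForwardHandler Request [1] '
    '[nsx.example.com.] 2 [no response from recursors] 4000914000ns',
]
timeouts, unanswered = summarize_bosh_dns_log(sample)
print(dict(timeouts), unanswered)  # {'192.0.2.53': 2} 1
```

If every configured recursor shows timeouts, the problem is more likely in the network path (for example, missing SNAT rules or a firewall) than in a single DNS server.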


Environment

VMware PKS 1.x

Cause

  • This issue can be caused by a DNS resolution timeout. Further investigation of the customer-side infrastructure and DNS servers is needed to identify why name resolution is timing out.
  • When BOSH DNS looks up a name through a recursor (one of the DNS servers configured in the BOSH Director tile), it must receive an answer within a timeout. In PKS this value is typically set in Operations Manager > Director > BOSH DNS > Recursor Timeout. The default is typically 5 seconds per recursor.
Note: This issue has been seen when an NSX-T edge node's password has expired and the edge nodes are in an unresponsive state. In this scenario, cluster provisioning starts but NSX-T does not put the SNAT rules in place, so traffic leaves with the internal source IPs, causing DNS resolution timeouts and failures.

Resolution

  • Identify the networking/configuration issue that is causing slow DNS name resolution in the environment, either in the physical environment or the NSX-T infrastructure.
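
One way to narrow this down is to query a recursor directly over UDP and measure how long it takes to answer, which mirrors the "read udp ... i/o timeout" errors in the logs. The following is a rough sketch using only the Python standard library. The function names build_query and probe_recursor are hypothetical, and it assumes Python 3 is available on a host in the same network as the cluster VMs.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is prefixed with its length; the name ends with a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Question tail: QTYPE=A (1), QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def probe_recursor(server, name, timeout=5.0, port=53):
    """Send one query over UDP and return (answered, elapsed_seconds)."""
    query = build_query(name)
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
            # Only count a reply that echoes our query ID.
            answered = reply[:2] == query[:2]
        except OSError:  # socket.timeout is a subclass of OSError
            answered = False
    return answered, time.monotonic() - start
```

For example, probe_recursor("<DNS-SERVER-IP>", "<NSX-MANAGER-FQDN>", timeout=5.0) returns whether the server answered and how long the attempt took. An elapsed time close to or above the configured Recursor Timeout, or no answer at all, points to the DNS path rather than to PKS itself.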


Workaround:
  • If an environmental factor such as a firewall is slowing DNS resolution and name resolution cannot be sped up, try increasing the Recursor Timeout in PKS at: Operations Manager > Director > BOSH DNS > Recursor Timeout